Cross-Lingual and Cross-Task Natural Language Analysis Based on Distributed Representations

Published: 2018-05-16 03:19

Topics: natural language processing + multilingual; Reference: Harbin Institute of Technology, 2017 doctoral dissertation

[Abstract]: Feature representation is fundamental to statistical machine learning and is one of the key factors determining the performance of a machine learning system. In statistical natural language processing (NLP), the most common feature representations are discrete symbolic representations, such as the one-hot representation of words and the bag-of-words representation of documents. Such representations are intuitive, concise, and easy to compute. Combined with feature engineering and traditional machine learning algorithms (such as maximum entropy, support vector machines, and conditional random fields), they can be applied effectively to most mainstream NLP tasks.

Another important feature representation mechanism is the distributed representation, typically a continuous, dense, low-dimensional vector. Examples include early latent semantic analysis (LSA) and the "feature embedding" methods that have been widely used in recent years. Distributed feature representations are now widely adopted in deep-learning-based NLP models. Compared with symbolic representations, they combine more naturally with deep neural networks, which have strong learning capacity, and through layer-by-layer abstraction in representation learning they yield high-level semantic representations better suited to specific tasks. This is also an effective way to bridge the semantic gap in NLP. More importantly, distributed representations provide a general-purpose semantic representation space that serves as a bridge for information exchange across different tasks, languages, and data modalities. This generality allows training information from multiple sources to be fused, which in turn enables knowledge transfer. For example, word distributed representations obtained by training neural network language models on unannotated raw text have been shown to effectively improve performance on most mainstream NLP tasks.

This thesis exploits these properties of distributed representations, in particular their generality as semantic representations, to study key techniques for knowledge transfer across languages, data types, and tasks. The main contributions are as follows:

1. Learning sense-level distributed representations from bilingual data. Previously proposed word-level distributed representations cannot capture polysemy. To address this, the thesis proposes to learn sense-level distributed representations from the word sense alignment information implicit in bilingual data. This captures word sense information more completely, and it can be combined with recurrent neural networks to perform word sense disambiguation on monolingual data, thereby serving downstream applications.

2. Cross-lingual dependency parsing based on distributed representations. For the vast majority of the world's natural languages, syntactically annotated resources are hard to obtain and manual annotation is costly. The thesis therefore proposes methods for learning multilingual distributed representations that place words from different languages in the same vector space, forming a bridge for transferring syntactic knowledge between languages. Syntactic resources from resource-rich languages (such as English) are then used to perform dependency parsing for resource-poor languages.

3. Transfer learning across multiple treebank types based on deep multi-task learning. Existing dependency treebanks are diverse: they come from different languages or follow different annotation schemes. The thesis proposes a deep multi-task learning architecture based on sharing distributed representations at multiple levels. It can effectively distill knowledge from source treebanks of different types (different languages, different annotation schemes) and thereby improve parsing accuracy on the target treebank.

4. A unified framework for semantic role labeling and relation classification. Different tasks often share commonalities. For example, semantic role labeling and (entity) relation classification both involve analyzing semantic relations within a sentence. The thesis proposes a unified deep neural network model that merges the two tasks and uses deep multi-task learning to improve performance on the target task.

In summary, this thesis exploits the generality of distributed representations as semantic representations and studies in depth their application to cross-lingual, cross-task, and cross-data-type learning, significantly improving performance on different tasks at the lexical, syntactic, and semantic levels. We expect these results to extend to more types of data and tasks, and even to cross-domain analysis, further advancing the field of natural language processing.
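
As a rough illustration of the contrast the abstract draws between symbolic and distributed representations, the following Python sketch builds one-hot and bag-of-words vectors and compares them with dense vectors in a shared bilingual space. The vocabulary, the vector values, and the names `one_hot`, `bag_of_words`, `cosine`, and `EMBEDDINGS` are invented for illustration and are not taken from the thesis; real embeddings would be learned from data rather than written by hand.

```python
import math
from collections import Counter

# Toy vocabulary (hypothetical, for illustration only).
VOCAB = ["bank", "river", "money", "loan"]


def one_hot(word, vocab=VOCAB):
    """Symbolic word representation: a single 1 at the word's index."""
    return [1 if w == word else 0 for w in vocab]


def bag_of_words(tokens, vocab=VOCAB):
    """Symbolic document representation: per-word counts, order ignored."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hand-written dense vectors standing in for learned embeddings.
# English and Chinese words are assumed to live in one shared space.
EMBEDDINGS = {
    "money": [0.9, 0.1, 0.0],
    "loan": [0.8, 0.2, 0.1],
    "river": [0.0, 0.1, 0.9],
    "钱": [0.85, 0.15, 0.05],  # Chinese word for "money"
}

if __name__ == "__main__":
    # Distinct one-hot vectors are always orthogonal: no notion of similarity.
    print(cosine(one_hot("money"), one_hot("loan")))
    # Dense vectors can place related words close together.
    print(cosine(EMBEDDINGS["money"], EMBEDDINGS["loan"]))
    # In a shared bilingual space, translations can be neighbours.
    print(cosine(EMBEDDINGS["钱"], EMBEDDINGS["money"]))
    print(cosine(EMBEDDINGS["钱"], EMBEDDINGS["river"]))
```

In this toy setting any two distinct one-hot vectors have zero similarity, whereas the dense vectors let related words, including words from different languages, sit close together. That shared geometry is the property the abstract describes as a bridge for transferring knowledge between languages and tasks.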

[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctoral
[Year degree conferred]: 2017
[Classification number]: TP391.1
