Research on Multi-Label Chinese Question Classification
Published: 2018-09-13 14:56
[Abstract]: Community-based question answering (CQA) is a relatively new web application that has gradually gained public acceptance and wide use. Well-known systems include Sina iAsk (新浪爱问), Baidu Zhidao (百度知道), Yahoo Knowledge (雅虎知识堂) and Zhihu (知乎). What these systems share is that users can answer questions posed by others, submit their own questions for others to answer, and rate the answers they receive. A question-answering system is a vast store of knowledge: its content, the questions, has accumulated over years from every aspect of users' lives, so the questions are both numerous and wide-ranging. Given a user's question, a CQA system retrieves similar questions online, returns answers, and pushes related questions to the user; its ultimate goal is to return the question-answer pairs directly related to what the user asked. In short, this interactive mode does not merely return the answer to a single question; it returns a whole series of information related to the question. Question analysis, information retrieval and answer extraction are the three main components of a CQA system, so the key problems are: first, how question analysis can capture the real meaning of the user's question; second, how retrieval can find the information relevant to the question; and third, how answer extraction can accurately pull the answer out of that information.

Chinese questions have their own characteristics. They are short, usually no more than 160 characters, so their feature information is relatively sparse, which in turn causes problems such as a weak signal about the content they summarize and a high level of noise. Chinese questions in CQA also often contain irregular words or sentences, such as everyday colloquialisms, habitual abbreviations and altered word forms used online, which to some extent degrades traditional text preprocessing and text representation methods. Chinese questions are also polysemous, meaning here that a single question can belong to several categories at once. For example, the question "从买房到装修需都要注意哪些事" ("What should one pay attention to from buying a house through to decorating it?") belongs to both the "购房置业" (home purchase) category and the "家居装修" (home decoration) category. This thesis therefore studies the feature sparsity and polysemy of Chinese questions. Through in-depth study and repeated experiments, it obtains the following results on multi-label Chinese question classification:

(1) The thesis first uses the Wikipedia knowledge base, which is rich in associated information such as concepts and links, to build a set of related concepts for each word in a Chinese question. It then quantifies the semantic relations between concepts using the link relations between pages. The related-concept set obtained from Wikipedia for a word is used as that word's set of expansion feature words. The features of the question are then expanded through the semantic relations between words, and disambiguation is applied to select the appropriate concepts, which completes the feature expansion. This makes the question's description of its concepts more precise, enriches its semantic expression, and to some extent reduces the effect of feature sparsity on classification.

(2) The traditional ML-kNN algorithm does not adequately account for correlations between labels in multi-label Chinese question classification. Building on ML-kNN, the thesis proposes ML-CQC, a multi-label Chinese question classification algorithm that takes the correlation between category labels into account during classification. When ML-CQC uses the maximum a posteriori (MAP) principle to infer the categories of an unlabeled Chinese question, it also takes into account statistical information about the other categories in the question's neighborhood. On that basis, it then iterates ML-CQC using the correlations among the label results already obtained. Unlike ML-kNN, ML-CQC can make effective use of label correlation to improve classification performance, and experiments show that applying ML-CQC to feature-expanded Chinese questions is feasible and effective.

(3) Building on ML-CQC, the thesis further proposes the SML-CQC algorithm. Its core idea is to compute the ratio s of positive to negative examples for a category label and raise the prior probability of the corresponding examples to the power s, in order to reduce the misclassification caused by a label having too few positive examples.
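The MAP inference that contribution (2) builds on can be illustrated with a minimal sketch of baseline ML-kNN. This is not the thesis's ML-CQC or SML-CQC: the abstract does not specify the label-correlation iteration, so it is omitted. The class name `MLkNN`, the helper `knn_indices`, the Euclidean distance, the toy data and the `prior_exponent` hook are all assumptions made for illustration. The hook shows only one possible reading of "raising the prior to the power s", applied here to P(H1) alone and without renormalizing.

```python
from math import dist


def knn_indices(points, query, k, skip=None):
    """Indices of the k training points closest to `query` (Euclidean)."""
    order = sorted(
        (i for i in range(len(points)) if i != skip),
        key=lambda i: dist(points[i], query),
    )
    return order[:k]


class MLkNN:
    """Baseline ML-kNN. `prior_exponent` is a hypothetical hook, not the thesis's code."""

    def __init__(self, k=3, smooth=1.0, prior_exponent=None):
        self.k, self.smooth, self.prior_exponent = k, smooth, prior_exponent

    def fit(self, X, Y):
        m, n_labels, k, sm = len(X), len(Y[0]), self.k, self.smooth
        self.X, self.Y = X, Y
        # Smoothed prior P(H1) that a question carries label l.
        self.prior1 = [(sm + sum(y[l] for y in Y)) / (2 * sm + m)
                       for l in range(n_labels)]
        # hits[l][j]: training items WITH label l whose k neighbours include j items with label l.
        # miss[l][j]: the same count for training items WITHOUT label l.
        hits = [[0] * (k + 1) for _ in range(n_labels)]
        miss = [[0] * (k + 1) for _ in range(n_labels)]
        for i in range(m):
            nbrs = knn_indices(X, X[i], k, skip=i)
            for l in range(n_labels):
                j = sum(Y[n][l] for n in nbrs)
                (hits if Y[i][l] else miss)[l][j] += 1
        # Smoothed likelihoods P(E_j | H1) and P(E_j | H0).
        self.like1 = [[(sm + c) / (sm * (k + 1) + sum(row)) for c in row] for row in hits]
        self.like0 = [[(sm + c) / (sm * (k + 1) + sum(row)) for c in row] for row in miss]
        return self

    def predict(self, x):
        nbrs = knn_indices(self.X, x, self.k)
        labels = []
        for l, p1 in enumerate(self.prior1):
            p0 = 1.0 - p1
            if self.prior_exponent is not None:
                # Assumption: one reading of the "power s" idea; only P(H1) is changed.
                p1 = p1 ** self.prior_exponent[l]
            j = sum(self.Y[n][l] for n in nbrs)
            # MAP decision: assign label l when the H1 posterior score is larger.
            labels.append(int(p1 * self.like1[l][j] > p0 * self.like0[l][j]))
        return labels


# Toy data (made up): label 0 = "home purchase", label 1 = "home decoration".
X = [(0, 0), (0, 1), (1, 0), (1, 1),          # purchase-only questions
     (10, 10), (10, 11), (11, 10), (11, 11),  # decoration-only questions
     (5, 5), (5, 6), (6, 5)]                  # questions carrying both labels
Y = [[1, 0]] * 4 + [[0, 1]] * 4 + [[1, 1]] * 3
model = MLkNN(k=2).fit(X, Y)
```

On this toy data, `model.predict((5.3, 5.3))` returns `[1, 1]`, while a query near either pure cluster receives a single label. Each label is decided independently here, which is exactly the limitation of ML-kNN that the abstract says ML-CQC addresses.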
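[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.1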
Article ID: 2241502
Article link: https://www.wllwen.com/wenyilunwen/shinazhuanghuangshejilunwen/2241502.html