A Syntactic Parser Based on Word Collocation Knowledge and Grammatical Function Matching

Published: 2018-03-15 17:25

Topic: word collocation　Focus: knowledge base　Source: Nanjing Normal University, doctoral dissertation, 2013　Type: degree thesis

[Abstract]: Syntactic parsing is one of the core technologies of information processing, and also one of its hardest problems. This thesis argues that word collocations and syntactic structure are closely related, and that adding collocation knowledge to the parsing process helps improve parsing accuracy. Guided by the idea that a collocation base and a parser can be built from each other, the study brings in the parsers developed at Harbin Institute of Technology (HIT), Berkeley, and Stanford. By comparing the output of these three parsers, it proposes two methods for automatically acquiring collocations at large scale. The first method is based on dependency parsing and matches identical word pairs across the parse results. The second is based on phrase-structure parsing and matches identical levels across the parse results. Experiments show that both methods acquire large numbers of collocations effectively. The phrase-structure method obtains about 5 million collocation types from 14 years of Xinhua News Agency text, with a sampled collocation precision of about 84%. Using the automatically acquired collocations, the thesis selects four filtering conditions for choosing the better collocations, finds the combination that best balances collocation precision against collocation scale, and on that basis builds a collocation knowledge base with fourteen data fields and on the order of a million collocation types. The sampled collocation precision of the knowledge base exceeds 90%. Individual and correlation analyses of the fourteen data fields further reveal regularities and relationships among collocation attributes such as collocation type, collocation frequency, and collocation distance. With this large, high-quality collocation resource in place, the thesis adds collocation knowledge to a parsing algorithm based on grammatical function matching, producing a parser based on collocation knowledge and grammatical function matching (CGFM). Xinhua News Agency text serves as the open test corpus. In the evaluation of the parser on its own, CGFM reaches an F value of about 80% on the open test, and adding collocation knowledge improves the F value by up to nearly 4% over the same parser without it. In the comparative evaluation of the CGFM, HIT, Berkeley, and Stanford parsers, CGFM performs well and leads in both the phrase-structure evaluation and the dependency evaluation.
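The abstract's first acquisition method keeps word pairs that appear in the dependency output of every parser. A minimal sketch of that idea, together with the standard F-measure used in parser evaluation, might look like the following. The data representation (a parse as a set of (head, dependent) word pairs) and all function names are assumptions made for illustration; they are not taken from the thesis.

```python
from collections import Counter

# Assumption: each dependency parse of a sentence is modelled as a set of
# (head, dependent) word pairs. The thesis's real data format is not given.

def agreed_pairs(parses):
    """Return the word pairs produced by every parser for one sentence."""
    pair_sets = [set(p) for p in parses]
    return set.intersection(*pair_sets) if pair_sets else set()

def collect_collocations(corpus_parses):
    """Count agreed word pairs over a corpus.

    corpus_parses: iterable of per-sentence lists, one parse per parser.
    """
    counts = Counter()
    for parses in corpus_parses:
        counts.update(agreed_pairs(parses))
    return counts

def f_measure(predicted, gold):
    """Standard F value: harmonic mean of precision and recall."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy sentence parsed by three hypothetical parsers.
sentence = [
    {("eat", "apple"), ("apple", "red")},
    {("eat", "apple"), ("red", "apple")},
    {("eat", "apple")},
]
collocations = collect_collocations([sentence])
```

Only the pair all three parsers agree on, `("eat", "apple")`, is counted here. Requiring agreement trades coverage for precision, which matches the abstract's emphasis on sampled collocation precision.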
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Doctorate
[Year degree awarded]: 2013
[Classification number]: H087;H04

[References]