Chinese Text Plagiarism Detection Method Based on Secondary Feature Extraction

Published: 2018-03-24 21:35

Topic: plagiarism detection　Focus: text preprocessing　Source: Southwest University (西南大学), master's thesis, 2013

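【Abstract】: In recent years, with the rapid development of information technology and communication networks, people have shifted from obtaining information through physical media to obtaining it from online documents. This shift brings convenience, but it also has negative effects on daily life and on the development of the technology itself. Compared with traditional documents, electronic documents are much easier to copy illegally, and text plagiarism has become a serious problem in many fields, including academia and business. Research on text plagiarism detection is therefore important for maintaining normal teaching order in universities, protecting intellectual property, and curbing the spread of plagiarism. Among existing systems, the more effective ones include Siff, COPS, and the CNKI (中国知网) detection system, but they generally suffer from low detection accuracy.

Chinese text plagiarism detection proceeds in three steps. First, the text is preprocessed, which includes removing information irrelevant to detection and segmenting the text into words. Second, text features are extracted. Finally, the similarity between the suspect text and the source text is computed; if it exceeds a preset threshold, the suspect text is flagged as possibly plagiarized. Text preprocessing and feature extraction are the key research problems and the main difficulties in plagiarism detection. This thesis focuses on these two aspects, and its main work is as follows:

1. Text preprocessing. Most current plagiarism detection methods for Chinese process the text only superficially and ignore the distinction between single-character and multi-character words. As a result, feature extraction is incomplete and detection accuracy is low. To address this, we propose a preprocessing method that merges whole words: after word segmentation, words that together carry a single meaning are merged according to the semantic relations between neighboring words, and the merged text is the output of preprocessing. Experiments show that merging whole words reduces the number of computations in later stages and gives feature extraction a better basis, which improves detection accuracy.

2. Text feature extraction. Feature extraction selects text blocks that represent the text. The selected blocks should carry information that characterizes the text, including semantic information and some structural information, so that detection accuracy is as high as possible. Existing extraction methods, however, extract incomplete features, produce too many features, require many computations, and have high time complexity. To address these problems, we propose applying a second round of feature extraction to the preprocessed text, which improves the precision of the features and reduces their length. Text is represented by digital fingerprints: every text is converted into a set of fingerprints, the frequency of each fingerprint is counted, and the similarity between fingerprint sets is computed with a match-counting similarity measure. Experiments show that the extracted features represent the text accurately and are of moderate length.

3. Chinese text plagiarism detection based on secondary feature extraction. The proposed whole-word-merging preprocessing method is used to process the text, and the proposed secondary feature extraction method is used to extract text features; together they implement the detection method. Experiments show that both the detection accuracy and the recall of this method improve markedly.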
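The fingerprint-and-threshold pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the fingerprint function, the block size, the similarity formula, or the threshold value, so everything concrete here is an assumption made for illustration: the function names (`fingerprints`, `match_similarity`, `is_suspicious`), the use of k consecutive tokens as a text block, the truncated MD5 hash, and the default `threshold=0.5`. The input is assumed to be already segmented (and, in the thesis's method, already passed through whole-word merging, which is not reproduced here).

```python
import hashlib
from collections import Counter


def fingerprints(tokens, k=3):
    """Hash every run of k consecutive tokens to a 32-bit integer and
    count how often each fingerprint occurs (k is an assumed block size)."""
    counts = Counter()
    for i in range(len(tokens) - k + 1):
        block = "\u0001".join(tokens[i:i + k])  # separator avoids ambiguous joins
        digest = hashlib.md5(block.encode("utf-8")).digest()
        counts[int.from_bytes(digest[:4], "big")] += 1
    return counts


def match_similarity(suspect, source):
    """Share of the suspect's fingerprints (counted with multiplicity)
    that also occur in the source. An assumed formula, not the thesis's."""
    total = sum(suspect.values())
    if total == 0:
        return 0.0
    matched = sum((suspect & source).values())  # multiset intersection
    return matched / total


def is_suspicious(suspect_tokens, source_tokens, k=3, threshold=0.5):
    """Flag the suspect text when similarity exceeds the preset threshold."""
    sim = match_similarity(fingerprints(suspect_tokens, k),
                           fingerprints(source_tokens, k))
    return sim > threshold, sim


if __name__ == "__main__":
    source = ["文本", "抄袭", "检测", "技术", "的", "研究", "具有", "重要", "意义"]
    suspect = ["文本", "抄袭", "检测", "技术", "的", "研究", "很", "有", "价值"]
    print(is_suspicious(suspect, source))
```

Counting fingerprints with a `Counter` rather than a plain set keeps the frequency information the abstract mentions, so a block that is copied several times contributes several matches. The similarity is deliberately asymmetric (normalized by the suspect text only), which is one common choice when asking how much of a suspect text is covered by a source.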
【Degree-granting institution】: Southwest University (西南大学)
【Degree level】: Master's
【Year conferred】: 2013
【Classification number】: TP391.1
