Streaming Text Data Mining Based on Hierarchical Semantic Structure

Published: 2018-10-16 10:56

[Abstract]: Text, as a basic means of human communication, occupies an extremely important position among unstructured data. Compared with other forms of data, text data is usually of high value, so the automatic analysis and mining of text data has long been an active research topic in computer science. Text data on the Internet is now growing rapidly and is generated continuously, so it can be viewed as a set of continuous text streams. Compared with traditional text data, streaming text data has some new characteristics: 1) much of the data in a text stream is of low quality, which makes it hard to extract useful semantic information; 2) the patterns in a text stream change dynamically, which requires mining techniques to capture these changes accurately. These characteristics pose new challenges to existing text mining techniques. Streaming text data mining is not yet mature, and algorithms that address these challenges are urgently needed. A hierarchy, as a common way of organizing data, not only reflects the inherent relations among data more precisely but is also an important route to adaptive methods, and adaptive methods can automatically match the constantly changing patterns in streaming data. This thesis applies hierarchical structure to streaming text data mining and proposes three methods, covering concept hierarchy construction, rare category detection, and online topic detection, with the aim of improving the performance of streaming text data mining. Finally, building on these methods, it proposes a semi-supervised online hierarchical topic model for streaming text data mining. The specific contributions are as follows.
1) To address the low accuracy of existing concept hierarchy construction methods when extracting semantic relations from informal short texts such as microblogs and user reviews, a multi-way concept hierarchy construction method based on a composite semantic distance is proposed. The composite semantic distance combines the advantages of semantic-dictionary distance and context distance, which secures both the range of applicability of the method and the accuracy of the extracted semantic relations. An improved multi-way agglomerative clustering algorithm is also proposed to build the concept hierarchy; compared with traditional agglomerative clustering, multi-way agglomerative clustering preserves the relative distances between concept pairs. In addition, an improved concept hierarchy similarity measure is proposed that resolves the multiple-matching problem that can occur in its original form. Experimental results show that the concept hierarchies generated by this method are the most similar to the ground-truth concept hierarchies among all compared methods.
2) To address the problem of discovering new concepts or topics in the concept hierarchy or topic hierarchy of a text stream, a rare category detection method based on hierarchical density clustering is proposed. In social networks and news streams, discovering novel documents or emerging topics is valuable, and anomaly detection can play a key role in novelty detection. To improve existing detection methods, the thesis first proposes a semi-supervised density clustering algorithm based on relative distance constraints and kernel functions, Relative Comparison Kernel Mean Shift (RKMS). Compared with its original form, RKMS is more scalable and better suited to hierarchical clustering. Based on RKMS, a hierarchy-based rare category detection method is then proposed. Compared with existing methods of the same kind, it does not require the number of categories to be specified in advance, and the model can be refined step by step by combining active learning with semi-supervised learning. Experimental results show that the method outperforms the other methods with both linear and nonlinear mappings.
3) To address the problem of detecting and tracking topics in a continuously arriving text stream, an online hierarchical topic model, Hierarchical Online Non-negative Matrix Factorization (HONMF), is proposed. Most existing online topic models organize the discovered topics in a flat manner, but treating each topic as an independent entity ignores the latent relations among topics and thus limits the expressive power of these models. To address this, the thesis first extends online dictionary learning and proposes a hierarchical online sparse matrix factorization method that generates topics organized as a hierarchy. Borrowing the idea of Mean Shift clustering, it also proposes a mechanism based on Topic Bandwidth for controlling the structure of the topic hierarchy, which adaptively determines the number of topic nodes and the depth of the hierarchy. In addition, criteria are proposed for detecting emerging and fading topics in an existing topic hierarchy, and the dynamic evolution of the hierarchy is realized based on these criteria. Experimental results show that HONMF finds higher-quality topics in less running time and can track changes in the topic structure.
4) To verify the overall effectiveness of the research approach and to further improve HONMF, a semi-supervised hierarchical online topic detection framework based on semantic relations, Semantic Relation based Semi-supervised Hierarchical Online Non-negative Matrix Factorization (SSHONMF), is proposed, which integrates the preceding work into a single pipeline. The pipeline first generates a concept hierarchy for the specific text mining task from a semantic dictionary and training documents, and adjusts the original document matrix based on the semantic relations in that hierarchy. It then uses HONMF to detect the topic hierarchy in the text stream, while selecting cue documents from the topic hierarchy using the selection criterion of the rare category detection method. Finally, it learns a new similarity measure from the cue documents and uses it in subsequent HONMF runs. Experimental results show that, by combining the preceding methods, SSHONMF improves on the performance of HONMF, which supports the soundness and effectiveness of the research approach.
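The abstract credits Mean Shift style density clustering with removing the need to fix the number of categories or topic nodes in advance. The sketch below illustrates that general idea only. It is plain flat-kernel Mean Shift, not RKMS or the Topic Bandwidth mechanism of HONMF, whose details the abstract does not give. The function names `mean_shift` and `group_modes`, the toy 2-D points, and the bandwidth values are all assumptions made for illustration.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel Mean Shift: return one converged mode estimate per point."""
    modes = [tuple(p) for p in points]
    for _ in range(max_iter):
        largest_move = 0.0
        shifted = []
        for m in modes:
            # Data points inside the bandwidth window around the estimate.
            near = [p for p in points if math.dist(m, p) <= bandwidth]
            if not near:  # defensive guard against rounding at the window edge
                shifted.append(m)
                continue
            # Move the estimate to the mean of the points in its window.
            centre = tuple(sum(axis) / len(near) for axis in zip(*near))
            largest_move = max(largest_move, math.dist(m, centre))
            shifted.append(centre)
        modes = shifted
        if largest_move < tol:
            break
    return modes


def group_modes(modes, merge_radius):
    """Merge mode estimates closer than merge_radius; return centres and labels."""
    centres, labels = [], []
    for m in modes:
        for i, c in enumerate(centres):
            if math.dist(m, c) <= merge_radius:
                labels.append(i)
                break
        else:
            centres.append(m)
            labels.append(len(centres) - 1)
    return centres, labels


# Hypothetical toy data: two dense groups and one small, isolated group.
POINTS = [
    (0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (10.0, 0.0), (10.1, 0.1),
]

for bandwidth in (1.0, 20.0):
    centres, labels = group_modes(mean_shift(POINTS, bandwidth), bandwidth / 2)
    sizes = [labels.count(i) for i in range(len(centres))]
    print(f"bandwidth={bandwidth}: {len(centres)} cluster(s), sizes={sizes}")
```

On this toy data the narrow bandwidth separates the points into three groups, the smallest holding two points, while the wide bandwidth merges everything into one. The number of clusters therefore follows from the bandwidth and the data density rather than from a preset count, and a small, well-separated group is a natural candidate for a rare category.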
[Degree-granting institution]: Zhejiang University
[Degree level]: Doctoral
[Year conferred]: 2016
[Classification number]: TP391.1
