Streaming Text Data Mining Based on Hierarchical Semantic Structure
[Abstract]: Text is a basic medium of human communication and makes up an important share of unstructured data. Because text data usually carries high value compared with other forms of data, its automatic analysis and mining has long been an active research topic in the field. Text on the Internet now grows rapidly and is generated continuously, so it can be viewed as a continuous text stream. Compared with traditional text data, streaming text has two new characteristics: 1) much of the data in a text stream is of low quality, which makes it harder to extract useful semantic information; 2) the patterns in a text stream change dynamically, so mining techniques must capture these changes accurately. These characteristics pose new challenges for existing text mining techniques. Streaming text mining is not yet mature, and algorithms that address these challenges are urgently needed. A hierarchy is a common way of organizing data: it reflects the inherent relations among data more accurately, and it is also an important means of building adaptive methods, which can automatically match the changing patterns in streaming data. This thesis applies hierarchical structure to streaming text data mining. It proposes three methods, covering concept hierarchy construction, rare category detection, and online topic detection, to improve the performance of streaming text mining. Finally, building on these methods, it presents a semi-supervised online hierarchical topic model for streaming text data mining.
The specific contributions of this thesis are as follows.

1) A multi-way concept hierarchy construction method based on a composite semantic distance is proposed. It addresses the low precision with which existing concept hierarchy construction methods extract semantic relations from informal short texts such as microblog posts and user comments. The composite semantic distance combines the advantages of semantic dictionary distance and context distance, which preserves both the method's range of application and the accuracy of the extracted semantic relations. An improved multi-way agglomerative clustering algorithm is also proposed to build the concept hierarchy. Unlike traditional agglomerative clustering, multi-way agglomerative clustering preserves the relative near-far relations between concept pairs. In addition, an improved concept hierarchy similarity criterion is proposed that resolves the multiple-matching problem that can occur in its original form. Experimental results show that, among all compared methods, the concept hierarchy generated by this method is the most similar to the ground-truth concept hierarchy.

2) A rare category detection method based on hierarchical density clustering is proposed for discovering new concepts or topics in the concept hierarchy or topic hierarchy of a text stream. In social networks and news streams, discovering new documents or emerging topics is valuable, and anomaly detection plays a key role in detecting such new data. To improve on existing detection methods, this thesis proposes a semi-supervised density clustering algorithm (RKMS) based on relative distance constraints and kernel functions. Compared with its original form, RKMS is more extensible and better suited to hierarchical clustering scenarios. Building on RKMS, the thesis then presents a hierarchy-based rare category detection method. Compared with earlier methods of this kind, it does not require the number of categories to be specified in advance, and it refines the model step by step by combining active learning with semi-supervised learning. Experimental results show that the method outperforms the alternatives under both linear and nonlinear mappings.

3) An online hierarchical topic model (HONMF) is proposed for detecting and tracking topics in a continuously arriving text stream. Most existing online topic models organize the discovered topics in a flat manner, treating each topic as an independent entity and ignoring potential relations among topics, which limits the expressive power of these models. To address this, the thesis first extends the online dictionary learning method into a hierarchical online sparse matrix factorization method that produces topics organized as a hierarchy. It also proposes a topic hierarchy control mechanism based on topic bandwidth (Mean Shift) that adaptively determines the number and depth of topic nodes. In addition, it defines criteria for detecting emerging and disappearing topics in the existing topic hierarchy and uses them to evolve the hierarchy dynamically. Experimental results show that HONMF finds higher-quality topics in less running time and can track changes in the topic structure.

4) To verify the overall effectiveness of this research route and further improve the performance of HONMF, a semi-supervised hierarchical online topic detection framework based on semantic relations (SSHONMF) is proposed, which combines the work described in this thesis into a single pipeline. The pipeline first generates a concept hierarchy for the specific text mining task from a semantic dictionary and training documents, and adjusts the original document matrix according to the semantic relations. It then uses HONMF to detect the topic hierarchy in the text stream, while selecting thread documents from the topic hierarchy using the selection criterion of the rare category detection method described above. Finally, it learns a new similarity measure from the thread documents and applies it in subsequent HONMF processing. Experimental results show that SSHONMF outperforms HONMF by combining these methods, which supports the soundness and effectiveness of the research route.
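The abstract relies on Mean Shift bandwidth in two places: the rare category detector does not need the number of categories in advance, and the topic hierarchy decides its own number of nodes. The abstract does not give the RKMS or HONMF algorithms, so the following is only a minimal sketch of standard Gaussian-kernel Mean Shift, not the thesis's method. The function name `mean_shift`, the merge threshold of half a bandwidth, and the toy points are assumptions chosen for illustration. It shows how the bandwidth alone, rather than a preset cluster count, determines how many modes are found.

```python
import numpy as np

def mean_shift(points, bandwidth, n_iter=50, tol=1e-6):
    """Standard Gaussian-kernel mean shift (illustrative, not RKMS/HONMF).

    Returns (modes, labels). The number of modes is not an input;
    it follows from the bandwidth.
    """
    points = np.asarray(points, dtype=float)
    shifted = points.copy()
    for _ in range(n_iter):
        # Squared distances from every current position to every data point.
        diff = shifted[:, None, :] - points[None, :, :]
        sq_dist = np.sum(diff ** 2, axis=2)
        # Gaussian kernel weights, then move each position to its weighted mean.
        weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
        new_shifted = weights @ points / weights.sum(axis=1, keepdims=True)
        converged = np.max(np.abs(new_shifted - shifted)) < tol
        shifted = new_shifted
        if converged:
            break
    # Merge converged positions closer than half a bandwidth
    # (assumed merge rule, chosen for simplicity).
    modes = []
    labels = np.empty(len(points), dtype=int)
    for i, p in enumerate(shifted):
        for k, m in enumerate(modes):
            if np.linalg.norm(p - m) < bandwidth / 2:
                labels[i] = k
                break
        else:
            modes.append(p)
            labels[i] = len(modes) - 1
    return np.array(modes), labels

if __name__ == "__main__":
    # Two small, well-separated groups of toy 2-D points.
    pts = [[0, 0], [0.2, 0], [0, 0.2], [0.2, 0.2],
           [10, 10], [10.2, 10], [10, 10.2], [10.2, 10.2]]
    for bw in (1.0, 50.0):
        modes, _ = mean_shift(pts, bandwidth=bw)
        print(f"bandwidth={bw}: {len(modes)} mode(s)")
```

On this toy data a narrow bandwidth separates the two groups, while a very wide one merges them into a single mode. This is the sense in which a bandwidth parameter can control the granularity of a cluster or topic structure.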
[Degree-granting institution]: Zhejiang University
[Degree level]: Doctoral
[Year degree granted]: 2016
[Classification number]: TP391.1
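Related doctoral dissertations (top 1)
1 Tu Ding (涂鼎); Streaming Text Data Mining Based on Hierarchical Semantic Structure [D]; Zhejiang University; 2016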