Research on Frequent Itemset Mining Algorithms for Multiple Data Streams

Published: 2018-05-15 23:35

Topic: data mining + multiple data streams; Source: Shandong Normal University, master's thesis, 2017

【Abstract】: With the rapid development of Internet technology across many fields, network data now takes increasingly diverse forms. Among them, the data stream has emerged as a new form of data in a wide range of applications, for example data from sensor networks, financial data in financial applications, and geographic positions obtained by GPS positioning systems. Because such data is unbounded, continuous, high-speed, and massive, traditional data mining techniques are difficult to apply directly to discovering useful information in it, so data stream mining is a research problem of considerable importance. This thesis takes frequent itemset mining algorithms for multiple data streams as its object of study. It first presents the background and significance of the topic and summarizes the state of research on it in China and abroad. It then describes the techniques applied during data processing. Finally, it proposes two frequent itemset mining algorithms for multi-data-stream environments. The main work falls into the following three parts.

(1) The thesis studies the data storage structure used by frequent itemset mining algorithms for multiple data streams and designs a compressed frequent pattern tree based on the FP-Tree. After an in-depth analysis of the characteristics and representations of data streams, it designs a prefix-tree storage structure based on lexicographic order and introduces a logarithmic tilted-time window model into that structure. The window model incrementally updates and retains the counts of frequent itemsets, which to some extent improves memory utilization and the space complexity of the algorithm.

(2) The thesis studies the problem of mining collaborative frequent itemsets across multiple data streams and improves a mining algorithm for them based on the sliding window model. A collaborative frequent itemset across multiple data streams is a group of objects that frequently appear together, within a short period of time, in one data stream or in several. First, a byte-sequence-based sliding-window mining algorithm finds the potential frequent itemsets and the frequent itemsets in each data stream. Second, a frequent pattern tree is built to store the potential frequent itemsets and frequent itemsets of the multiple data streams, and the frequency of each itemset recorded in the logarithmic tilted-time table of the tree is updated incrementally. Finally, the collaborative frequent itemsets of the multiple data streams are obtained by aggregating and analyzing these results.

(3) The thesis studies collaborative frequent itemset mining for multiple data streams in a distributed environment and parallelizes the mining algorithm. In the current big-data setting, data streams are growing sharply in scale, arrive very quickly, and require highly timely results. A single compute node cannot handle data at this scale, so traditional centralized frequent itemset mining algorithms cannot cope with ever-larger data streams. To address this, the thesis adopts a parallel computing model and designs a distributed index structure that can be spread across different compute nodes, making it possible to efficiently discover the collaborative frequent itemsets of multiple data streams in a distributed environment.
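
The abstract names a logarithmic tilted-time window (对数倾斜时间窗口) but does not describe its layout. As a rough illustration only, the Python sketch below shows one common way such a window can be organized: each level keeps at most two counts, each summarizing 2^i batches, and the two oldest counts of a full level are merged into the next coarser level. The class name `LogTiltedTimeWindow` and its methods are hypothetical and are not taken from the thesis; the thesis's actual structure may differ.

```python
class LogTiltedTimeWindow:
    """Hypothetical sketch of a logarithmic tilted-time window for one itemset.

    Assumption (not from the thesis): level i holds at most two counts,
    newest first, and each count at level i summarizes 2**i batches.
    """

    def __init__(self):
        self.levels = []  # levels[i] is a list of at most two counts

    def add_batch(self, count):
        """Record the itemset's support count in the newest batch."""
        carry = count
        level = 0
        while carry is not None:
            if level == len(self.levels):
                self.levels.append([])
            slots = self.levels[level]
            slots.insert(0, carry)  # the newest count goes first
            if len(slots) > 2:
                # Level is over capacity: merge the two oldest counts and
                # push the merged count down to the next, coarser level.
                oldest = slots.pop()
                second_oldest = slots.pop()
                carry = oldest + second_oldest
            else:
                carry = None
            level += 1

    def total(self):
        """Total count over all retained history."""
        return sum(sum(slots) for slots in self.levels)

    def slots(self):
        """List of (batches_covered, count) pairs, newest first."""
        return [(2 ** i, c) for i, lvl in enumerate(self.levels) for c in lvl]


if __name__ == "__main__":
    window = LogTiltedTimeWindow()
    for _ in range(8):
        window.add_batch(1)  # the itemset appears once in every batch
    print(window.slots())
    print(window.total())
```

Under these assumptions the number of stored counts grows logarithmically with the number of batches seen, which is the property that lets recent history be kept at fine granularity and older history at coarse granularity.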

【Degree-granting institution】: Shandong Normal University
【Degree level】: Master's
【Year degree conferred】: 2017
【Classification number】: TP311.13

【References】