Data Mining and User Behavior Analysis Based on Massive Query Logs

Published: 2018-03-22 13:04

Topic: massive logs　Focus: data mining　Source: Beijing University of Posts and Telecommunications, 2013 master's thesis　Type: degree thesis

[Abstract]: With the rapid development of the Internet and search engine technology, the amount of information on the Web keeps growing, and search engines have become the first choice of most users for obtaining online information. User interaction with search engines produces massive query logs, and these logs continue to grow. Because the logs contain a great deal of user-related information, many companies treat them as a key object of study for understanding their users better and attracting more of them. Storing and processing massive logs with distributed technology makes research on query logs more convenient. Major Internet companies now attach increasing importance to their own query logs, hoping that timely and accurate analysis and mining will reveal the user behavior characteristics hidden in them, thereby improving user satisfaction with the search engine and strengthening the company's market competitiveness. This thesis takes massive query logs as its object of processing. The main work is as follows: (1) Research on log preprocessing techniques. The thesis studies data cleaning, user identification, session identification, path completion, and transaction identification, together with the related algorithms, and combines these algorithms with distributed technology to implement a Hadoop-based log preprocessing pipeline in preparation for the subsequent data mining. (2) Design of a user log mining system. Given the massive volume of logs, traditional data storage and computing methods are ill-suited to search engine user behavior analysis. To address this, the thesis proposes mining massive logs with the MapReduce programming framework. User behavior is modeled from the query terms, clicked URLs, and user-identifying IDs recorded in the logs and is represented as feature vectors; a formula for computing the similarity between users is given; the feasibility of distributing the K-means algorithm is analyzed; and detailed steps for the distributed implementation are provided. Experiments show that the algorithm clusters users effectively and performs well on massive data. (3) Analysis of user behavior. The analysis covers the log volume, the number of users, and the relationship between the two; the number, length, and character composition of user query terms and the most common query terms; the total number of clicked URLs, URL depth, and the most common URLs; and the relationship between the order of the results returned by the search engine and the order in which users click them. Analyzing the logs from these angles yields the characteristics of user behavior, which provide a reference for improving the interaction between search engines and users.
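The abstract mentions representing each user as a feature vector and distributing K-means over MapReduce, but gives neither the similarity formula nor the job layout. The following is only a minimal single-process sketch of how one such round could look. It assumes sparse vectors built from query terms and clicked URLs and uses cosine similarity as a stand-in for the thesis's formula; the names `cosine`, `map_assign`, `reduce_mean`, `kmeans_iteration` and the sample users are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts (assumed metric)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def map_assign(user_id, vec, centroids):
    """Map step: emit (index of the most similar centroid, (user id, vector))."""
    best = max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
    return best, (user_id, vec)


def reduce_mean(members):
    """Reduce step: average all vectors assigned to one cluster."""
    total = defaultdict(float)
    for _, vec in members:
        for k, w in vec.items():
            total[k] += w
    n = len(members)
    return {k: w / n for k, w in total.items()}


def kmeans_iteration(users, centroids):
    """One MapReduce-style round: map, group by key, reduce."""
    groups = defaultdict(list)
    for user_id, vec in users.items():
        key, value = map_assign(user_id, vec, centroids)
        groups[key].append(value)  # stands in for the shuffle phase
    # Keep the old centroid when a cluster receives no members.
    new_centroids = [
        reduce_mean(groups[i]) if groups.get(i) else centroids[i]
        for i in range(len(centroids))
    ]
    assignment = {uid: i for i, ms in groups.items() for uid, _ in ms}
    return new_centroids, assignment


# Hypothetical users: features are query terms ("q:") and clicked URLs ("url:").
users = {
    "u1": {"q:hadoop": 1.0, "url:apache.org": 1.0},
    "u2": {"q:hadoop": 1.0, "q:mapreduce": 1.0},
    "u3": {"q:recipes": 1.0, "url:food.example": 1.0},
    "u4": {"q:recipes": 1.0, "q:cake": 1.0},
}

centroids = [dict(users["u1"]), dict(users["u3"])]  # naive seeding for the sketch
assignment = {}
for _ in range(10):  # small cap on rounds
    centroids, new_assignment = kmeans_iteration(users, centroids)
    if new_assignment == assignment:  # converged: no user changed cluster
        break
    assignment = new_assignment
print(assignment)
```

In an actual Hadoop job the map and reduce functions would run on separate nodes, with the current centroids distributed to every mapper at the start of each round; the loop here only imitates that driver logic in memory.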
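[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3;TP311.13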
