Feature Selection Methods for Scoring Information Systems and Their Applications

Published: 2018-05-01 04:41

Topic: feature selection + order relation; Source: Southwest Petroleum University, master's thesis, 2017

[Abstract]: Examinations are an important means of assessing teaching quality. By analyzing examination papers, we can not only understand how well students have learned but also uncover problems in question setting and paper assembly, which has practical value for evaluating teaching and standardizing question setting. However, existing paper-evaluation indicators reflect only the statistical characteristics of a paper, and the final verdict is one-dimensional: "good" or "not good". Feature selection can compensate for this shortcoming by mining the regularities hidden in examination data and providing decision support for teachers. This thesis therefore proposes a data mining scheme for examination papers that is built on a paper scoring information system and centered on a heuristic feature selection algorithm. First, the data model of the paper scoring information system is defined: each column represents an evaluation item, such as a question or a knowledge point, and each row represents a test object, such as an examinee or a survey respondent. The model establishes a mapping from a test object's scores to that object's rank, which in turn leads to the data model of an order-relation information system that stores the ranks of all test objects under any combination of evaluation items. Second, one strict attribute reduction problem and two relaxed feature selection subproblems are posed. The attribute reduction problem takes an order-relation information system as input and outputs an attribute subset; the constraint is that the order relation of the subset is consistent with that of the total score, and the objective is to minimize the cardinality of the subset. In practice, however, requiring a consistent order relation is very strict and can make the problem unsolvable: either no attribute subset can be found, or no element can be removed from the full set of evaluation items. Relaxing the consistency constraint into a similarity constraint is therefore necessary to obtain more compact and meaningful feature subsets. In the feature selection problem, the constraint is that the similarity between the order relation of the feature subset and that of the total score meets a threshold specified in advance by an expert; the other parameters are the same as in the attribute reduction problem. Because the constraint is relaxed, the feature selection problem may have multiple solutions, so it is decomposed into two subproblems: finding one optimal feature subset and finding all optimal feature subsets. Third, for the similarity computation required by the feature selection problem, a dominance-relation similarity is proposed, and two classical measures, Manhattan similarity and cosine similarity, are improved. To make the three measures comparable, the general properties of a similarity measure are stated first; the extent to which each measure satisfies these properties is then established theoretically, and the cases in which each measure attains its extreme values (0 and 1) are given. Finally, because the feature selection problem defined above is not monotonic, finding an optimal solution is very difficult. A greedy algorithm is therefore proposed to quickly obtain an optimal solution, and a backtracking algorithm is improved to verify the results of the heuristic algorithm. Extensive experiments are carried out on real examination data. The dataset consists of data from 8 classes, and a standardized version of the dataset is also obtained. The experimental results show that: 1) in the current setting (analysis of data structures examination papers), the attribute reduction problem that preserves a consistent order relation indeed cannot yield a meaningful solution; 2) the greedy algorithm finds the optimal solution in most cases, and its accuracy is higher on the original dataset than on the standardized dataset; 3) Manhattan similarity is the most reasonable, dominance-relation similarity is second, and cosine similarity is the worst; 4) the results obtained by the feature selection method are meaningful; 5) the feature selection results can provide teachers with meaningful suggestions.
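The pipeline outlined in the abstract (scores mapped to ranks, rank similarity against the total score, greedy selection under a threshold) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's exact definitions, so everything here is an assumption made for illustration: the function names, the tie-handling rule for ranks, the simple normalization used for the Manhattan-style similarity, and the forward-selection form of the greedy step. It is not the author's algorithm.

```python
def ranks(scores):
    # Assumed ranking rule: rank = 1 + number of strictly higher scores,
    # so tied objects share the same rank.
    return [1 + sum(other > s for other in scores) for s in scores]


def subset_ranks(table, subset):
    # table: rows are test objects, columns are evaluation items.
    # Rank the objects by their summed score over the chosen items.
    totals = [sum(row[j] for j in subset) for row in table]
    return ranks(totals)


def manhattan_similarity(r1, r2):
    # Hypothetical stand-in for the thesis's improved Manhattan similarity:
    # 1 minus the rank distance divided by a simple upper bound n * (n - 1).
    n = len(r1)
    if n < 2:
        return 1.0
    dist = sum(abs(a - b) for a, b in zip(r1, r2))
    return 1.0 - dist / (n * (n - 1))


def greedy_select(table, threshold):
    # Forward greedy selection (an assumed variant): repeatedly add the item
    # that brings the subset ranking closest to the total-score ranking,
    # and stop as soon as the similarity threshold is met.
    n_items = len(table[0])
    target = subset_ranks(table, range(n_items))
    chosen = []
    remaining = set(range(n_items))
    while remaining:
        best = max(
            sorted(remaining),  # sorted for a deterministic tie-break
            key=lambda j: manhattan_similarity(
                subset_ranks(table, chosen + [j]), target
            ),
        )
        chosen.append(best)
        remaining.remove(best)
        if manhattan_similarity(subset_ranks(table, chosen), target) >= threshold:
            break
    return sorted(chosen)


# Made-up scores: 4 test objects (rows) x 3 evaluation items (columns).
example_table = [
    [10, 5, 3],
    [8, 5, 1],
    [6, 4, 2],
    [2, 1, 0],
]
selected = greedy_select(example_table, threshold=1.0)
```

In this made-up table the first item alone orders the four objects exactly as the total score does, so the sketch stops after selecting a single item. Because a greedy step never revisits earlier choices, it is not guaranteed to return a minimum-size subset in general, which is consistent with the abstract's use of a backtracking search to check the heuristic's results.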

[Degree-granting institution]: Southwest Petroleum University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: G424.74; TP311.13

[References]