
A new internal clustering validation index for categorical data based on concentration of attribute values

  • Abstract: For categorical data, a new intra-cluster similarity based on the concentration of attribute values (CONC) is defined from the degree to which data objects concentrate on particular attribute values; it measures the similarity among the data objects within a cluster of a clustering result. A new inter-cluster dissimilarity based on the discrepancy of strength vectors (dissimilarity based on discrepancy of SVs, DCRP) is defined from the degree of difference between the characteristic attribute values of different clusters; it measures the dissimilarity between two clusters. Based on CONC and DCRP, a new internal clustering validation index for categorical data (clustering validation based on concentration of attribute values, CVC) is proposed. It has three features: (1) when evaluating the intra-cluster similarity of each cluster, it relies not only on the characteristics of the data objects within the cluster but also on the information of the whole dataset; (2) it evaluates the dissimilarity between two clusters by the differences of several characteristic attribute values, ensuring that no useful clustering information is lost while the influence of noise is eliminated; (3) it eliminates the influence of the number of data objects when evaluating intra-cluster similarity and inter-cluster dissimilarity. Experiments were carried out on datasets from the University of California, Irvine (UCI) machine learning repository, comparing CVC with internal indices such as category utility (CU), categorical data clustering with subjective factors (CDCS), and the information-entropy-based internal index (IE); the external index normalized mutual information (NMI) was used to verify the internal evaluation results. The experiments show that the CVC index evaluates clustering results more effectively than the other internal indices. In addition, unlike NMI, CVC requires no information beyond the dataset itself and is therefore more practical.
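Among the internal baselines compared above, category utility (CU) has a standard closed form: the average, over clusters, of the cluster-weighted gain in expected attribute-value predictability relative to the whole dataset. A minimal Python sketch of that standard formula; the function name and the tuple-of-attribute-values data layout are illustrative, not taken from the paper:

```python
from collections import Counter

def category_utility(clusters):
    """Standard category utility for categorical data.

    `clusters` is a list of clusters; each cluster is a list of objects,
    and each object is a tuple of categorical attribute values.
    """
    objects = [obj for cluster in clusters for obj in cluster]
    n = len(objects)
    k = len(clusters)
    n_attrs = len(objects[0])

    # Baseline term: sum over attributes of squared marginal
    # probabilities P(A_i = v)^2 for the whole dataset.
    base = 0.0
    for i in range(n_attrs):
        counts = Counter(obj[i] for obj in objects)
        base += sum((cnt / n) ** 2 for cnt in counts.values())

    # Cluster-conditional term P(A_i = v | c)^2, weighted by P(c),
    # averaged over the k clusters.
    cu = 0.0
    for cluster in clusters:
        m = len(cluster)
        cond = 0.0
        for i in range(n_attrs):
            counts = Counter(obj[i] for obj in cluster)
            cond += sum((cnt / m) ** 2 for cnt in counts.values())
        cu += (m / n) * (cond - base)
    return cu / k
```

Two perfectly homogeneous clusters score high (each attribute value is fully predictable given the cluster), while clusters whose value distributions match the dataset marginals score zero.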

     

    Abstract: Clustering is a main task of data mining whose purpose is to identify the natural structure of a dataset. The results of cluster analysis depend not only on the nature of the data itself but also on a priori conditions such as the clustering algorithm, the similarity/dissimilarity measure, and the parameters. For data without a clustering structure, clustering results need to be evaluated; for data with a clustering structure, the different results obtained under different algorithms and parameters also need to be optimized through clustering validation. Clustering validation is therefore vital to clustering applications, especially when external information is unavailable, and it is applied to algorithm selection, parameter determination, and determination of the number of clusters. Most traditional internal clustering validation indices designed for numerical data fail on categorical data, a common data type whose attribute values are discrete and unordered, and the existing measures for categorical data have their own limitations in different application circumstances. In this paper, a new similarity based on the concentration ratio of every attribute value, called CONC, which evaluates the similarity of objects within a cluster, was defined. Similarly, a new dissimilarity based on the discrepancy of characteristic attribute values, called DCRP, which evaluates the dissimilarity between two clusters, was defined. A new internal clustering validation index, called CVC, based on CONC and DCRP, was then proposed.
Compared to other indices, CVC has three characteristics: (1) it evaluates the compactness of a cluster based on the information of the whole dataset, not only that of the cluster itself; (2) it evaluates the separation between two clusters by several characteristic attribute values, so that no clustering information is lost and the negative effects caused by noise are eliminated; (3) it evaluates compactness and separation without influence from the number of objects. Furthermore, UCI benchmark datasets were used to compare the proposed index with other internal clustering validation indices (CU, CDCS, and IE), and an external index (NMI) was used to evaluate the effectiveness of these internal indices. According to the experimental results, CVC is more effective than the other internal clustering validation indices. In addition, CVC, as an internal index, is more applicable than the external NMI index because it can evaluate clustering results without external information.
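The external NMI index used for verification has a standard definition: the mutual information between the clustering and the ground-truth labels, normalized by their entropies. A minimal self-contained sketch follows; the geometric-mean normalization used here is one common convention (others use the arithmetic mean or the maximum), and the labels are illustrative, not from the paper's experiments:

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings of the same objects."""
    n = len(labels_true)
    ct = Counter(labels_true)                 # marginal counts of true labels
    cp = Counter(labels_pred)                 # marginal counts of predicted labels
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information: sum over the joint distribution of
    # P(u, v) * log( P(u, v) / (P(u) * P(v)) ).
    mi = sum((c / n) * log(n * c / (ct[u] * cp[v]))
             for (u, v), c in joint.items())

    # Entropies of the two labelings.
    h_t = -sum((c / n) * log(c / n) for c in ct.values())
    h_p = -sum((c / n) * log(c / n) for c in cp.values())

    if h_t == 0 or h_p == 0:                  # degenerate single-label case
        return 1.0 if h_t == h_p else 0.0
    return mi / sqrt(h_t * h_p)               # geometric-mean normalization
```

NMI is invariant to label permutation: a clustering that matches the ground truth up to renaming of the clusters scores 1.0, while a clustering statistically independent of the ground truth scores 0.0.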

     
