A new internal clustering validation index for categorical data based on concentration of attribute values
Abstract: Clustering is a main task of data mining, and its purpose is to identify natural structures in a dataset. The results of cluster analysis depend not only on the nature of the data itself but also on prior conditions such as the clustering algorithm, the similarity/dissimilarity measure, and the parameter settings. For data without a clustering structure, clustering results need to be evaluated; for data with a clustering structure, the different results obtained under different algorithms and parameters also need to be further optimized by clustering validation. Moreover, clustering validation is vital to clustering applications, especially when external information is not available: it is applied in algorithm selection, parameter determination, and determination of the number of clusters. Most traditional internal clustering validation indices for numerical data fail to measure categorical data. Categorical data is a common data type whose attribute values are discrete and unordered, and the existing measures for it have limitations in different application circumstances.

In this paper, a new within-cluster similarity based on the concentration of attribute values, called CONC (similarity based on concentration of attribute values), was defined to evaluate the similarity of the objects in a cluster; using the whole dataset's information, it measures how strongly the objects of a cluster concentrate on particular attribute values. Similarly, a new between-cluster dissimilarity based on the discrepancy of characteristic attribute values (strength vectors, SVs), called DCRP (dissimilarity based on discrepancy of SVs), was defined to evaluate the dissimilarity between two clusters. Based on CONC and DCRP, a new internal clustering validation index for categorical data, called CVC (clustering validation based on concentration of attribute values), was proposed. Compared with other indices, CVC has three characteristics: (1) it evaluates the compactness of a cluster using the information of the whole dataset, not only that of the cluster itself; (2) it evaluates the separation between two clusters by several characteristic attribute values, so that effective clustering information is not lost while the negative effects of noise are eliminated; (3) it evaluates compactness and separation without being influenced by the number of objects. Furthermore, UCI (University of California, Irvine) benchmark datasets were used to compare the proposed index with other internal clustering validation indices, namely category utility (CU), categorical data clustering with subjective factors (CDCS), and an information-entropy-based index (IE); the external index normalized mutual information (NMI) was used to assess the effectiveness of these internal indices. According to the experimental results, CVC evaluates clustering results more effectively than the other internal validation indices. In addition, as an internal index, CVC is more practical than the external NMI index, because it can evaluate clustering results without requiring information beyond the dataset itself.
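The external index used to benchmark the internal indices, normalized mutual information, has a standard textbook definition: NMI(A, B) = I(A; B) / sqrt(H(A) * H(B)). The following is a minimal self-contained sketch of that standard formulation (not code from the paper):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information: I(A; B) / sqrt(H(A) * H(B))."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    # Entropies of the two labelings (natural log)
    h_a = -sum(c / n * math.log(c / n) for c in ca.values())
    h_b = -sum(c / n * math.log(c / n) for c in cb.values())
    # Mutual information from the joint and marginal frequencies
    mi = sum(c / n * math.log(c * n / (ca[a] * cb[b]))
             for (a, b), c in cab.items())
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # a constant labeling carries no information
    return mi / math.sqrt(h_a * h_b)
```

A clustering that matches the ground-truth classes up to a relabeling scores 1; a labeling independent of the classes scores near 0, which is why NMI can serve as an external yardstick for the internal indices.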
Figure 2. Distribution of attribute values for six subsets: (a) distribution of subset C1; (b) distribution of subset C2; (c) distribution of subset C3; (d) distribution of subset C4; (e) distribution of subset C5; (f) distribution of subset C6
Table 1. Distribution vector (DV) for six subsets
Attribute value            C1      C2      C3      C4      C5      C6
J                          5       2       6       6       14      140
K                          8       8       5       5       1       10
L                          1       3       5       5       1       10
M                          1       2       5       5       1       10
N                          —       —       0       —       1       10
O                          —       —       —       —       1       10
P                          —       —       —       —       1       10
Number of domain values    4       4       5       4       7       7
Number of objects          15      15      21      21      20      200
ACONC                      23.17   16.5    11.4    0.5     48.29   4828.57
ACONCmax                   112.5   112.5   176.4   220.5   114.29  11428.57
CONC of the attribute      0.2059  0.1467  0.0646  0.0023  0.4225  0.4225
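The formal definitions of ACONC, ACONCmax, and CONC are not part of this excerpt, but the values in Table 1 are consistent with reading ACONC as 2/(m-1) * sum_i (f_i - n/m)^2 for the frequency vector (f_1, ..., f_m) of one attribute over n objects, with ACONCmax = 2n^2/m (the value when all n objects share a single attribute value) and CONC = ACONC / ACONCmax. A sketch under that inferred reading, which reproduces the table's numbers:

```python
def aconc(freqs):
    """Concentration of one attribute, inferred from Table 1:
    2/(m-1) * sum_i (f_i - n/m)^2 over the frequency vector."""
    n, m = sum(freqs), len(freqs)
    mean = n / m
    return 2.0 / (m - 1) * sum((f - mean) ** 2 for f in freqs)

def conc(freqs):
    """CONC of the attribute: ACONC normalized by ACONCmax = 2*n^2/m,
    reached when all n objects concentrate on a single value."""
    n, m = sum(freqs), len(freqs)
    return aconc(freqs) / (2.0 * n * n / m)

# Subset C1 from Table 1: frequencies of J, K, L, M over 15 objects
print(round(aconc([5, 8, 1, 1]), 2), round(conc([5, 8, 1, 1]), 4))
# -> 23.17 0.2059
```

The same functions reproduce the remaining columns of Table 1 (e.g., CONC of C5 and C6 is 0.4225 in both cases, since scaling every frequency by 10 leaves the normalized concentration unchanged).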
Table 2. Characteristics of internal clustering validation indices
Index   Considers overall distribution   Affected by cluster centers   Affected by number of clusters   Affected by data scale   Other characteristics
CVC     Yes                              No                            No                               No                       Can eliminate the effect of noise
CU      No                               No                            No                               No                       Does not consider between-cluster separation
CDCS    No                               Yes                           Yes                              Yes                      Cannot eliminate the effect of noise
IE      No                               No                            Yes                              No                       Cannot handle the uniform effect; does not consider between-cluster separation
F       No                               Yes                           Yes                              —                        Cannot eliminate the effect of noise; does not consider between-cluster separation
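For comparison, the category utility (CU) index listed in Table 2 follows the standard definition CU = (1/K) * sum_k P(C_k) * sum_i sum_j [P(A_i = v_j | C_k)^2 - P(A_i = v_j)^2]. A minimal sketch of that standard formula:

```python
from collections import Counter

def category_utility(clusters):
    """Category utility of a partition of categorical objects.

    clusters: list of clusters, each a list of equal-length tuples
    of categorical attribute values."""
    objects = [obj for cl in clusters for obj in cl]
    n, d, k = len(objects), len(objects[0]), len(clusters)
    # Marginal term sum_j P(A_i = v_j)^2, one value per attribute
    base = [sum((c / n) ** 2
                for c in Counter(o[i] for o in objects).values())
            for i in range(d)]
    total = 0.0
    for cl in clusters:
        nc = len(cl)
        # Within-cluster term minus the marginal term, per attribute
        gain = sum(
            sum((c / nc) ** 2
                for c in Counter(o[i] for o in cl).values()) - base[i]
            for i in range(d))
        total += (nc / n) * gain
    return total / k
```

As Table 2 notes, CU rewards clusters whose attribute values are more predictable than in the dataset overall, but it carries no explicit between-cluster separation term.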
Table 3. Distribution vector for two clusters of X
Cluster   Attribute A1 (value: count)   Attribute A2 (value: count)
C1        K1: 3, L1: 5, M1: 12          K2: 10, L2: 1, M2: 6, N2: 3
C2        K1: 30, L1: 7, M1: 3          K2: 3, L2: 20, M2: 2, N2: 15
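The exact DCRP formula is not reproduced in this excerpt. As an illustration only of why the two clusters in Table 3 separate well, one can compare per-attribute relative-frequency vectors (a stand-in for the strength vectors) with, for example, total variation distance:

```python
def freq_vector(counts):
    """Relative frequencies of one attribute's values within a cluster."""
    n = sum(counts.values())
    return {v: c / n for v, c in counts.items()}

def tv_distance(p, q):
    """Total variation distance between two frequency vectors."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in keys)

# Attribute A1 of clusters C1 and C2, counts taken from Table 3
c1_a1 = {'K1': 3, 'L1': 5, 'M1': 12}
c2_a1 = {'K1': 30, 'L1': 7, 'M1': 3}
```

Here tv_distance(freq_vector(c1_a1), freq_vector(c2_a1)) is 0.6: C1 concentrates on M1 while C2 concentrates on K1, so their characteristic values on A1 differ sharply, which is the kind of discrepancy DCRP is designed to capture.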
Table 4. Datasets and parameters for experiments
Dataset         Objects   Dimensions   Original classes   Missing values   Cluster-number range   CVC ω range
Small Soybean   47        35           4                  No               2~10                   1~8
Chess           3196      36           2                  No               2~20                   1~8
Mushroom        8124      21           2                  Yes              2~20                   1~8
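Tables 5–7 below report index values averaged over clustering runs across the parameter ranges above; in practice, an internal index drives model selection by picking the parameter setting that scores best. The following is a skeleton of such a sweep with stand-in clustering and index functions (the paper's actual algorithm and the CVC formula are not reproduced here):

```python
def sweep(data, cluster_fn, index_fn, k_range):
    """Cluster the data once per candidate k, score each partition
    with an internal index, and return the best-scoring k
    (higher = better)."""
    scores = {k: index_fn(data, cluster_fn(data, k)) for k in k_range}
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k]

# Stand-ins for illustration only (not the paper's algorithm or index):
def modulo_clustering(data, k):
    return [i % k for i in range(len(data))]

def dummy_index(data, labels):
    return -abs(len(set(labels)) - 3)  # pretends k = 3 is optimal
```

With real components, cluster_fn would be a categorical clustering algorithm (e.g., k-modes) and index_fn an internal index such as CVC, CU, CDCS, or IE.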
Table 5. Average of clustering results for dataset Small Soybean with various parameters
Clusters  -------------------------- CVC/10^-1 --------------------------  CU      CDCS/10^4  IE/10^-2  NMI/10^-1
          ω=1     ω=2     ω=3     ω=4     ω=5     ω=6     ω=7     ω=8
2         1.4496  2.0501  2.3012  2.4380  2.5240  2.5830  2.6260  2.6587   1.3853  1.6332     24.6496   2.8077
3         1.2041  2.0856  2.5047  2.7448  2.8998  3.0079  3.0876  3.1488   1.5081  1.3996     11.2830   4.7695
4         1.2060  2.4120  3.0389  3.4110  3.6559  3.8288  3.9572  4.0564   1.4370  1.1956     6.2084    7.7629
5         1.1923  2.6662  3.4864  3.9869  4.3209  4.5591  4.7372  4.8753   1.4368  0.9450     2.8487    7.5003
6         0.9293  2.2764  3.0686  3.5628  3.8967  4.1365  4.3168  4.4572   1.2951  0.8182     2.2805    6.3634
7         0.8872  2.3472  3.2464  3.8179  4.2080  4.4900  4.7030  4.8693   1.2176  0.7458     1.6786    4.8953
8         0.7921  2.2404  3.1684  3.7678  4.1807  4.4807  4.7082  4.8863   1.1380  0.6320     1.1927    4.0136
9         0.7221  2.1663  3.1243  3.7521  4.1878  4.5060  4.7480  4.9380   1.0724  0.6412     0.9919    4.6183
10        0.6912  2.1857  3.2081  3.8867  4.3610  4.7089  4.9742  5.1830   1.0300  0.5666     0.8601    4.3090
Table 6. Average of clustering results for dataset Chess with various parameters
Clusters  -------------------------- CVC/10^-1 --------------------------  CU/10^-1  CDCS/10^8  IE/10^-2  NMI/10^-2
          ω=1     ω=2     ω=3     ω=4     ω=5     ω=6     ω=7     ω=8
2         0.5404  0.7642  0.8578  0.9088  0.9409  0.9629  0.9789  0.9911   3.0616    9.1075     2.5055    0.2903
3         0.5418  0.9385  1.1271  1.2351  1.3049  1.3536  1.3894  1.4170   3.6258    8.3664     1.6218    1.3391
4         0.4359  0.8718  1.0984  1.2330  1.3215  1.3840  1.4304  1.4663   4.2637    3.9811     1.0895    1.6117
5         0.3779  0.8451  1.1050  1.2636  1.3695  1.4450  1.5015  1.5452   3.4602    3.8622     0.8798    1.2910
6         0.3511  0.8599  1.1592  1.3459  1.4720  1.5626  1.6307  1.6837   3.3915    3.1607     0.6528    1.8689
7         0.3173  0.8395  1.1611  1.3656  1.5051  1.6060  1.6821  1.7416   3.2804    2.6272     0.5062    1.6133
8         0.2812  0.7953  1.1247  1.3375  1.4840  1.5905  1.6713  1.7345   3.0207    2.3623     0.4474    2.2785
9         0.2663  0.7990  1.1523  1.3838  1.5445  1.6619  1.7512  1.8212   2.9194    1.9676     0.3970    2.3652
10        0.2348  0.7425  1.0898  1.3204  1.4815  1.5996  1.6898  1.7607   2.7256    1.7890     0.3700    2.0684
11        0.2214  0.7344  1.0952  1.3374  1.5077  1.6332  1.7292  1.8048   2.5746    1.8519     0.3187    1.6655
12        0.2100  0.7276  1.1009  1.3542  1.5333  1.6658  1.7673  1.8475   2.5005    1.5328     0.2896    2.3751
13        0.2022  0.7290  1.1179  1.3843  1.5737  1.7142  1.8221  1.9075   2.3970    1.3409     0.2637    3.4893
14        0.1871  0.7000  1.0866  1.3539  1.5449  1.6870  1.7964  1.8831   2.2667    1.2827     0.2394    3.0853
15        0.1737  0.6726  1.0563  1.3238  1.5157  1.6589  1.7694  1.8570   2.1884    1.1587     0.2150    2.4495
16        0.1634  0.6537  1.0377  1.3074  1.5018  1.6472  1.7596  1.8489   2.0798    1.2610     0.2086    1.5850
17        0.1592  0.6563  1.0523  1.3326  1.5354  1.6874  1.8052  1.8989   2.0524    0.9038     0.1913    2.1058
18        0.1520  0.6448  1.0438  1.3281  1.5346  1.6898  1.8102  1.9061   1.9830    0.8960     0.1743    3.1087
19        0.1457  0.6349  1.0371  1.3255  1.5358  1.6942  1.8172  1.9153   1.8953    0.7658     0.1625    2.2609
20        0.1438  0.6431  1.0596  1.3600  1.5798  1.7457  1.8747  1.9777   1.8364    0.7778     0.1660    3.8407
Table 7. Average of clustering results for dataset Mushroom with various parameters
Clusters  -------------------------- CVC/10^-1 --------------------------  CU/10^-1  CDCS/10^9  IE/10^-1  NMI/10^-1
          ω=1    ω=2    ω=3    ω=4     ω=5    ω=6    ω=7    ω=8
2         1.130  1.598  1.793  1.8999  1.967  2.013  2.046  2.072          5.5295    5.4395     7.2900    1.513
3         1.066  1.847  2.218  2.4306  2.568  2.664  2.734  2.788          8.3726    2.9593     4.3948    1.568
4         0.852  1.704  2.147  2.4094  2.582  2.704  2.795  2.865          7.5823    3.3869     3.1450    0.462
5         0.847  1.895  2.477  2.8330  3.070  3.240  3.366  3.464          8.1918    1.8485     2.2165    2.307
6         0.726  1.779  2.397  2.7836  3.044  3.232  3.373  3.482          7.5752    1.7727     1.8483    2.100
7         0.640  1.692  2.341  2.7526  3.034  3.237  3.391  3.511          7.2510    1.0476     1.4268    2.145
8         0.596  1.685  2.383  2.7980  3.144  3.370  3.541  3.675          6.6447    1.1869     1.2363    1.374
9         0.519  1.556  2.244  2.6950  3.008  3.237  3.410  3.547          6.1161    0.9955     1.0594    0.964
10        0.480  1.517  2.227  2.6975  3.027  3.268  3.452  3.597          5.6701    0.8328     0.9542    1.708
11        0.437  1.450  2.162  2.6405  2.977  3.225  3.414  3.563          5.3697    0.5910     0.8196    1.634
12        0.414  1.436  2.172  2.6723  3.026  3.287  3.487  3.646          5.0980    0.4543     0.7651    1.190
13        0.406  1.465  2.247  2.7823  3.163  3.445  3.662  3.834          5.0402    0.4729     0.6505    1.547
14        0.381  1.424  2.211  2.7545  3.143  3.432  3.655  3.831          4.7980    0.3362     0.5851    1.135
15        0.354  1.369  2.150  2.6946  3.085  3.377  3.602  3.780          4.4620    0.3761     0.5321    1.276
16        0.343  1.373  2.179  2.7450  3.153  3.458  3.695  3.882          4.2832    0.2973     0.4927    1.077
17        0.335  1.381  2.215  2.8051  3.232  3.552  3.800  3.997          4.3443    0.1894     0.4490    1.681
18        0.311  1.320  2.137  2.7188  3.142  3.459  3.706  3.902          3.9230    0.3621     0.4277    1.584
19        0.309  1.347  2.200  2.8112  3.257  3.593  3.854  4.062          4.0827    0.1911     0.3681    1.429
20        0.287  1.285  2.118  2.7185  3.158  3.489  3.747  3.953          3.7305    0.1268     0.3684    1.715