<th id="5nh9l"></th><strike id="5nh9l"></strike><th id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"></th><strike id="5nh9l"></strike>
<progress id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"><noframes id="5nh9l">
<th id="5nh9l"></th> <strike id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"></span>
<progress id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"></span><strike id="5nh9l"><noframes id="5nh9l"><strike id="5nh9l"></strike>
<span id="5nh9l"><noframes id="5nh9l">
<span id="5nh9l"><noframes id="5nh9l">
<span id="5nh9l"></span><span id="5nh9l"><video id="5nh9l"></video></span>
<th id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"></th>
<progress id="5nh9l"><noframes id="5nh9l">
  • Indexed by the Engineering Index (EI)
  • Chinese Core Journal
  • Source journal for Chinese Science and Technology Paper Statistics
  • Source journal of the Chinese Science Citation Database


A new internal clustering validation index for categorical data based on concentration of attribute values

FU Li-wei, WU Sen

Citation: FU Li-wei, WU Sen. A new internal clustering validation index for categorical data based on concentration of attribute values[J]. Chinese Journal of Engineering, 2019, 41(5): 682-693. doi: 10.13374/j.issn2095-9389.2019.05.015

doi: 10.13374/j.issn2095-9389.2019.05.015

Funding: National Natural Science Foundation of China (Grant No. 71271027)

Corresponding author: WU Sen, E-mail: wusen@manage.ustb.edu.cn

  • CLC number: TP301
A new internal clustering validation index for categorical data based on concentration of attribute values

  • Abstract: For categorical data, a new intra-cluster similarity measure based on the concentration of attribute values (CONC) is defined through the degree to which data objects concentrate on attribute values; it measures the similarity among the data objects within a cluster. An inter-cluster dissimilarity measure based on the discrepancy of strength vectors (DCRP) is defined through the difference between the characteristic attribute values of different clusters; it measures the dissimilarity between two clusters. Based on CONC and DCRP, a new internal validation index for clustering categorical data, clustering validation based on concentration of attribute values (CVC), is proposed. It has three features: (1) when evaluating intra-cluster similarity, it relies not only on the characteristics of the data objects within each cluster but also on information from the whole dataset; (2) it evaluates the dissimilarity of two clusters through the differences of a few characteristic attribute values, which ensures that no useful clustering information is lost while the influence of noise is eliminated; (3) it removes the influence of the number of data objects when evaluating both intra-cluster similarity and inter-cluster dissimilarity. Experiments were conducted on datasets from the UCI Machine Learning Repository, comparing CVC with internal indices such as category utility (CU), categorical data clustering with subjective factors (CDCS), and an information-entropy-based index (IE); the external index normalized mutual information (NMI) was used to verify the internal evaluations. The experiments show that CVC evaluates clustering results more effectively than the other internal indices. Moreover, unlike NMI, CVC requires no information beyond the dataset itself, which makes it more practical.
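To make the idea of attribute-value concentration concrete, here is a minimal Python sketch of a frequency-concentration score for one categorical attribute within a cluster. The function and its normalization are illustrative assumptions (a Herfindahl-style sum of squared frequencies), not the paper's exact CONC definition.

```python
from collections import Counter

def value_concentration(values):
    """Concentration of one categorical attribute within a cluster:
    sum of squared value counts divided by n^2, so a cluster whose
    objects share one value scores 1.0 and a uniform spread scores low.
    Illustrative only; not the paper's exact CONC formula."""
    counts = Counter(values)
    n = len(values)
    return sum(c * c for c in counts.values()) / (n * n)

# A cluster concentrated on one value scores higher than a uniform one.
peaked = ["J"] * 14 + ["K", "L", "M", "N", "O", "P"]
uniform = ["J", "K", "L", "M"] * 5
print(value_concentration(peaked) > value_concentration(uniform))  # True
```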

     

  • Figure 1.  Schematic diagram for selecting the characteristic attribute values

    Figure 2.  Distribution of attribute values for six subsets: (a) subset C1; (b) subset C2; (c) subset C3; (d) subset C4; (e) subset C5; (f) subset C6

    Figure 3.  Results of CVC with varying parameters: (a) dataset Small Soybean; (b) dataset Chess; (c) dataset Mushroom

    Figure 4.  Results of the indices: (a) dataset Small Soybean; (b) dataset Chess; (c) dataset Mushroom

    Table 1.  Distribution vectors (DV) for six subsets

    | Attribute value | C1 | C2 | C3 | C4 | C5 | C6 |
    |---|---|---|---|---|---|---|
    | J | 5 | 2 | 6 | 6 | 14 | 140 |
    | K | 8 | 8 | 5 | 5 | 1 | 10 |
    | L | 1 | 3 | 5 | 5 | 1 | 10 |
    | M | 1 | 2 | 5 | 5 | 1 | 10 |
    | N | | | 0 | | 1 | 10 |
    | O | | | | | 1 | 10 |
    | P | | | | | 1 | 10 |
    | Domain size | 4 | 4 | 5 | 4 | 7 | 7 |
    | Number of objects | 15 | 15 | 21 | 21 | 20 | 200 |
    | ACONC | 23.17 | 16.5 | 11.4 | 0.5 | 48.29 | 4828.57 |
    | ACONCmax | 112.5 | 112.5 | 176.4 | 220.5 | 114.29 | 11428.57 |
    | CONC of the attribute | 0.2059 | 0.1467 | 0.0646 | 0.0023 | 0.4225 | 0.4225 |
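The distribution vectors in Table 1 can be built directly from raw attribute values; a small sketch, where the sample data reproduces the C1 and C5 columns of Table 1 (the helper itself is an illustration, not code from the paper):

```python
from collections import Counter

# Per-attribute value counts (distribution vectors) for two of the
# subsets in Table 1: C1 has 15 objects over 4 values, C5 has 20 over 7.
subsets = {
    "C1": ["J"] * 5 + ["K"] * 8 + ["L"] + ["M"],
    "C5": ["J"] * 14 + ["K", "L", "M", "N", "O", "P"],
}

for name, values in subsets.items():
    dv = Counter(values)  # the DV: value -> frequency in this subset
    print(name, dict(dv), "domain size:", len(dv), "objects:", sum(dv.values()))
```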

    Table 2.  Characteristics of internal clustering validation indices

    | Index | Considers overall distribution | Affected by cluster centers | Affected by number of clusters | Affected by data size | Other characteristics |
    |---|---|---|---|---|---|
    | CVC | | | | | Can eliminate the influence of noise |
    | CU | | | | | Does not consider inter-cluster separation |
    | CDCS | | | | | Cannot eliminate the influence of noise |
    | IE | | | | | Cannot handle the uniform effect; does not consider inter-cluster separation |
    | F | | | | | Cannot eliminate the influence of noise; does not consider inter-cluster separation |

    Table 3.  Distribution vectors for the two clusters of dataset X

    | Cluster | A1 value | Count | A2 value | Count |
    |---|---|---|---|---|
    | C1 | K1 | 3 | K2 | 10 |
    | | L1 | 5 | L2 | 1 |
    | | M1 | 12 | M2 | 6 |
    | | | | N2 | 3 |
    | C2 | K1 | 30 | K2 | 3 |
    | | L1 | 7 | L2 | 20 |
    | | M1 | 3 | M2 | 2 |
    | | | | N2 | 15 |

    Table 4.  Datasets and parameter settings for the experiments

    | Dataset | Objects | Dimensions | Original classes | Missing values | Range of cluster numbers | Range of CVC parameter |
    |---|---|---|---|---|---|---|
    | Small Soybean | 47 | 35 | 4 | No | 2–10 | 1–8 |
    | Chess | 3196 | 36 | 2 | No | 2–20 | 1–8 |
    | Mushroom | 8124 | 21 | 2 | Yes | 2–20 | 1–8 |

    Table 5.  Averages of clustering results for the Small Soybean dataset with various parameters

    | Clusters | CVC/10⁻¹, ω=1 | ω=2 | ω=3 | ω=4 | ω=5 | ω=6 | ω=7 | ω=8 | CU | CDCS/10⁴ | IE/10⁻² | NMI/10⁻¹ |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 2 | 1.4496 | 2.0501 | 2.3012 | 2.4380 | 2.5240 | 2.5830 | 2.6260 | 2.6587 | 1.3853 | 1.6332 | 24.6496 | 2.8077 |
    | 3 | 1.2041 | 2.0856 | 2.5047 | 2.7448 | 2.8998 | 3.0079 | 3.0876 | 3.1488 | 1.5081 | 1.3996 | 11.2830 | 4.7695 |
    | 4 | 1.2060 | 2.4120 | 3.0389 | 3.4110 | 3.6559 | 3.8288 | 3.9572 | 4.0564 | 1.4370 | 1.1956 | 6.2084 | 7.7629 |
    | 5 | 1.1923 | 2.6662 | 3.4864 | 3.9869 | 4.3209 | 4.5591 | 4.7372 | 4.8753 | 1.4368 | 0.9450 | 2.8487 | 7.5003 |
    | 6 | 0.9293 | 2.2764 | 3.0686 | 3.5628 | 3.8967 | 4.1365 | 4.3168 | 4.4572 | 1.2951 | 0.8182 | 2.2805 | 6.3634 |
    | 7 | 0.8872 | 2.3472 | 3.2464 | 3.8179 | 4.2080 | 4.4900 | 4.7030 | 4.8693 | 1.2176 | 0.7458 | 1.6786 | 4.8953 |
    | 8 | 0.7921 | 2.2404 | 3.1684 | 3.7678 | 4.1807 | 4.4807 | 4.7082 | 4.8863 | 1.1380 | 0.6320 | 1.1927 | 4.0136 |
    | 9 | 0.7221 | 2.1663 | 3.1243 | 3.7521 | 4.1878 | 4.5060 | 4.7480 | 4.9380 | 1.0724 | 0.6412 | 0.9919 | 4.6183 |
    | 10 | 0.6912 | 2.1857 | 3.2081 | 3.8867 | 4.3610 | 4.7089 | 4.9742 | 5.1830 | 1.0300 | 0.5666 | 0.8601 | 4.3090 |

    Table 6.  Averages of clustering results for the Chess dataset with various parameters

    | Clusters | CVC/10⁻¹, ω=1 | ω=2 | ω=3 | ω=4 | ω=5 | ω=6 | ω=7 | ω=8 | CU/10⁻¹ | CDCS/10⁸ | IE/10⁻² | NMI/10⁻² |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 2 | 0.5404 | 0.7642 | 0.8578 | 0.9088 | 0.9409 | 0.9629 | 0.9789 | 0.9911 | 3.0616 | 9.1075 | 2.5055 | 0.2903 |
    | 3 | 0.5418 | 0.9385 | 1.1271 | 1.2351 | 1.3049 | 1.3536 | 1.3894 | 1.4170 | 3.6258 | 8.3664 | 1.6218 | 1.3391 |
    | 4 | 0.4359 | 0.8718 | 1.0984 | 1.2330 | 1.3215 | 1.3840 | 1.4304 | 1.4663 | 4.2637 | 3.9811 | 1.0895 | 1.6117 |
    | 5 | 0.3779 | 0.8451 | 1.1050 | 1.2636 | 1.3695 | 1.4450 | 1.5015 | 1.5452 | 3.4602 | 3.8622 | 0.8798 | 1.2910 |
    | 6 | 0.3511 | 0.8599 | 1.1592 | 1.3459 | 1.4720 | 1.5626 | 1.6307 | 1.6837 | 3.3915 | 3.1607 | 0.6528 | 1.8689 |
    | 7 | 0.3173 | 0.8395 | 1.1611 | 1.3656 | 1.5051 | 1.6060 | 1.6821 | 1.7416 | 3.2804 | 2.6272 | 0.5062 | 1.6133 |
    | 8 | 0.2812 | 0.7953 | 1.1247 | 1.3375 | 1.4840 | 1.5905 | 1.6713 | 1.7345 | 3.0207 | 2.3623 | 0.4474 | 2.2785 |
    | 9 | 0.2663 | 0.7990 | 1.1523 | 1.3838 | 1.5445 | 1.6619 | 1.7512 | 1.8212 | 2.9194 | 1.9676 | 0.3970 | 2.3652 |
    | 10 | 0.2348 | 0.7425 | 1.0898 | 1.3204 | 1.4815 | 1.5996 | 1.6898 | 1.7607 | 2.7256 | 1.7890 | 0.3700 | 2.0684 |
    | 11 | 0.2214 | 0.7344 | 1.0952 | 1.3374 | 1.5077 | 1.6332 | 1.7292 | 1.8048 | 2.5746 | 1.8519 | 0.3187 | 1.6655 |
    | 12 | 0.2100 | 0.7276 | 1.1009 | 1.3542 | 1.5333 | 1.6658 | 1.7673 | 1.8475 | 2.5005 | 1.5328 | 0.2896 | 2.3751 |
    | 13 | 0.2022 | 0.7290 | 1.1179 | 1.3843 | 1.5737 | 1.7142 | 1.8221 | 1.9075 | 2.3970 | 1.3409 | 0.2637 | 3.4893 |
    | 14 | 0.1871 | 0.7000 | 1.0866 | 1.3539 | 1.5449 | 1.6870 | 1.7964 | 1.8831 | 2.2667 | 1.2827 | 0.2394 | 3.0853 |
    | 15 | 0.1737 | 0.6726 | 1.0563 | 1.3238 | 1.5157 | 1.6589 | 1.7694 | 1.8570 | 2.1884 | 1.1587 | 0.2150 | 2.4495 |
    | 16 | 0.1634 | 0.6537 | 1.0377 | 1.3074 | 1.5018 | 1.6472 | 1.7596 | 1.8489 | 2.0798 | 1.2610 | 0.2086 | 1.5850 |
    | 17 | 0.1592 | 0.6563 | 1.0523 | 1.3326 | 1.5354 | 1.6874 | 1.8052 | 1.8989 | 2.0524 | 0.9038 | 0.1913 | 2.1058 |
    | 18 | 0.1520 | 0.6448 | 1.0438 | 1.3281 | 1.5346 | 1.6898 | 1.8102 | 1.9061 | 1.9830 | 0.8960 | 0.1743 | 3.1087 |
    | 19 | 0.1457 | 0.6349 | 1.0371 | 1.3255 | 1.5358 | 1.6942 | 1.8172 | 1.9153 | 1.8953 | 0.7658 | 0.1625 | 2.2609 |
    | 20 | 0.1438 | 0.6431 | 1.0596 | 1.3600 | 1.5798 | 1.7457 | 1.8747 | 1.9777 | 1.8364 | 0.7778 | 0.1660 | 3.8407 |

    Table 7.  Averages of clustering results for the Mushroom dataset with various parameters

    | Clusters | CVC/10⁻¹, ω=1 | ω=2 | ω=3 | ω=4 | ω=5 | ω=6 | ω=7 | ω=8 | CU/10⁻¹ | CDCS/10⁹ | IE/10⁻¹ | NMI/10⁻¹ |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 2 | 1.130 | 1.598 | 1.793 | 1.8999 | 1.967 | 2.013 | 2.046 | 2.072 | 5.5295 | 5.4395 | 7.2900 | 1.513 |
    | 3 | 1.066 | 1.847 | 2.218 | 2.4306 | 2.568 | 2.664 | 2.734 | 2.788 | 8.3726 | 2.9593 | 4.3948 | 1.568 |
    | 4 | 0.852 | 1.704 | 2.147 | 2.4094 | 2.582 | 2.704 | 2.795 | 2.865 | 7.5823 | 3.3869 | 3.1450 | 0.462 |
    | 5 | 0.847 | 1.895 | 2.477 | 2.8330 | 3.070 | 3.240 | 3.366 | 3.464 | 8.1918 | 1.8485 | 2.2165 | 2.307 |
    | 6 | 0.726 | 1.779 | 2.397 | 2.7836 | 3.044 | 3.232 | 3.373 | 3.482 | 7.5752 | 1.7727 | 1.8483 | 2.100 |
    | 7 | 0.640 | 1.692 | 2.341 | 2.7526 | 3.034 | 3.237 | 3.391 | 3.511 | 7.2510 | 1.0476 | 1.4268 | 2.145 |
    | 8 | 0.596 | 1.685 | 2.383 | 2.7980 | 3.144 | 3.370 | 3.541 | 3.675 | 6.6447 | 1.1869 | 1.2363 | 1.374 |
    | 9 | 0.519 | 1.556 | 2.244 | 2.6950 | 3.008 | 3.237 | 3.410 | 3.547 | 6.1161 | 0.9955 | 1.0594 | 0.964 |
    | 10 | 0.480 | 1.517 | 2.227 | 2.6975 | 3.027 | 3.268 | 3.452 | 3.597 | 5.6701 | 0.8328 | 0.9542 | 1.708 |
    | 11 | 0.437 | 1.450 | 2.162 | 2.6405 | 2.977 | 3.225 | 3.414 | 3.563 | 5.3697 | 0.5910 | 0.8196 | 1.634 |
    | 12 | 0.414 | 1.436 | 2.172 | 2.6723 | 3.026 | 3.287 | 3.487 | 3.646 | 5.0980 | 0.4543 | 0.7651 | 1.190 |
    | 13 | 0.406 | 1.465 | 2.247 | 2.7823 | 3.163 | 3.445 | 3.662 | 3.834 | 5.0402 | 0.4729 | 0.6505 | 1.547 |
    | 14 | 0.381 | 1.424 | 2.211 | 2.7545 | 3.143 | 3.432 | 3.655 | 3.831 | 4.7980 | 0.3362 | 0.5851 | 1.135 |
    | 15 | 0.354 | 1.369 | 2.150 | 2.6946 | 3.085 | 3.377 | 3.602 | 3.780 | 4.4620 | 0.3761 | 0.5321 | 1.276 |
    | 16 | 0.343 | 1.373 | 2.179 | 2.7450 | 3.153 | 3.458 | 3.695 | 3.882 | 4.2832 | 0.2973 | 0.4927 | 1.077 |
    | 17 | 0.335 | 1.381 | 2.215 | 2.8051 | 3.232 | 3.552 | 3.800 | 3.997 | 4.3443 | 0.1894 | 0.4490 | 1.681 |
    | 18 | 0.311 | 1.320 | 2.137 | 2.7188 | 3.142 | 3.459 | 3.706 | 3.902 | 3.9230 | 0.3621 | 0.4277 | 1.584 |
    | 19 | 0.309 | 1.347 | 2.200 | 2.8112 | 3.257 | 3.593 | 3.854 | 4.062 | 4.0827 | 0.1911 | 0.3681 | 1.429 |
    | 20 | 0.287 | 1.285 | 2.118 | 2.7185 | 3.158 | 3.489 | 3.747 | 3.953 | 3.7305 | 0.1268 | 0.3684 | 1.715 |
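The external index used in Tables 5–7, normalized mutual information, can be sketched in a few lines. This version uses the common sqrt(H(A)·H(B)) normalization, which is an assumption; the paper's exact variant may normalize differently.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same
    objects: NMI = I(A;B) / sqrt(H(A) * H(B)). A minimal sketch of the
    external index, not necessarily the paper's exact normalization."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    # Mutual information from the joint and marginal label counts.
    mi = sum((nij / n) * math.log(n * nij / (ca[a] * cb[b]))
             for (a, b), nij in cab.items())
    ha = -sum((c / n) * math.log(c / n) for c in ca.values())
    hb = -sum((c / n) * math.log(c / n) for c in cb.values())
    if ha == 0 or hb == 0:
        return 0.0
    return mi / math.sqrt(ha * hb)

# Identical partitions (up to label renaming) give NMI = 1;
# independent partitions give NMI near 0.
print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 4))  # 1.0
```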
    <th id="5nh9l"></th><strike id="5nh9l"></strike><th id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"></th><strike id="5nh9l"></strike>
    <progress id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"><noframes id="5nh9l">
    <th id="5nh9l"></th> <strike id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"></span>
    <progress id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"></span><strike id="5nh9l"><noframes id="5nh9l"><strike id="5nh9l"></strike>
    <span id="5nh9l"><noframes id="5nh9l">
    <span id="5nh9l"><noframes id="5nh9l">
    <span id="5nh9l"></span><span id="5nh9l"><video id="5nh9l"></video></span>
    <th id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"></th>
    <progress id="5nh9l"><noframes id="5nh9l">
    259luxu-164
Publication history
  • Received: 2018-04-18
  • Published: 2019-05-01
