A new internal clustering validation index for categorical data based on concentration of attribute values
Abstract: Clustering is a main task of data mining, and its purpose is to identify natural structures in a dataset. The results of cluster analysis depend not only on the nature of the data itself but also on prior conditions such as the clustering algorithm, the similarity/dissimilarity measure, and the parameter settings. For data without a clustering structure, clustering results need to be evaluated; for data with a clustering structure, the different results obtained under different algorithms and parameters also need to be further optimized by clustering validation. Moreover, clustering validation is vital to clustering applications, especially when external information is not available: it is applied in algorithm selection, parameter determination, and determination of the number of clusters. Most traditional internal clustering validation indices for numerical data fail to measure categorical data. Categorical data is a common data type whose attribute values are discrete and unordered, and the existing measures for it have limitations in different application circumstances.

In this paper, a new within-cluster similarity based on the concentration of attribute values, called CONC (similarity based on concentration of attribute values), was defined to evaluate the similarity of the objects in a cluster; using the whole dataset's information, it measures how strongly the objects of a cluster concentrate on particular attribute values. Similarly, a new between-cluster dissimilarity based on the discrepancy of characteristic attribute values (strength vectors, SVs), called DCRP (dissimilarity based on discrepancy of SVs), was defined to evaluate the dissimilarity between two clusters. Based on CONC and DCRP, a new internal clustering validation index for categorical data, called CVC (clustering validation based on concentration of attribute values), was proposed. Compared with other indices, CVC has three characteristics: (1) it evaluates the compactness of a cluster using the information of the whole dataset, not only that of the cluster itself; (2) it evaluates the separation between two clusters by several characteristic attribute values, so that effective clustering information is not lost while the negative effects of noise are eliminated; (3) it evaluates compactness and separation without being influenced by the number of objects. Furthermore, UCI (University of California, Irvine) benchmark datasets were used to compare the proposed index with other internal clustering validation indices, namely category utility (CU), categorical data clustering with subjective factors (CDCS), and an information-entropy-based index (IE); the external index normalized mutual information (NMI) was used to assess the effectiveness of these internal indices. According to the experimental results, CVC evaluates clustering results more effectively than the other internal validation indices. In addition, as an internal index, CVC is more practical than the external NMI index, because it can evaluate clustering results without requiring information beyond the dataset itself.
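The external index used to benchmark the internal indices, normalized mutual information, has a standard textbook definition: NMI(A, B) = I(A; B) / sqrt(H(A) * H(B)). The following is a minimal self-contained sketch of that standard formulation (not code from the paper):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information: I(A; B) / sqrt(H(A) * H(B))."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    # Entropies of the two labelings (natural log)
    h_a = -sum(c / n * math.log(c / n) for c in ca.values())
    h_b = -sum(c / n * math.log(c / n) for c in cb.values())
    # Mutual information from the joint and marginal frequencies
    mi = sum(c / n * math.log(c * n / (ca[a] * cb[b]))
             for (a, b), c in cab.items())
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # a constant labeling carries no information
    return mi / math.sqrt(h_a * h_b)
```

A clustering that matches the ground-truth classes up to a relabeling scores 1; a labeling independent of the classes scores near 0, which is why NMI can serve as an external yardstick for the internal indices.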
Figure 2. Distribution of attribute values for six subsets: (a) distribution of subset C1; (b) distribution of subset C2; (c) distribution of subset C3; (d) distribution of subset C4; (e) distribution of subset C5; (f) distribution of subset C6
Table 1. Distribution vector (DV) for six subsets
Attribute value            C1      C2      C3      C4      C5      C6
J                          5       2       6       6       14      140
K                          8       8       5       5       1       10
L                          1       3       5       5       1       10
M                          1       2       5       5       1       10
N                          —       —       0       —       1       10
O                          —       —       —       —       1       10
P                          —       —       —       —       1       10
Number of domain values    4       4       5       4       7       7
Number of objects          15      15      21      21      20      200
ACONC                      23.17   16.5    11.4    0.5     48.29   4828.57
ACONCmax                   112.5   112.5   176.4   220.5   114.29  11428.57
CONC of the attribute      0.2059  0.1467  0.0646  0.0023  0.4225  0.4225
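The formal definitions of ACONC, ACONCmax, and CONC are not part of this excerpt, but the values in Table 1 are consistent with reading ACONC as 2/(m-1) * sum_i (f_i - n/m)^2 for the frequency vector (f_1, ..., f_m) of one attribute over n objects, with ACONCmax = 2n^2/m (the value when all n objects share a single attribute value) and CONC = ACONC / ACONCmax. A sketch under that inferred reading, which reproduces the table's numbers:

```python
def aconc(freqs):
    """Concentration of one attribute, inferred from Table 1:
    2/(m-1) * sum_i (f_i - n/m)^2 over the frequency vector."""
    n, m = sum(freqs), len(freqs)
    mean = n / m
    return 2.0 / (m - 1) * sum((f - mean) ** 2 for f in freqs)

def conc(freqs):
    """CONC of the attribute: ACONC normalized by ACONCmax = 2*n^2/m,
    reached when all n objects concentrate on a single value."""
    n, m = sum(freqs), len(freqs)
    return aconc(freqs) / (2.0 * n * n / m)

# Subset C1 from Table 1: frequencies of J, K, L, M over 15 objects
print(round(aconc([5, 8, 1, 1]), 2), round(conc([5, 8, 1, 1]), 4))
# -> 23.17 0.2059
```

The same functions reproduce the remaining columns of Table 1 (e.g., CONC of C5 and C6 is 0.4225 in both cases, since scaling every frequency by 10 leaves the normalized concentration unchanged).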
Table 2. Characteristics of internal clustering validation indices
Index   Considers overall distribution   Affected by cluster centers   Affected by number of clusters   Affected by data scale   Other characteristics
CVC     Yes                              No                            No                               No                       Can eliminate the effect of noise
CU      No                               No                            No                               No                       Does not consider between-cluster separation
CDCS    No                               Yes                           Yes                              Yes                      Cannot eliminate the effect of noise
IE      No                               No                            Yes                              No                       Cannot handle the uniform effect; does not consider between-cluster separation
F       No                               Yes                           Yes                              —                        Cannot eliminate the effect of noise; does not consider between-cluster separation
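For comparison, the category utility (CU) index listed in Table 2 follows the standard definition CU = (1/K) * sum_k P(C_k) * sum_i sum_j [P(A_i = v_j | C_k)^2 - P(A_i = v_j)^2]. A minimal sketch of that standard formula:

```python
from collections import Counter

def category_utility(clusters):
    """Category utility of a partition of categorical objects.

    clusters: list of clusters, each a list of equal-length tuples
    of categorical attribute values."""
    objects = [obj for cl in clusters for obj in cl]
    n, d, k = len(objects), len(objects[0]), len(clusters)
    # Marginal term sum_j P(A_i = v_j)^2, one value per attribute
    base = [sum((c / n) ** 2
                for c in Counter(o[i] for o in objects).values())
            for i in range(d)]
    total = 0.0
    for cl in clusters:
        nc = len(cl)
        # Within-cluster term minus the marginal term, per attribute
        gain = sum(
            sum((c / nc) ** 2
                for c in Counter(o[i] for o in cl).values()) - base[i]
            for i in range(d))
        total += (nc / n) * gain
    return total / k
```

As Table 2 notes, CU rewards clusters whose attribute values are more predictable than in the dataset overall, but it carries no explicit between-cluster separation term.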
Table 3. Distribution vector for two clusters of X
Cluster   Attribute A1 (value: count)   Attribute A2 (value: count)
C1        K1: 3, L1: 5, M1: 12          K2: 10, L2: 1, M2: 6, N2: 3
C2        K1: 30, L1: 7, M1: 3          K2: 3, L2: 20, M2: 2, N2: 15
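The exact DCRP formula is not reproduced in this excerpt. As an illustration only of why the two clusters in Table 3 separate well, one can compare per-attribute relative-frequency vectors (a stand-in for the strength vectors) with, for example, total variation distance:

```python
def freq_vector(counts):
    """Relative frequencies of one attribute's values within a cluster."""
    n = sum(counts.values())
    return {v: c / n for v, c in counts.items()}

def tv_distance(p, q):
    """Total variation distance between two frequency vectors."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in keys)

# Attribute A1 of clusters C1 and C2, counts taken from Table 3
c1_a1 = {'K1': 3, 'L1': 5, 'M1': 12}
c2_a1 = {'K1': 30, 'L1': 7, 'M1': 3}
```

Here tv_distance(freq_vector(c1_a1), freq_vector(c2_a1)) is 0.6: C1 concentrates on M1 while C2 concentrates on K1, so their characteristic values on A1 differ sharply, which is the kind of discrepancy DCRP is designed to capture.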
Table 4. Datasets and parameters for experiments
Dataset         Objects   Dimensions   Original classes   Missing values   Cluster-number range   CVC ω range
Small Soybean   47        35           4                  No               2~10                   1~8
Chess           3196      36           2                  No               2~20                   1~8
Mushroom        8124      21           2                  Yes              2~20                   1~8
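Tables 5–7 below report index values averaged over clustering runs across the parameter ranges above; in practice, an internal index drives model selection by picking the parameter setting that scores best. The following is a skeleton of such a sweep with stand-in clustering and index functions (the paper's actual algorithm and the CVC formula are not reproduced here):

```python
def sweep(data, cluster_fn, index_fn, k_range):
    """Cluster the data once per candidate k, score each partition
    with an internal index, and return the best-scoring k
    (higher = better)."""
    scores = {k: index_fn(data, cluster_fn(data, k)) for k in k_range}
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k]

# Stand-ins for illustration only (not the paper's algorithm or index):
def modulo_clustering(data, k):
    return [i % k for i in range(len(data))]

def dummy_index(data, labels):
    return -abs(len(set(labels)) - 3)  # pretends k = 3 is optimal
```

With real components, cluster_fn would be a categorical clustering algorithm (e.g., k-modes) and index_fn an internal index such as CVC, CU, CDCS, or IE.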
Table 5. Average of clustering results for dataset Small Soybean with various parameters
Clusters  -------------------------- CVC/10^-1 --------------------------  CU      CDCS/10^4  IE/10^-2  NMI/10^-1
          ω=1     ω=2     ω=3     ω=4     ω=5     ω=6     ω=7     ω=8
2         1.4496  2.0501  2.3012  2.4380  2.5240  2.5830  2.6260  2.6587   1.3853  1.6332     24.6496   2.8077
3         1.2041  2.0856  2.5047  2.7448  2.8998  3.0079  3.0876  3.1488   1.5081  1.3996     11.2830   4.7695
4         1.2060  2.4120  3.0389  3.4110  3.6559  3.8288  3.9572  4.0564   1.4370  1.1956     6.2084    7.7629
5         1.1923  2.6662  3.4864  3.9869  4.3209  4.5591  4.7372  4.8753   1.4368  0.9450     2.8487    7.5003
6         0.9293  2.2764  3.0686  3.5628  3.8967  4.1365  4.3168  4.4572   1.2951  0.8182     2.2805    6.3634
7         0.8872  2.3472  3.2464  3.8179  4.2080  4.4900  4.7030  4.8693   1.2176  0.7458     1.6786    4.8953
8         0.7921  2.2404  3.1684  3.7678  4.1807  4.4807  4.7082  4.8863   1.1380  0.6320     1.1927    4.0136
9         0.7221  2.1663  3.1243  3.7521  4.1878  4.5060  4.7480  4.9380   1.0724  0.6412     0.9919    4.6183
10        0.6912  2.1857  3.2081  3.8867  4.3610  4.7089  4.9742  5.1830   1.0300  0.5666     0.8601    4.3090
Table 6. Average of clustering results for dataset Chess with various parameters
Clusters  -------------------------- CVC/10^-1 --------------------------  CU/10^-1  CDCS/10^8  IE/10^-2  NMI/10^-2
          ω=1     ω=2     ω=3     ω=4     ω=5     ω=6     ω=7     ω=8
2         0.5404  0.7642  0.8578  0.9088  0.9409  0.9629  0.9789  0.9911   3.0616    9.1075     2.5055    0.2903
3         0.5418  0.9385  1.1271  1.2351  1.3049  1.3536  1.3894  1.4170   3.6258    8.3664     1.6218    1.3391
4         0.4359  0.8718  1.0984  1.2330  1.3215  1.3840  1.4304  1.4663   4.2637    3.9811     1.0895    1.6117
5         0.3779  0.8451  1.1050  1.2636  1.3695  1.4450  1.5015  1.5452   3.4602    3.8622     0.8798    1.2910
6         0.3511  0.8599  1.1592  1.3459  1.4720  1.5626  1.6307  1.6837   3.3915    3.1607     0.6528    1.8689
7         0.3173  0.8395  1.1611  1.3656  1.5051  1.6060  1.6821  1.7416   3.2804    2.6272     0.5062    1.6133
8         0.2812  0.7953  1.1247  1.3375  1.4840  1.5905  1.6713  1.7345   3.0207    2.3623     0.4474    2.2785
9         0.2663  0.7990  1.1523  1.3838  1.5445  1.6619  1.7512  1.8212   2.9194    1.9676     0.3970    2.3652
10        0.2348  0.7425  1.0898  1.3204  1.4815  1.5996  1.6898  1.7607   2.7256    1.7890     0.3700    2.0684
11        0.2214  0.7344  1.0952  1.3374  1.5077  1.6332  1.7292  1.8048   2.5746    1.8519     0.3187    1.6655
12        0.2100  0.7276  1.1009  1.3542  1.5333  1.6658  1.7673  1.8475   2.5005    1.5328     0.2896    2.3751
13        0.2022  0.7290  1.1179  1.3843  1.5737  1.7142  1.8221  1.9075   2.3970    1.3409     0.2637    3.4893
14        0.1871  0.7000  1.0866  1.3539  1.5449  1.6870  1.7964  1.8831   2.2667    1.2827     0.2394    3.0853
15        0.1737  0.6726  1.0563  1.3238  1.5157  1.6589  1.7694  1.8570   2.1884    1.1587     0.2150    2.4495
16        0.1634  0.6537  1.0377  1.3074  1.5018  1.6472  1.7596  1.8489   2.0798    1.2610     0.2086    1.5850
17        0.1592  0.6563  1.0523  1.3326  1.5354  1.6874  1.8052  1.8989   2.0524    0.9038     0.1913    2.1058
18        0.1520  0.6448  1.0438  1.3281  1.5346  1.6898  1.8102  1.9061   1.9830    0.8960     0.1743    3.1087
19        0.1457  0.6349  1.0371  1.3255  1.5358  1.6942  1.8172  1.9153   1.8953    0.7658     0.1625    2.2609
20        0.1438  0.6431  1.0596  1.3600  1.5798  1.7457  1.8747  1.9777   1.8364    0.7778     0.1660    3.8407
Table 7. Average of clustering results for dataset Mushroom with various parameters
Clusters  -------------------------- CVC/10^-1 --------------------------  CU/10^-1  CDCS/10^9  IE/10^-1  NMI/10^-1
          ω=1    ω=2    ω=3    ω=4     ω=5    ω=6    ω=7    ω=8
2         1.130  1.598  1.793  1.8999  1.967  2.013  2.046  2.072          5.5295    5.4395     7.2900    1.513
3         1.066  1.847  2.218  2.4306  2.568  2.664  2.734  2.788          8.3726    2.9593     4.3948    1.568
4         0.852  1.704  2.147  2.4094  2.582  2.704  2.795  2.865          7.5823    3.3869     3.1450    0.462
5         0.847  1.895  2.477  2.8330  3.070  3.240  3.366  3.464          8.1918    1.8485     2.2165    2.307
6         0.726  1.779  2.397  2.7836  3.044  3.232  3.373  3.482          7.5752    1.7727     1.8483    2.100
7         0.640  1.692  2.341  2.7526  3.034  3.237  3.391  3.511          7.2510    1.0476     1.4268    2.145
8         0.596  1.685  2.383  2.7980  3.144  3.370  3.541  3.675          6.6447    1.1869     1.2363    1.374
9         0.519  1.556  2.244  2.6950  3.008  3.237  3.410  3.547          6.1161    0.9955     1.0594    0.964
10        0.480  1.517  2.227  2.6975  3.027  3.268  3.452  3.597          5.6701    0.8328     0.9542    1.708
11        0.437  1.450  2.162  2.6405  2.977  3.225  3.414  3.563          5.3697    0.5910     0.8196    1.634
12        0.414  1.436  2.172  2.6723  3.026  3.287  3.487  3.646          5.0980    0.4543     0.7651    1.190
13        0.406  1.465  2.247  2.7823  3.163  3.445  3.662  3.834          5.0402    0.4729     0.6505    1.547
14        0.381  1.424  2.211  2.7545  3.143  3.432  3.655  3.831          4.7980    0.3362     0.5851    1.135
15        0.354  1.369  2.150  2.6946  3.085  3.377  3.602  3.780          4.4620    0.3761     0.5321    1.276
16        0.343  1.373  2.179  2.7450  3.153  3.458  3.695  3.882          4.2832    0.2973     0.4927    1.077
17        0.335  1.381  2.215  2.8051  3.232  3.552  3.800  3.997          4.3443    0.1894     0.4490    1.681
18        0.311  1.320  2.137  2.7188  3.142  3.459  3.706  3.902          3.9230    0.3621     0.4277    1.584
19        0.309  1.347  2.200  2.8112  3.257  3.593  3.854  4.062          4.0827    0.1911     0.3681    1.429
20        0.287  1.285  2.118  2.7185  3.158  3.489  3.747  3.953          3.7305    0.1268     0.3684    1.715