HABOS clustering algorithm for categorical data
Abstract: The clustering algorithm based on sparse feature vectors for categorical attributes (CABOSFV_C) is an efficient clustering method for high-dimensional categorical data. It measures distance with the sparse feature dissimilarity (SFD) of a set and compresses the data with sparse feature vectors. Its clustering quality, however, depends on the SFD upper-limit parameter, and there is no clear guidance for choosing that parameter. To address this sensitivity, this paper proposes a heuristic hierarchical clustering algorithm of categorical data based on sparse feature dissimilarity (HABOS). Building on agglomerative hierarchical clustering and constrained by an upper limit on the number of clusters, HABOS evaluates candidate results heuristically with a new internal clustering validation index based on sparse feature dissimilarity (CVISFD) and thereby selects the clustering level automatically. Experiments on three UCI benchmark data sets, comparing HABOS with the traditional algorithms, show that HABOS improves clustering accuracy and stability.
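The procedure outlined in the abstract (agglomerative merging, an upper limit on the number of clusters, and an internal validity index that picks the level) can be illustrated with a minimal sketch. The abstract does not define SFD or CVISFD, so the functions `set_dissimilarity` and `validity` below are hypothetical stand-ins chosen only to make the example runnable, and the names `select_level` and `k_max` are likewise assumptions. This is not the authors' implementation.

```python
from itertools import combinations


def set_dissimilarity(rows, members):
    """Stand-in for SFD (assumption): the fraction of attributes on which
    the objects in `members` do not all share the same value."""
    n_attrs = len(rows[0])
    differing = sum(
        1 for a in range(n_attrs) if len({rows[i][a] for i in members}) > 1
    )
    return differing / n_attrs


def validity(rows, clusters):
    """Stand-in for CVISFD (assumption): separation minus compactness,
    higher is better. Separation is the smallest dissimilarity of any two
    clusters taken together; compactness is the largest dissimilarity
    inside a single cluster."""
    compactness = max(set_dissimilarity(rows, c) for c in clusters)
    separation = min(
        set_dissimilarity(rows, a + b) for a, b in combinations(clusters, 2)
    )
    return separation - compactness


def select_level(rows, k_max):
    """Merge bottom-up and return the partition with the best validity
    score among the levels that have between 2 and k_max clusters."""
    clusters = [[i] for i in range(len(rows))]
    best_score, best = float("-inf"), None
    while True:
        if 2 <= len(clusters) <= k_max:
            score = validity(rows, clusters)
            if score > best_score:
                best_score, best = score, [sorted(c) for c in clusters]
        if len(clusters) <= 2:
            break
        # Merge the pair of clusters whose union is least dissimilar.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: set_dissimilarity(
                rows, clusters[p[0]] + clusters[p[1]]
            ),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return best if best is not None else clusters


# Toy categorical data (made up for illustration): two obvious groups.
data = [
    ("a", "x", "p"),
    ("a", "x", "q"),
    ("a", "x", "p"),
    ("b", "y", "r"),
    ("b", "y", "s"),
    ("b", "y", "r"),
]
print(sorted(select_level(data, k_max=4)))  # [[0, 1, 2], [3, 4, 5]]
```

The point of the sketch is the control flow: the caller supplies only an upper limit on the number of clusters, and the validity index, not a dissimilarity threshold, decides where in the hierarchy to stop.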
Key words:
- data mining
- clustering algorithms
- categorical data
- attributes