Density peaks clustering algorithm with circle-division sampling and reverse nearest neighbor optimization

趙嘉; 何超凡; 肖人彬; 曹浩; 樊棠懷

doi:10.13374/j.issn2095-9389.2025.03.30.001

Density peaks clustering algorithm with circle-division sampling and reverse nearest neighbor optimization

Abstract

Density peaks clustering (DPC) is a simple and efficient clustering algorithm that has attracted wide attention because it identifies the clusters in a dataset intuitively and quickly. However, DPC has three weaknesses: its definition of local density ignores density differences between clusters, so cluster centers are easily misidentified; its chain assignment strategy is prone to cascading assignment errors; and it must compute the Euclidean distance between every pair of samples, which gives it high time complexity. To address these issues, this paper proposes a density peaks clustering algorithm with circle-division sampling and reverse nearest neighbor optimization. The algorithm uses circular grid sampling to obtain representatives, which reduces the number of samples that enter the computation and lowers the time overhead. It introduces an approximate K-nearest neighbor strategy to strengthen the link between the representatives and the original samples, limiting the loss of clustering accuracy caused by sampling. It redefines local density using reverse nearest neighbors, which adjusts each sample's local density so that density peaks are identified accurately. Finally, it computes similarity from shared reverse nearest neighbors and assigns representatives according to the resulting similarity matrix, which avoids the cascading errors of the original sample assignment strategy. Experiments were run in three groups: synthetic datasets with complex shapes, real datasets, and larger-scale datasets. The results indicate that the proposed algorithm has a clear clustering advantage on all three groups, with notable improvements in both accuracy and efficiency over DPC and its improved variants.
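For orientation, the three weaknesses named in the abstract can be seen in a minimal sketch of baseline DPC as it is commonly described (cutoff-kernel density, distance to the nearest denser sample, chain assignment). This is not the proposed algorithm, whose sampling and reverse nearest neighbor definitions are not given in the abstract. The function name `dpc_baseline` and the parameters `d_c` (cutoff distance) and `n_clusters` are illustrative assumptions.

```python
import math


def dpc_baseline(points, d_c, n_clusters):
    """Baseline DPC sketch (assumed cutoff kernel); NOT the proposed algorithm."""
    n = len(points)
    # Every pairwise Euclidean distance is computed: the O(n^2) cost.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other samples closer than the cutoff d_c.
    # A single global d_c ignores density differences between clusters.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Samples in order of decreasing density (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n   # distance to the nearest sample ranked denser
    parent = [-1] * n   # index of that sample
    for rank, i in enumerate(order):
        if rank == 0:
            # Densest sample: by convention, use its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j
    # Cluster centers: the n_clusters samples with the largest rho * delta.
    centers = sorted(order, key=lambda i: -(rho[i] * delta[i]))[:n_clusters]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Chain assignment: each remaining sample copies the label of its
    # denser parent, so one wrong label propagates down the chain.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, rho, delta
```

Calling `dpc_baseline(points, d_c, n_clusters)` on a list of coordinate tuples returns one label per sample together with the density and distance values that would normally be inspected on a decision graph.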
