Resampling algorithm for imbalanced data based on spatial neighbor relationships

李睿峰; 李文海; 孫艷麗; 吳陽勇

doi:10.13374/j.issn2095-9389.2020.04.05.002


Resampling algorithm for imbalanced data based on their neighbor relationship

Abstract

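Abstract: To improve classification accuracy on imbalanced data sets, a resampling algorithm based on the spatial neighbor relationships of samples is proposed. The method first evaluates a safety level for each minority-class sample from its spatial neighbor relationships and, guided by that safety level, oversamples with the synthetic minority oversampling technique (SMOTE). It then computes the local density of each majority-class sample from its spatial neighbor relationships and undersamples the regions where majority samples are dense. Together, these two steps balance the data set and control its size to prevent overfitting, so that the two classes are classified in a balanced way. Training and test sets are generated by ten-fold cross-validation; after the training set is resampled, a kernel extreme learning machine is trained as the classifier and validated on the test set. Experimental results on imbalanced UCI data and measured circuit fault diagnosis data show that the proposed method outperforms other resampling algorithms overall.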

Abstract: The classification of imbalanced data has become a significant research issue in many data-intensive applications. The minority samples in such applications usually carry important information that plays a key role in data analysis. At present, two approaches (algorithm improvement and data set reconstruction) are used in machine learning and data mining to address data set imbalance. Data set reconstruction, also known as resampling, modifies the proportion of each class in the training data set without modifying the classification algorithm and has been widely used. However, artificially adding or removing samples inevitably introduces noise and loses original data information, which reduces classification accuracy. Reasonable oversampling and undersampling algorithms are therefore the core of the resampling method. To improve the classification accuracy on imbalanced data sets, a resampling algorithm based on the neighbor relationships in the sample space was proposed. This method first evaluated the safety level of the minority samples according to their spatial neighbor relationships and oversampled them through the synthetic minority oversampling technique (SMOTE), guided by their safety level. Then, the local density of the majority samples was calculated according to their spatial neighbor relationships, and the majority samples in sample-dense areas were undersampled. With these two steps, the data set can be balanced and its size controlled to prevent overfitting, so that the two classes are classified in a balanced way. The training and test sets were generated via 5 × 10-fold cross-validation. After the training set was resampled, a kernel extreme learning machine (KELM) was trained as the classifier and verified on the test set. The experimental results on a UCI imbalanced data set and measured circuit fault diagnosis data show that the proposed method is superior to other resampling algorithms.
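The two resampling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes that the safety level of a minority sample is the share of minority points among its k nearest neighbors, that seed points for SMOTE interpolation are drawn with probability proportional to that safety level, that local density is the inverse mean distance to the k nearest majority neighbors, and that both classes are brought to the midpoint of the two class sizes. The names `pairwise_dist` and `resample` are hypothetical.

```python
import numpy as np


def pairwise_dist(A, B):
    """Euclidean distance between every row of A and every row of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)


def resample(X, y, k=5, seed=0):
    """Balance (X, y); label 1 = minority class, label 0 = majority class."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    min_idx = np.flatnonzero(y == 1)
    X_min, X_maj = X[min_idx], X[y == 0]
    n_min, n_maj = len(X_min), len(X_maj)
    target = (n_min + n_maj) // 2  # assumed per-class size after resampling

    # Step 1a: safety level (assumption) = share of minority points
    # among the k nearest neighbours in the whole data set.
    D = pairwise_dist(X_min, X)
    D[np.arange(n_min), min_idx] = np.inf  # exclude each point itself
    nn_all = np.argsort(D, axis=1)[:, :k]
    safe = (y[nn_all] == 1).mean(axis=1)

    # Step 1b: SMOTE-style interpolation, seeds drawn in proportion to safety.
    n_new = target - n_min
    p = safe / safe.sum() if safe.sum() > 0 else np.full(n_min, 1.0 / n_min)
    D_mm = pairwise_dist(X_min, X_min)
    np.fill_diagonal(D_mm, np.inf)
    k_min = min(k, n_min - 1)
    nn_min = np.argsort(D_mm, axis=1)[:, :k_min]
    seeds = rng.choice(n_min, size=n_new, p=p)
    partners = nn_min[seeds, rng.integers(0, k_min, size=n_new)]
    gaps = rng.random((n_new, 1))
    X_new = X_min[seeds] + gaps * (X_min[partners] - X_min[seeds])

    # Step 2: local density (assumption) = inverse mean distance to the
    # k nearest majority neighbours; keep the sparsest majority points.
    D_maj = pairwise_dist(X_maj, X_maj)
    np.fill_diagonal(D_maj, np.inf)
    k_maj = min(k, n_maj - 1)
    mean_d = np.sort(D_maj, axis=1)[:, :k_maj].mean(axis=1)
    density = 1.0 / (mean_d + 1e-12)
    keep = np.argsort(density)[:target]

    X_out = np.vstack([X_min, X_new, X_maj[keep]])
    y_out = np.concatenate([np.ones(n_min + n_new, dtype=int),
                            np.zeros(len(keep), dtype=int)])
    return X_out, y_out
```

In a cross-validation setting such as the one described above, a function like this would be applied to each training fold only, leaving the test fold untouched. Removing the densest majority points in one pass is a simplification; an iterative scheme that recomputes density after each removal is equally compatible with the description.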
