
Imbalanced data ensemble classification based on cluster-based under-sampling algorithm

Abstract: Most traditional classification algorithms assume that the data set is balanced and pursue overall classification accuracy. Real-world data sets, however, are often imbalanced, so traditional classifiers tend to produce high error rates on minority-class samples. Existing methods for improving classification on imbalanced data fall into two main categories: data-level methods, which add minority-class samples by over-sampling or remove majority-class samples by under-sampling, and algorithm-level methods, which modify the classifier itself. This paper combines the two ideas, building on an existing cluster-based under-sampling method and on ensemble learning. First, cluster-based under-sampling is applied in the data-processing stage to form a balanced data set; the new data set is then trained with the AdaBoost ensemble algorithm. During ensemble construction, weights are introduced to distinguish the contributions of minority-class and majority-class samples when computing the ensemble error rate, which makes the algorithm focus more on the minority class and improves its classification accuracy.
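The data-processing stage described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a small k-means routine written inline for self-containment and keeps, from every majority-class cluster, the samples nearest the centroid in proportion to cluster size, so the reduced majority set still covers the original distribution. The function names (`cluster_undersample`, `balance`) are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """A small k-means, included only to keep the sketch self-contained."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute centroids.
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return labels, centers

def cluster_undersample(X_maj, n_keep, k=4, seed=0):
    """Cluster the majority class and keep n_keep samples in total,
    drawing from every cluster in proportion to its size."""
    labels, centers = kmeans(X_maj, k, seed=seed)
    sizes = np.bincount(labels, minlength=k)
    # Largest-remainder allocation: per-cluster quotas sum exactly to n_keep.
    frac = n_keep * sizes / sizes.sum()
    quotas = np.floor(frac).astype(int)
    order = np.argsort(-(frac - quotas))
    quotas[order[: n_keep - quotas.sum()]] += 1
    keep = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        # Within each cluster, keep the samples nearest the centroid.
        d = ((X_maj[idx] - centers[j]) ** 2).sum(-1)
        keep.append(idx[np.argsort(d)[: quotas[j]]])
    return X_maj[np.concatenate(keep)]

def balance(X, y, majority=0, minority=1, k=4, seed=0):
    """Build a balanced training set: all minority samples plus an
    equally sized, cluster-based subsample of the majority class."""
    X_min = X[y == minority]
    X_maj = cluster_undersample(X[y == majority], len(X_min), k=k, seed=seed)
    X_bal = np.vstack([X_maj, X_min])
    y_bal = np.r_[np.full(len(X_maj), majority), np.full(len(X_min), minority)]
    return X_bal, y_bal
```

The balanced set would then be passed to an AdaBoost-style learner; in the paper's variant, the per-round error computation additionally weights minority-class and majority-class mistakes differently so that the ensemble concentrates on the minority class. That class-aware weighting is not reproduced in this sketch.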

     
