
High dimensional classifiers in the imbalanced case

by Britta Anker Bak and Jens Ledet Jensen
Thiele Research Reports Number 4 (May 2015)

We consider the binary classification problem in the imbalanced case, where the numbers of samples from the two groups differ. We study the problem in the high dimensional setting, where the number of variables is much larger than the number of samples and where the imbalance leads to a bias in the classification. A theoretical analysis of the independence classifier reveals the origin of the bias, and based on this we suggest two new classifiers that can handle any imbalance ratio. The analytical results are supplemented by a simulation study, in which the suggested classifiers in some respects outperform multiple undersampling. For correlated data we consider the ROAD classifier and suggest a modification of it to handle the bias from imbalanced group sizes.
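
The bias described above can be illustrated with a small simulation. The following is a minimal sketch, not the simulation design or the new classifiers of the report. It assumes independent Gaussian variables with unit variance, a plain independence (diagonal linear discriminant) rule with pooled per-variable variance estimates, and illustrative values for the group sizes, the dimension, and the signal. The function name and all parameters are hypothetical.

```python
import numpy as np


def simulate_error_rates(n_minor=10, n_major=50, p=1000, n_signal=10,
                         delta=1.0, n_test=200, seed=0):
    """Train an independence rule on imbalanced data and return the
    test error rates (minority group, majority group).

    All default values are illustrative assumptions, not taken from the report.
    """
    rng = np.random.default_rng(seed)

    # True means: the groups differ by `delta` in the first `n_signal` variables only.
    mu_minor = np.zeros(p)
    mu_minor[:n_signal] = delta
    mu_major = np.zeros(p)

    # Imbalanced training samples with independent unit-variance noise.
    train_minor = rng.normal(size=(n_minor, p)) + mu_minor
    train_major = rng.normal(size=(n_major, p)) + mu_major

    # Group means and pooled per-variable variance estimates.
    m_minor = train_minor.mean(axis=0)
    m_major = train_major.mean(axis=0)
    ss = ((train_minor - m_minor) ** 2).sum(axis=0) \
        + ((train_major - m_major) ** 2).sum(axis=0)
    s2 = ss / (n_minor + n_major - 2)

    # Independence rule: ignore correlations, weight each variable by 1 / s2.
    weights = (m_minor - m_major) / s2
    midpoint = (m_minor + m_major) / 2

    def predicts_minor(x):
        # A positive score assigns the observation to the minority group.
        return (x - midpoint) @ weights > 0

    # Balanced test set, so both error rates are estimated equally well.
    test_minor = rng.normal(size=(n_test, p)) + mu_minor
    test_major = rng.normal(size=(n_test, p)) + mu_major
    err_minor = float(np.mean(~predicts_minor(test_minor)))
    err_major = float(np.mean(predicts_minor(test_major)))
    return err_minor, err_major


if __name__ == "__main__":
    err_minor, err_major = simulate_error_rates()
    print(f"minority error: {err_minor:.3f}, majority error: {err_major:.3f}")
```

Under these assumptions, the mean of the smaller group is estimated with more noise in each of the many variables, so the squared distance from a new observation to that mean is inflated by roughly p / n_minor compared with p / n_major for the larger group. With many variables and little signal, this difference tends to dominate the true separation, and new observations are assigned mostly to the larger group.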

Format available: PDF (548 KB)