Benchmark of Data Preprocessing Methods for Imbalanced Classification
Abstract
Severe class imbalance poses significant challenges
for machine learning in cybersecurity. Various
preprocessing methods, including oversampling,
undersampling, and hybrid approaches, have been
developed to enhance the predictive performance of
classifiers. However, a comprehensive and unbiased
benchmark comparing these methods across diverse
cybersecurity problems is lacking. This paper
presents a benchmark of 16 preprocessing techniques
evaluated on six cybersecurity datasets, alongside 17
public imbalanced datasets from other domains. We
test these methods under multiple hyperparameter
configurations and use an AutoML system to
reduce biases from specific hyperparameters or
classifiers. Our evaluation focuses on performance
measures that effectively reflect real-world
applicability in cybersecurity. Our findings show
that: 1) effective data preprocessing methods often
improve classification performance; 2) a baseline
approach with no preprocessing outperformed many of
the methods; 3) oversampling techniques generally
yield better results than undersampling; and 4) the
standard SMOTE algorithm delivered the most
significant performance gains, while more complex
methods often provided only incremental improvements
at the cost of reduced computational efficiency.
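
To illustrate the kind of comparison the benchmark performs, the sketch below contrasts a no-preprocessing baseline against standard SMOTE oversampling and scores both with the F1 measure. It is a minimal example built on scikit-learn and imbalanced-learn; the synthetic dataset, the random-forest classifier, and the metric are placeholder assumptions, not the actual datasets, classifiers, or performance measures used in the benchmark.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic binary problem with a roughly 99:1 class ratio
# (a stand-in for a real imbalanced cybersecurity dataset).
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.99, 0.01], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Baseline: train directly on the imbalanced data (no preprocessing).
baseline = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("baseline F1:", f1_score(y_test, baseline.predict(X_test)))

# SMOTE: oversample the minority class in the training split only,
# then train the same classifier on the rebalanced data.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
smote_model = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print("SMOTE F1:", f1_score(y_test, smote_model.predict(X_test)))
```

Note that resampling is applied only to the training split; rebalancing the test set would distort the evaluation, since deployed classifiers face the original class distribution.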