TY - JOUR
T1 - Efficient hybrid oversampling and intelligent undersampling for imbalanced big data classification
AU - Vairetti, Carla
AU - Assadi, José Luis
AU - Maldonado, Sebastián
N1 - Publisher Copyright: © 2024 Elsevier Ltd
PY - 2024/7/15
Y1 - 2024/7/15
N2 - Imbalanced classification is a well-known challenge faced by many real-world applications. This issue occurs when the distribution of the target variable is skewed, leading to a prediction bias toward the majority class. With the arrival of the Big Data era, there is a pressing need for efficient solutions to solve this problem. In this work, we present a novel resampling method called SMOTENN that combines intelligent undersampling and oversampling using a MapReduce framework. Both procedures are performed on the same pass over the data, conferring efficiency to the technique. The SMOTENN method is complemented with an efficient implementation of the neighborhoods related to the minority samples. Our experimental results show the virtues of this approach, outperforming alternative resampling techniques for small- and medium-sized datasets while achieving positive results on large datasets with reduced running times.
KW - Big data
KW - Imbalanced classification
KW - Intelligent undersampling
KW - MapReduce
KW - SMOTE
UR - http://www.scopus.com/inward/record.url?scp=85182916241&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2024.123149
DO - 10.1016/j.eswa.2024.123149
M3 - Article
AN - SCOPUS:85182916241
SN - 0957-4174
VL - 246
SP - 1
EP - 11
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 123149
ER -