The Synthetic Minority Over-sampling Technique (SMOTE) is a well-known resampling strategy that has been successfully used for dealing with the class-imbalance problem, one of the most challenging pattern recognition tasks of the last two decades. In this work, we claim that SMOTE has an important issue when defining the neighborhood used to create new minority samples: the Euclidean distance may not be suitable in high-dimensional settings. Our hypothesis is that a weighted metric, which does not assume that all features are equally important, could improve performance in the presence of noisy or redundant variables. To this end, we present a novel SMOTE-like method that uses the weighted Minkowski distance to define the neighborhood of each example of the minority class. This methodology leads to a better definition of the neighborhood since it prioritizes the features that are more relevant for the classification task. A complementary advantage of the proposal is that it performs feature selection, since attributes can be discarded when their corresponding weights fall below a given threshold. Our experiments on 42 class-imbalance datasets show the virtues of the proposed SMOTE variant, which achieves the best predictive performance when compared with the traditional SMOTE approach and other recent variants in low- and high-dimensional settings, and handles issues such as class overlap and hubness adequately without increasing the complexity of the method.
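The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`weighted_minkowski`, `smote_weighted`), the parameters, and the way the feature weights `w` are supplied are all assumptions made for illustration, and the abstract does not state how the weights are computed. The sketch assumes non-negative weights and at least two minority samples, drops features whose weight is below `threshold`, finds neighbors with the weighted Minkowski distance, and interpolates as in standard SMOTE.

```python
import numpy as np

def weighted_minkowski(a, B, w, p=2.0):
    """Distance from vector a to each row of B: (sum_j w_j * |a_j - B_ij|^p)^(1/p)."""
    return (w * np.abs(B - a) ** p).sum(axis=1) ** (1.0 / p)

def smote_weighted(X_min, w, n_new, k=5, p=2.0, threshold=0.0, seed=0):
    """Hypothetical SMOTE-like sampler using a weighted Minkowski neighborhood.

    X_min: minority-class samples, shape (n, d), with n >= 2 (assumption).
    w: non-negative feature weights, shape (d,), assumed to be given.
    Returns the synthetic samples (on the kept features) and the feature mask.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    w = np.asarray(w, dtype=float)

    # Feature selection: discard attributes whose weight is below the threshold.
    keep = w >= threshold
    X, wk = X_min[:, keep], w[keep]

    k = min(k, len(X) - 1)  # cannot have more neighbors than other samples
    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        idx = rng.integers(len(X))                 # pick a minority sample
        d = weighted_minkowski(X[idx], X, wk, p)
        d[idx] = np.inf                            # exclude the sample itself
        neighbors = np.argsort(d)[:k]              # k nearest under the weighted metric
        nb = X[rng.choice(neighbors)]
        gap = rng.random()                         # interpolation factor in [0, 1)
        synthetic[i] = X[idx] + gap * (nb - X[idx])
    return synthetic, keep
```

With `p=2` and all weights equal to 1 the neighborhood reduces to the Euclidean one used by the traditional SMOTE, so the weights are the only thing that changes which neighbors are chosen.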
State: Published - Apr 2022
Bibliographical note
Funding Information:
This research was partially funded by ANID, FONDECYT projects 1200221 and 12200007, and by PIA/BASAL AFB180003. It has also been partially supported by the Spanish Ministry of Science and Technology under project PID2020-119478GB-I00, including European Regional Development Funds, and the Andalusian regional project P18-TP-5035. The authors are grateful to the anonymous reviewers who contributed to improving the quality of the original paper.
© 2021 Elsevier Ltd
- Data resampling
- Feature selection
- Imbalanced data classification
- OWA Operators