Empirical Analysis of Data Sampling-Based Ensemble Methods in Software Defect Prediction

Abstract
This study investigates the use of data sampling and ensemble techniques to alleviate the class imbalance problem in software defect prediction (SDP). Specifically, it examines the effect of data sampling techniques on the performance of ensemble methods. The experiments were conducted using software defect datasets from the NASA software archives. Five data sampling methods, comprising three over-sampling techniques (SMOTE, ADASYN, and ROS) and two under-sampling techniques (RUS and NearMiss), were combined with bagging and boosting ensemble methods based on Naïve Bayes (NB) and Decision Tree (DT) classifiers. The predictive performance of the developed models was assessed using the area under the curve (AUC) and the Matthews correlation coefficient (MCC). The experimental findings show that applying data sampling methods further enhanced the predictive performance of the ensemble methods studied. Specifically, BoostedDT on the ROS-balanced datasets recorded the highest average AUC (0.995) and MCC (0.918) values. Aside from the NearMiss method, which worked best with the Bagging ensemble method, the other data sampling methods studied worked well with the Boosting ensemble technique. In addition, some of the developed models, particularly BoostedDT, showed better prediction performance than existing SDP models. As a result, combining data sampling techniques with ensemble methods may not only improve SDP model prediction performance but also provide a plausible solution to the latent class imbalance issue in SDP processes.
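As a rough illustration of the kind of pipeline the abstract describes, the sketch below balances a training set by random over-sampling (ROS), fits a boosted decision tree, and reports AUC and MCC. It is not the authors' implementation: the synthetic dataset stands in for the NASA defect data, the helper `random_oversample` is a hypothetical hand-written ROS, and the tree depth, number of estimators, split ratio, and seeds are all assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def random_oversample(X, y, seed=0):
    """Hypothetical ROS helper: resample minority rows (with replacement)
    until every class has as many rows as the majority class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]


# Synthetic imbalanced data (about 10% "defective"); an assumption standing in
# for a real defect dataset.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Balance only the training split so the test split keeps its natural imbalance.
X_bal, y_bal = random_oversample(X_train, y_train, seed=0)

# Boosted decision tree; depth and estimator count are illustrative choices.
model = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3), n_estimators=50, random_state=0
)
model.fit(X_bal, y_bal)

# AUC uses predicted probabilities; MCC uses hard class labels.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
mcc = matthews_corrcoef(y_test, model.predict(X_test))
print(f"AUC={auc:.3f}  MCC={mcc:.3f}")
```

Over-sampling is applied after the train/test split here so that duplicated minority rows cannot leak into the evaluation data; whether the study followed the same protocol is not stated in the abstract.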
Citation
Balogun, A. O., Odejide, B. J., Bajeh, A. O., Alanamu, Z. A., Usman-Hamza, F. E., Adeleke, H. O., Mabayoje, M. A., and Yusuff, S. R. (2022). Empirical Analysis of Data Sampling-Based Ensemble Methods in Software Defect Prediction. In: Gervasi, O., Murgante, B., Misra, S., Rocha, A.M.A.C., Garau, C. (eds) Computational Science and Its Applications – ICCSA 2022 Workshops. ICCSA 2022. Lecture Notes in Computer Science, vol 13381, pp. 363–379. Springer, Cham. ISBN: 978-3-031-10547-0. URL: https://doi.org/10.1007/978-3-031-10548-7_27