Heterogeneous Ensemble with Combined Dimensionality Reduction for Social Spam Detection
Date
2021
Publisher
iJIM ‒ Vol. 15, No. 17, 2021
Abstract
Spamming, the spreading of malicious or scam content on a network, is one of
the challenging problems in social networks; it often leads to a large loss in
the value of real-time social network services, compromises user and system
reputation, and jeopardizes users' trust in the system.
Existing spam detection methods still suffer from misclassification caused by
redundant and irrelevant features, a consequence of high dimensionality in the
dataset. This study presents a novel framework based on a heterogeneous ensemble
method and a hybrid dimensionality reduction technique for spam detection in
micro-blogging social networks. A hybrid of Information Gain (IG) and Principal
Component Analysis (PCA) was implemented as the dimensionality reduction step to
select important features, and a heterogeneous ensemble of Naïve Bayes (NB),
K-Nearest Neighbor (KNN), Logistic Regression (LR), and Repeated Incremental
Pruning to Produce Error Reduction (RIPPER) classifiers, combined by Average of
Probabilities (AOP), was used for spam detection. To empirically investigate its
performance, the proposed framework was applied to the MPI_SWS and SAC’13 Tip
spam datasets, and the developed models were evaluated on accuracy, precision,
recall, F-measure, and area under the curve (AUC). In the experimental results,
the proposed framework (Ensemble + IG + PCA) outperformed the other methods
tested on the studied spam datasets.
Specifically, the proposed framework had an average accuracy of 87.5%, an
average precision of 0.877, an average recall of 0.845, an average F-measure of
0.872, and an average AUC of 0.943. The proposed framework also performed better
than some existing approaches.
Consequently, this study has shown that addressing high dimensionality in spam
datasets, in this case with a hybrid of IG and PCA, together with a
heterogeneous ensemble method can produce a more effective model for detecting
spam content.
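
Two of the building blocks named in the abstract, Information Gain scoring and the Average of Probabilities combination rule, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names, the toy features (`has_link`, `is_long`), and the four probability vectors are all hypothetical, the PCA step is omitted, and the base classifiers are stood in for by hand-written probability outputs rather than trained NB, KNN, LR, and RIPPER models.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG of one discrete feature: H(labels) - H(labels | feature)."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def average_of_probabilities(per_classifier_probs):
    """Average the class-probability vectors produced by several classifiers."""
    n = len(per_classifier_probs)
    classes = per_classifier_probs[0].keys()
    return {c: sum(p[c] for p in per_classifier_probs) / n for c in classes}


# Toy data (hypothetical): four messages, two binary features.
labels = ["spam", "spam", "ham", "ham"]
has_link = [1, 1, 0, 0]  # separates the two classes perfectly
is_long = [1, 0, 1, 0]   # carries no information about the class

ig_link = information_gain(has_link, labels)
ig_long = information_gain(is_long, labels)

# Hypothetical outputs of four base classifiers for a single message.
probs = [
    {"spam": 0.9, "ham": 0.1},
    {"spam": 0.6, "ham": 0.4},
    {"spam": 0.4, "ham": 0.6},
    {"spam": 0.7, "ham": 0.3},
]
avg = average_of_probabilities(probs)
prediction = max(avg, key=avg.get)  # class with the highest mean probability
```

In this toy case a feature ranked by IG would keep `has_link` and drop `is_long`, and the averaged probabilities favor the spam class even though one of the four stand-in classifiers disagrees, which is the intuition behind combining heterogeneous models.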