SMS Spam Detection

We have a collection of spam messages gathered from the Grumbletext website, where users post the spam messages they receive. Additionally, we have randomly selected legitimate messages, which were collected for research purposes by the Department of Computer Science at the National University of Singapore. [1]

Exploratory Data Analysis

Let's see some sample messages:

Nah I don't think he goes to usf, he lives around here though
Even my brother is not like to speak with me.
They treat me like aids patent.
I'm gonna be home soon and I don't want to talk about this stuff anymore tonight, k?
I've cried enough today
I HAVE A DATE ON SUNDAY WITH WILL!!
Oh k...I'm watching here :)

Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question (std txt rate) T&C's apply 08452810075 over 18's
WINNER!! As a valued network customer, you have been selected to receive a £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
SIX chances to win CASH! From 100 to 20,000 pounds. Text CSH11 and send to 87575. Cost 150p/day, 6 days, 16+ T&C's apply. Reply HL for info.
URGENT! You have won a 1-week FREE membership in our £100,000 Prize Jackpot! Text the word: CLAIM to No: 81010. T&C's www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18
Thanks for your subscription to Ringtone UK. Your mobile will be charged £5/month. Please confirm by replying YES or NO. If you reply NO, you will not be charged.

We can see that the legitimate (Ham) messages come from casual conversations, while the spam messages try to lure the recipient with prizes, urgent calls to action, and paid subscriptions.

Model

Our approach is to vectorize the messages: each unique word is represented as a binary column, as shown in the next figure.
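To make the binary-column idea concrete, here is a minimal sketch in plain Python. It assumes lowercase, whitespace-based tokenization, which may differ from the tokenizer actually used for the results in this post; the helper names `build_vocabulary` and `vectorize` are illustrative, not taken from any library.

```python
def build_vocabulary(messages):
    """Map each unique lowercase token to a column index, in sorted order."""
    tokens = {tok for msg in messages for tok in msg.lower().split()}
    return {tok: i for i, tok in enumerate(sorted(tokens))}


def vectorize(message, vocabulary):
    """Return a binary row: 1 if the word occurs in the message, else 0."""
    row = [0] * len(vocabulary)
    for tok in message.lower().split():
        if tok in vocabulary:  # words not seen when building the vocabulary are ignored
            row[vocabulary[tok]] = 1
    return row


# Toy messages, made up for illustration (not taken from the dataset)
messages = ["free entry to win", "see you soon", "win cash free"]
vocabulary = build_vocabulary(messages)
matrix = [vectorize(m, vocabulary) for m in messages]

for message, row in zip(messages, matrix):
    print(row, message)
```

Each row has one column per vocabulary word, and a repeated word still produces a single 1, since the encoding records presence rather than counts.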

This enables us to treat each word as a numeric feature, allowing us to use Logistic Regression and Random Forest for classification.
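As a rough illustration of how binary word features feed a classifier, the sketch below fits a tiny logistic regression from scratch with batch gradient descent. It is not the implementation behind the results in this post (which presumably come from a library), it does not cover Random Forest, and the toy features, labels, and hyperparameters (`epochs`, `lr`) are assumptions chosen only to keep the example small.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, labels, epochs=500, lr=0.5):
    """Fit weights and bias with plain batch gradient descent on log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(rows, labels):
            pred = sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
            error = pred - label  # gradient of log loss w.r.t. the linear score
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(row, weights, bias):
    """Return 1 (spam) when the predicted probability is at least 0.5."""
    score = bias + sum(w * x for w, x in zip(weights, row))
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy binary features; the columns stand for the words "free", "win", "home", "soon"
rows = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

weights, bias = train_logistic_regression(rows, labels)
predictions = [predict(r, weights, bias) for r in rows]
print(predictions)
```

After training, words that appear only in spam rows end up with positive weights and words that appear only in ham rows with negative ones, which is what lets the model score a new message from its word columns.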

The performance of these models is shown in the following figures.

Logistic Regression accuracy: 0.98

Random Forest accuracy: 0.97

In the context of this problem, a false positive occurs when the model classifies a legitimate message as spam. This can be harmful to the sender, whose message never reaches its destination, and it could cause the recipient to miss important or urgent messages. If our goal is to minimize false positives, we may have to allow some spam messages to go through. The Logistic Regression model performed excellently, with zero false positives and an overall accuracy of 0.98. Our Random Forest model also achieved great results, with no false positives and an overall accuracy of 0.97.
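The false-positive count and the accuracy discussed above can both be read off a confusion matrix. The sketch below shows one way to compute them in plain Python; the labels and predictions are invented for illustration and are not the evaluation data behind the reported scores, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives, treating `positive` as the spam class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            # Predicted spam: correct if it really was spam, otherwise a false positive
            counts["tp" if truth == positive else "fp"] += 1
        else:
            # Predicted ham: a missed spam message is a false negative
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Share of all messages that were classified correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Made-up labels: one spam message slips through, no ham message is blocked
y_true = ["ham", "ham", "ham", "spam", "spam", "ham", "spam", "ham"]
y_pred = ["ham", "ham", "ham", "spam", "ham", "ham", "spam", "ham"]

counts = confusion_counts(y_true, y_pred)
print(counts)
print(accuracy(counts))
```

In this toy case the single error is a false negative (spam delivered as ham), which is the kind of mistake we are willing to tolerate in exchange for never blocking a legitimate message.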

We have successfully trained two useful models to classify spam messages: both reach high accuracy without flagging any legitimate message as spam in our evaluation.

Thanks for reading! 🧑‍💻💕