SMS Spam Detection

We have a collection of spam messages gathered from the Grumbletext website, where users publicly post spam messages. We also have randomly selected legitimate messages, which were collected for research purposes by the Department of Computer Science at the National University of Singapore. [1]

Exploratory Data Analysis

Let's see some sample messages:

We can see that the legitimate (Ham) messages come from casual conversations, while the spam messages attempt to fool the recipient.

Model

Our approach is to vectorize the messages: each unique word becomes a binary column that indicates whether the word appears in a given message, as shown in the next figure.
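To make the idea concrete, here is a minimal sketch of binary vectorization in plain Python. The sample messages and the helper names (`tokenize`, `build_vocabulary`, `vectorize`) are illustrative assumptions, not taken from the dataset or from the code behind this post. In practice a library class such as scikit-learn's `CountVectorizer` with `binary=True` does the same job.

```python
import re


def tokenize(message):
    """Lowercase a message and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", message.lower())


def build_vocabulary(messages):
    """Map each unique word to a column index (sorted for reproducibility)."""
    words = sorted({word for message in messages for word in tokenize(message)})
    return {word: index for index, word in enumerate(words)}


def vectorize(message, vocabulary):
    """Return one row: 1 if the word occurs in the message, else 0."""
    row = [0] * len(vocabulary)
    for word in tokenize(message):
        if word in vocabulary:
            row[vocabulary[word]] = 1
    return row


# Made-up messages for illustration only.
messages = ["Free prize! Claim your free prize now", "See you at lunch"]
vocabulary = build_vocabulary(messages)
matrix = [vectorize(message, vocabulary) for message in messages]

print(sorted(vocabulary))
for row in matrix:
    print(row)
```

Note that a word repeated within a message ("free" appears twice in the first one) still produces a single 1, which is what distinguishes a binary encoding from a word count.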

This enables us to treat each word as a numeric feature, allowing us to use Logistic Regression and Random Forest for classification.
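As a rough sketch of how both classifiers can be trained on binary word features, the snippet below uses scikit-learn pipelines. The toy messages, the labels, and the hyperparameters (`max_iter=1000`, `n_estimators=100`, `random_state=0`) are assumptions chosen for illustration. They are not the settings or data used to obtain the results in this post.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy data for illustration; 1 = spam, 0 = ham.
texts = [
    "win a free prize now",
    "free cash claim now",
    "urgent claim your free prize",
    "see you at lunch",
    "are you coming home tonight",
    "call me when you get home",
]
labels = [1, 1, 1, 0, 0, 0]

# Each pipeline first turns text into binary word columns, then fits a classifier.
models = {
    "logistic_regression": make_pipeline(
        CountVectorizer(binary=True),
        LogisticRegression(max_iter=1000),
    ),
    "random_forest": make_pipeline(
        CountVectorizer(binary=True),
        RandomForestClassifier(n_estimators=100, random_state=0),
    ),
}

new_messages = ["claim your free prize", "see you tonight"]
predictions = {}
for name, model in models.items():
    model.fit(texts, labels)
    predictions[name] = model.predict(new_messages).tolist()
    print(name, predictions[name])
```

Wrapping the vectorizer and the classifier in one pipeline ensures that new messages are encoded with the vocabulary learned during training. On real data, the models should be evaluated on messages held out from training.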

The performance of these models is shown in the following figures.

Logistic Regression accuracy: 0.98

Random Forest accuracy: 0.97

In this problem, a false positive occurs when the model classifies a legitimate message as spam. This can be harmful to the sender, and to the recipient, who may miss an important or urgent message. If our goal is to minimize false positives, we may have to let some spam messages through. The Logistic Regression model performed very well, with zero false positives and an overall accuracy of 0.98. Our Random Forest model also achieved strong results, with no false positives and an overall accuracy of 0.97.
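The trade-off between false positives and missed spam can be illustrated with a small sketch. It assumes the model outputs a spam probability for each message and that we choose the decision threshold ourselves. The function name `count_errors` and the scores below are hypothetical and are not results from this post.

```python
def count_errors(true_labels, spam_probabilities, threshold=0.5):
    """Count false positives (ham flagged as spam) and
    false negatives (spam let through) at a given threshold."""
    false_positives = 0
    false_negatives = 0
    for label, probability in zip(true_labels, spam_probabilities):
        predicted_spam = probability >= threshold
        if predicted_spam and label == 0:
            false_positives += 1
        elif not predicted_spam and label == 1:
            false_negatives += 1
    return false_positives, false_negatives


# Hypothetical scores for illustration; 1 = spam, 0 = ham.
true_labels = [0, 0, 0, 1, 1, 1]
spam_probabilities = [0.05, 0.20, 0.60, 0.55, 0.80, 0.95]

for threshold in (0.5, 0.7):
    fp, fn = count_errors(true_labels, spam_probabilities, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
# threshold=0.5: false positives=1, false negatives=0
# threshold=0.7: false positives=0, false negatives=1
```

Raising the threshold removes the false positive in this toy example at the cost of letting one spam message through. With scikit-learn classifiers, such probabilities are available from `predict_proba`.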

We have successfully trained two models that classify spam messages with high accuracy and no false positives.

Thanks for reading! 🧑‍💻💕