How to Build a Machine Learning-based Email Classifier with Python
Published 4/22/2023 | 5 min read
Email is an essential mode of communication in today's fast-paced world. However, receiving and sorting through a large number of emails can be a daunting task that can take up valuable time.
In this tutorial, you'll learn how to build a machine learning-based email classifier with Python, which can automatically categorize your emails into folders, making it easier for you to manage your inbox.
Understanding the Problem
Before diving into the details of the implementation, let's first understand the problem that we are trying to solve.
In this scenario, we have a large collection of emails that need to be categorized into different folders based on their content. For instance, an email from a friend can be categorized under a folder titled 'personal', whereas an email from a work colleague could fall under 'professional'.
To build the email classifier, we need to have a set of labeled emails that we can use to train a machine learning model. The model can then be used to classify new incoming emails based on the content.
Building the Email Classifier
To build the email classifier, we will use Python with two libraries: the Natural Language Toolkit (NLTK) for text preprocessing and scikit-learn for feature extraction and machine learning.
Here are the steps that we will be following:
- Extracting and Preprocessing the Data
- Feature Extraction
- Training the Model
- Evaluating the Model
- Classifying New Emails
1. Extracting and Preprocessing the Data
We will start by extracting the emails and labels from our dataset. The dataset contains a collection of emails labeled under different categories. We will use Python's built-in email module to parse each file into a message object and take the label from the part of the file name before the first dot.
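import os
import email

def load_emails_from_directory(path):
    """Parse every file in `path` and return (messages, labels)."""
    emails = []
    labels = []
    for filename in os.listdir(path):
        # The label is the part of the file name before the first dot
        label = filename.split(".")[0]
        labels.append(label)
        # Read raw bytes so the email module can handle the encoding itself
        with open(os.path.join(path, filename), "rb") as f:
            email_contents = f.read()
        msg = email.message_from_bytes(email_contents)
        emails.append(msg)
    return emails, labels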
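The loader above returns parsed message objects rather than plain text. As a minimal sketch, the hypothetical helper below (get_body_text is not part of the original tutorial) shows one way to pull the plain-text body out of a message with the standard library's walk() and get_payload(decode=True). It assumes that only text/plain parts matter and that undecodable bytes can be dropped; the sample message is made up for illustration.

```python
import email
from email.message import Message


def get_body_text(msg: Message) -> str:
    """Return the concatenated text/plain parts of a message (hypothetical helper)."""
    chunks = []
    for part in msg.walk():
        # Skip multipart containers and non-plain parts such as HTML
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # bytes, or None for containers
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="ignore"))
        except LookupError:
            # Unknown charset name: fall back to UTF-8
            chunks.append(payload.decode("utf-8", errors="ignore"))
    return "\n".join(chunks)


# Made-up sample message, used only to exercise the helper
raw = (
    b"Subject: Lunch?\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Are you free at noon?\r\n"
)
sample = email.message_from_bytes(raw)
print(get_body_text(sample))
```

Working on the body alone keeps header fields such as routing information out of the vocabulary, which may or may not be what you want for your own mailbox.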
Once we have the emails and their labels, we can preprocess the data to transform the raw text data into a format suitable for machine learning. The preprocessing step includes removing stop words, stemming, and lemmatization.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Needs NLTK data fetched once with nltk.download(): the punkt tokenizer
# models ('punkt', or 'punkt_tab' depending on the NLTK version),
# 'stopwords', and 'wordnet'.
def preprocess(text):
    text = text.lower()
    tokens = word_tokenize(text)
    # Keep alphabetic tokens only (drops punctuation and numbers)
    tokens = [t for t in tokens if t.isalpha()]
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(t) for t in tokens]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return ' '.join(tokens)
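To get a feel for what the stemming step does to tokens, here is a small sketch that runs NLTK's PorterStemmer on its own. The word list is made up for illustration. The Porter stemmer is purely rule-based, so this snippet should run without any downloaded NLTK data, unlike the tokenizer, stop word list, and lemmatizer used in preprocess.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Made-up tokens of the kind that show up in email text
words = ["emails", "running", "meetings"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Note that preprocess stems before it lemmatizes, so the lemmatizer receives stems rather than the original words.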
2. Feature Extraction
Once the data has been preprocessed, we need to transform it into features that can be used by the machine learning model. For this, we will use the bag-of-words (BoW) approach, which represents each email as a vector of word counts over the vocabulary and ignores word order.
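from sklearn.feature_extraction.text import CountVectorizer

# emails holds the parsed messages; str(msg) serializes headers and body to text
texts = [preprocess(str(msg)) for msg in emails]
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)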
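To make the bag-of-words representation concrete, the sketch below vectorizes two made-up one-line emails and prints the learned vocabulary next to the count matrix. The names toy_emails, toy_vectorizer, and toy_features are illustrative only and do not appear elsewhere in this tutorial.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up emails, already lowercased for simplicity
toy_emails = ["meeting tomorrow at noon", "lunch tomorrow"]

toy_vectorizer = CountVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_emails)

# vocabulary_ maps each word to its column index; sort words by that index
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
counts = toy_features.toarray().tolist()

print(vocabulary)
for row in counts:
    print(row)
```

Each row is one email and each column is one vocabulary word, so the word "tomorrow" contributes a count to both rows while "lunch" only appears in the second.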
3. Training the Model
With the features extracted, we can now train the machine learning model. We will be using a Naive Bayes algorithm for classification.
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
clf = MultinomialNB()
clf.fit(X_train, y_train)
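As an alternative packaging (not what the steps above do), scikit-learn's make_pipeline can chain the vectorizer and the classifier so that raw text goes in and labels come out. The sketch below assumes a tiny made-up training set with the two categories from the earlier example; a real mailbox would need far more labeled emails.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data, for illustration only
toy_texts = [
    "dinner with family tonight",
    "happy birthday my friend",
    "quarterly report meeting agenda",
    "project deadline meeting schedule",
]
toy_labels = ["personal", "personal", "professional", "professional"]

# The pipeline fits the vectorizer and the classifier in one call
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(toy_texts, toy_labels)

predictions = list(model.predict(["birthday dinner", "meeting about the project report"]))
print(predictions)
```

Keeping both steps in one object makes it harder to accidentally fit a second vectorizer on new data, which would produce feature columns that no longer match the trained model.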
4. Evaluating the Model
With the machine learning model trained, we can now evaluate its performance on our testing data.
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

y_pred = clf.predict(X_test)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))
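To help read these numbers, here is a small worked example on made-up labels. In scikit-learn's confusion matrix, rows correspond to the true class and columns to the predicted class, in the order given by the labels argument. The lists y_true and y_hat below are invented for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up ground truth and predictions for five emails
y_true = ["personal", "personal", "professional", "professional", "professional"]
y_hat = ["personal", "professional", "professional", "professional", "personal"]

class_names = ["personal", "professional"]
matrix = confusion_matrix(y_true, y_hat, labels=class_names)
accuracy = accuracy_score(y_true, y_hat)

print(matrix)    # rows: true class, columns: predicted class
print(accuracy)  # fraction of emails whose prediction matches the true label
```

The diagonal of the matrix counts the correctly classified emails, so accuracy is the diagonal sum divided by the total number of emails.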
5. Classifying New Emails
With the machine learning model trained and evaluated, we can now use it to classify new emails that we receive.
new_email = preprocess("This is a new email")
new_email_features = vectorizer.transform([new_email])
predicted_label = clf.predict(new_email_features)[0]
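When filing emails automatically, it can help to know how sure the model is. The sketch below is one possible approach under stated assumptions: it uses MultinomialNB's predict_proba, a hypothetical helper named classify_with_confidence, an arbitrary threshold of 0.6, and a made-up two-email training set. None of these come from the original tutorial.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data, for illustration only
toy_texts = ["dinner with family tonight", "project meeting agenda tomorrow"]
toy_labels = ["personal", "professional"]

toy_vectorizer = CountVectorizer()
toy_clf = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)


def classify_with_confidence(text, threshold=0.6):
    """Return (label, probability); label is None when below the threshold."""
    probs = toy_clf.predict_proba(toy_vectorizer.transform([text]))[0]
    best = probs.argmax()
    label = toy_clf.classes_[best]
    return (label if probs[best] >= threshold else None), float(probs[best])


label, confidence = classify_with_confidence("project meeting")
print(label, confidence)

# Text with no known words falls back to the class priors
print(classify_with_confidence("hello there"))
```

Emails that come back with None could be left in the inbox for manual sorting instead of being moved to a folder.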
Conclusion
In this tutorial, you learned how to build a machine learning-based email classifier with Python. We went through the steps of data preprocessing, feature extraction, training the model, evaluating its performance, and classifying new emails.
By automating the categorization of emails, you can save valuable time and work more efficiently. The techniques covered in this tutorial extend beyond just email classification and can be applied to other text classification problems as well.