How to Build a Machine Learning-based Email Classifier with Python
Published 4/22/2023 | 5 min read

Email is essential to everyday communication, but receiving and sorting through a large volume of messages can take up valuable time.

In this tutorial, you'll learn how to build a machine learning-based email classifier with Python, which can automatically categorize your emails into folders, making it easier for you to manage your inbox.



Understanding the Problem

Before diving into the details of the implementation, let's first understand the problem that we are trying to solve.

In this scenario, we have a large collection of emails that need to be categorized into different folders based on their content. For instance, an email from a friend can be categorized under a folder titled 'personal', whereas an email from a work colleague could fall under 'professional'.

To build the email classifier, we need to have a set of labeled emails that we can use to train a machine learning model. The model can then be used to classify new incoming emails based on the content.
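
As a rough illustration of what "labeled emails" means in practice, the sketch below pairs a few made-up email texts with folder labels and splits them into two parallel lists. The example texts and the name `labeled_emails` are hypothetical and only stand in for a real dataset.

```python
from collections import Counter

# Hypothetical labeled examples: (email text, folder label)
labeled_emails = [
    ("Are you free for lunch on Saturday?", "personal"),
    ("Please review the attached quarterly report.", "professional"),
    ("Happy birthday! See you at the party.", "personal"),
    ("The project deadline has moved to Monday.", "professional"),
]

# Split into the two parallel lists that a training step consumes
emails = [text for text, _ in labeled_emails]
labels = [label for _, label in labeled_emails]

# Count examples per folder to spot a badly imbalanced dataset early
label_counts = Counter(labels)
print(label_counts)
```

A real training set would need many more examples per folder than this; the point here is only the shape of the data.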

Building the Email Classifier

To build the email classifier, we will use Python with two libraries: the Natural Language Toolkit (NLTK) for text preprocessing and scikit-learn for machine learning.

Here are the steps that we will be following:

  1. Extracting and Preprocessing the Data
  2. Feature Extraction
  3. Training the Model
  4. Evaluating the Model
  5. Classifying New Emails


1. Extracting and Preprocessing the Data

We will start by extracting the emails and labels from our dataset. The dataset contains a collection of emails labeled under different categories. We will use Python's built-in email module to extract the content of each email.

  
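import os
import email
from email import policy

def load_emails_from_directory(path):
    """Return two parallel lists: plain-text email bodies and their labels."""
    emails = []
    labels = []
    for filename in os.listdir(path):
        # The label is the part of the filename before the first dot, e.g. a
        # (hypothetical) file named "personal.001.eml" gets the label "personal"
        label = filename.split(".")[0]
        labels.append(label)
        with open(os.path.join(path, filename), "rb") as f:
            email_contents = f.read()
        # policy.default returns EmailMessage objects, which provide get_body()
        msg = email.message_from_bytes(email_contents, policy=policy.default)
        # Store the plain-text body as a string, since later steps work on text
        body = msg.get_body(preferencelist=("plain",))
        emails.append(body.get_content() if body is not None else "")
    return emails, labels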
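
Because the later steps operate on text rather than on parsed message objects, it can help to see how a plain-text body is pulled out of a message. The following is a minimal sketch using only the standard library; the helper name `extract_body` and the sample message are made up for illustration.

```python
import email
from email import policy
from email.message import EmailMessage

def extract_body(raw_bytes):
    """Parse raw message bytes and return the plain-text body ('' if none)."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""

# Build a sample multipart message in memory (all values are made up)
sample = EmailMessage()
sample["From"] = "friend@example.com"
sample["Subject"] = "Lunch on Friday?"
sample.set_content("Hi, are you free for lunch on Friday?")
sample.add_alternative("<p>Hi, are you free for lunch on Friday?</p>", subtype="html")

# Only the text/plain part is returned; the HTML alternative is ignored
body_text = extract_body(sample.as_bytes())
print(body_text)
```

Messages that contain only HTML would yield an empty string with this approach, so a real dataset may need an extra step to strip markup.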

Once we have the emails and their labels, we can preprocess the data to transform the raw text into a format suitable for machine learning. Preprocessing lowercases the text, splits it into tokens, drops non-alphabetic tokens and stop words, and then applies stemming and lemmatization.

  
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires NLTK data: run nltk.download('punkt'), nltk.download('stopwords')
# and nltk.download('wordnet') once beforehand. Newer NLTK releases may ask
# for 'punkt_tab' instead of 'punkt'.

def preprocess(text):
    text = text.lower() # convert all text to lowercase
    tokens = word_tokenize(text) # split the text into tokens
    tokens = [t for t in tokens if t.isalpha()] # remove non-alphabetic tokens
    stop_words = set(stopwords.words('english')) # stop words that don't add much meaning
    tokens = [t for t in tokens if t not in stop_words] # remove stop words
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(t) for t in tokens] # stemming: strip suffixes
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens] # lemmatization: map to dictionary form
    return ' '.join(tokens) # rejoin into one string for the vectorizer
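
To see what the first stages do without downloading any NLTK data, here is a simplified stand-in that uses only the standard library. It is not the NLTK pipeline: it swaps in a regular-expression tokenizer and a tiny hand-picked stop list, and it skips stemming and lemmatization. The names `STOP_WORDS` and `simple_preprocess` are hypothetical.

```python
import re

# Tiny hand-picked stop list for illustration; NLTK's English list is much longer
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "for", "on", "this"}

def simple_preprocess(text):
    """Lowercase, keep runs of letters as tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(kept)

cleaned = simple_preprocess("The meeting is moved to 3 PM on Friday!")
print(cleaned)  # prints "meeting moved pm friday"
```

The digit and the punctuation disappear at the tokenizing stage, and "the", "is", "to", and "on" are removed as stop words.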



2. Feature Extraction

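Once the data has been preprocessed, we need to transform it into features that can be used by the machine learning model. For this, we will use the bag-of-words (BoW) approach, which represents each email as a vector of word counts over the vocabulary of the whole dataset and ignores word order.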

  
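from sklearn.feature_extraction.text import CountVectorizer

# `emails` must be a list of plain-text strings; clean each one first
processed_emails = [preprocess(text) for text in emails]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(processed_emails)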
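
To make the bag-of-words idea concrete, the sketch below builds a count matrix by hand for three tiny documents. It is a from-scratch illustration of the concept, not how CountVectorizer is implemented, and the sample documents and the name `to_count_vector` are made up.

```python
from collections import Counter

docs = ["meeting report meeting", "party weekend", "report deadline"]

# Vocabulary: every distinct word, sorted so each word gets a fixed column
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc):
    """Return how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

matrix = [to_count_vector(doc) for doc in docs]

print(vocabulary)  # ['deadline', 'meeting', 'party', 'report', 'weekend']
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word, so the first row is `[0, 2, 0, 1, 0]` because "meeting" occurs twice and "report" once.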

3. Training the Model

With the features extracted, we can now train the machine learning model. We will use scikit-learn's multinomial Naive Bayes classifier, a common choice for word-count features.

  
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Train the Naive Bayes model
clf = MultinomialNB()
clf.fit(X_train, y_train)
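
For intuition about what a multinomial Naive Bayes model learns, here is a small from-scratch sketch with add-one (Laplace) smoothing. It is an illustration under simplifying assumptions, not scikit-learn's implementation; the function names `train_nb` and `predict_nb` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """docs: list of token lists. Returns (log_prior, log_likelihood) dicts."""
    vocab = sorted({tok for doc in docs for tok in doc})
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    # log P(class): share of training documents in each class
    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}

    # log P(word | class), smoothed so unseen words never get probability zero
    log_likelihood = {}
    for c in docs_per_class:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(tokens, log_prior, log_likelihood):
    """Pick the class with the highest log prior plus summed word log-likelihoods."""
    scores = {}
    for c in log_prior:
        known = [t for t in tokens if t in log_likelihood[c]]  # skip unseen words
        scores[c] = log_prior[c] + sum(log_likelihood[c][t] for t in known)
    return max(scores, key=scores.get)

train_docs = [
    ["meeting", "report", "deadline"],
    ["lunch", "weekend", "party"],
    ["report", "budget", "meeting"],
    ["party", "birthday", "weekend"],
]
train_labels = ["professional", "personal", "professional", "personal"]

log_prior, log_likelihood = train_nb(train_docs, train_labels)
print(predict_nb(["budget", "meeting"], log_prior, log_likelihood))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on long emails.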



4. Evaluating the Model

With the machine learning model trained, we can now evaluate its performance on our testing data.

  
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Evaluate the performance of the classifier
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\n\nClassification Report:\n", classification_report(y_test, y_pred))
print("\n\nAccuracy Score:", accuracy_score(y_test, y_pred))
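
To clarify what these metrics measure, the sketch below computes accuracy, a confusion matrix, and per-class precision and recall by hand on a few made-up predictions. It only illustrates the definitions; the sample labels and the helper name `precision_recall` are hypothetical.

```python
from collections import Counter

y_true = ["personal", "professional", "personal", "professional", "personal"]
y_pred = ["personal", "professional", "professional", "professional", "personal"]
class_names = sorted(set(y_true) | set(y_pred))

# Accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion matrix: rows are true classes, columns are predicted classes
pair_counts = Counter(zip(y_true, y_pred))
matrix = [[pair_counts[(t, p)] for p in class_names] for t in class_names]

def precision_recall(cls):
    """Precision: correct / all predicted as cls. Recall: correct / all truly cls."""
    correct = pair_counts[(cls, cls)]
    predicted = sum(pair_counts[(t, cls)] for t in class_names)
    actual = sum(pair_counts[(cls, p)] for p in class_names)
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall

print(accuracy)  # 0.8
print(matrix)    # [[2, 1], [0, 2]]
print(precision_recall("professional"))
```

Here one personal email was filed as professional, which lowers the precision of the professional class to 2/3 while its recall stays at 1.0.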

5. Classifying New Emails

With the machine learning model trained and evaluated, we can now use it to classify new emails that we receive.

  
# Extract features from the new email
new_email = preprocess("This is a new email")
new_email_features = vectorizer.transform([new_email])

# Use the trained model to classify the new email
predicted_label = clf.predict(new_email_features)[0]
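
One way to keep vectorizing and classifying in step is to bundle them in a scikit-learn pipeline, so new emails go through exactly the same transformation as the training data. This is a minimal sketch on made-up training texts and assumes scikit-learn is installed. It feeds raw text straight into CountVectorizer, so the tutorial's preprocess function would still need to be applied first if it is wanted.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration only
train_texts = [
    "quarterly budget meeting on monday",
    "project report deadline moved",
    "birthday party this weekend",
    "lunch with family on sunday",
]
train_labels = ["professional", "professional", "personal", "personal"]

# The pipeline fits the vectorizer and the classifier together
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_text = "budget meeting tomorrow"
predicted = model.predict([new_text])[0]

# Class probabilities give a rough idea of how confident the model is
probabilities = dict(zip(model.classes_, model.predict_proba([new_text])[0]))
print(predicted, probabilities)
```

A low top probability could be used as a signal to leave an email in the inbox rather than file it automatically.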



Conclusion

In this tutorial, you learned how to build a machine learning-based email classifier with Python. We went through the steps of data preprocessing, feature extraction, training the model, evaluating its performance, and classifying new emails.

By automating the categorization of emails, you can save valuable time and work more efficiently. The techniques covered in this tutorial extend beyond just email classification and can be applied to other text classification problems as well.


