Email is an essential mode of communication in today's fast-paced world. However, receiving and sorting through a large number of emails can be a daunting task that can take up valuable time.
In this tutorial, you'll learn how to build a machine learning-based email classifier with Python, which can automatically categorize your emails into folders, making it easier for you to manage your inbox.
Before diving into the details of the implementation, let's first understand the problem that we are trying to solve.
In this scenario, we have a large collection of emails that need to be categorized into different folders based on their content. For instance, an email from a friend can be categorized under a folder titled 'personal', whereas an email from a work colleague could fall under 'professional'.
To build the email classifier, we need to have a set of labeled emails that we can use to train a machine learning model. The model can then be used to classify new incoming emails based on the content.
We will implement the classifier in Python, using the Natural Language Toolkit (NLTK) for text preprocessing and the scikit-learn library for feature extraction and machine learning.
Here are the steps that we will be following:
We will start by extracting the emails and labels from our dataset. The dataset contains a collection of emails labeled under different categories. We will use Python's built-in email module to parse each file into a message object, and we will take each email's label from its filename.
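import os
import email

def load_emails_from_directory(path):
    """Return parallel lists of parsed messages and their labels."""
    emails = []
    labels = []
    for filename in os.listdir(path):
        # The label is the part of the filename before the first dot.
        label = filename.split(".")[0]
        labels.append(label)
        # Read the raw bytes and parse them into an email.message.Message object.
        with open(os.path.join(path, filename), "rb") as f:
            email_contents = f.read()
        msg = email.message_from_bytes(email_contents)
        emails.append(msg)
    return emails, labels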
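The loader above returns parsed message objects rather than plain text, so the body still has to be pulled out before any text processing. The following is a minimal sketch of one way to do that with the standard library. The helper name extract_body_text and the sample message are assumptions made for illustration; they are not part of the original tutorial. The sketch collects every text/plain part, including plain-text attachments, and ignores HTML-only bodies.

```python
import email
from email.message import Message


def extract_body_text(msg: Message) -> str:
    """Return the concatenated text/plain parts of a parsed message."""
    parts = []
    for part in msg.walk():
        # Skip multipart containers and anything that is not plain text.
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # bytes, or None for containers
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # The message declared a charset name that Python does not know.
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


# Hypothetical sample message, used only to exercise the helper.
raw = (
    b"From: alice@example.com\r\n"
    b"Subject: Lunch\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Are you free for lunch on Friday?\r\n"
)
body = extract_body_text(email.message_from_bytes(raw))
print(body.strip())
```

A helper like this could be applied to each message returned by load_emails_from_directory to obtain one string per email.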
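Once we have the emails and their labels, we can preprocess the data to transform the raw text into a format suitable for machine learning. The preprocessing step includes lowercasing, tokenization, removing stop words, stemming, and lemmatization. In practice, either stemming or lemmatization is usually enough, because a lemmatizer has little left to do once words have been reduced to stems; the code below applies both for illustration.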
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires the NLTK tokenizer ('punkt' or 'punkt_tab', depending on the NLTK version),
# 'stopwords', and 'wordnet' resources, installed once with nltk.download(...).
def preprocess(text):
    text = text.lower()  # convert all text to lowercase
    tokens = word_tokenize(text)  # split the text into tokens
    tokens = [t for t in tokens if t.isalpha()]  # remove non-alphabetic tokens
    stop_words = set(stopwords.words('english'))  # common words that add little meaning
    tokens = [t for t in tokens if t not in stop_words]  # remove stop words
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(t) for t in tokens]  # stemming
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]  # lemmatization
    return ' '.join(tokens)
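To see what the first few preprocessing steps do without downloading any NLTK data, here is a simplified sketch that uses only the standard library. It is an illustration, not a replacement for the function above: the name toy_preprocess and the tiny stop-word list are assumptions, a regular expression stands in for word_tokenize, and stemming and lemmatization are left out.

```python
import re

# A deliberately tiny stop-word list for illustration only;
# NLTK's English list is much longer.
TOY_STOP_WORDS = {"a", "an", "and", "are", "for", "is", "on", "the", "to", "you"}


def toy_preprocess(text: str) -> str:
    """Lowercase, tokenize, and drop stop words using only the standard library."""
    tokens = re.findall(r"[a-z]+", text.lower())  # keep alphabetic runs only
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    return " ".join(tokens)


result = toy_preprocess("Are you free for lunch on Friday? The meeting is at 12.")
print(result)  # free lunch friday meeting at
```

The number "12" and the punctuation disappear because only alphabetic runs are kept, and "at" survives only because it is missing from the toy stop-word list.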
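Once the data has been preprocessed, we need to transform it into features that can be used by the machine learning model. For this, we will be using the bag-of-words (BOW) approach, which represents each email as a vector of word counts over the vocabulary of the whole collection and ignores word order.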
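from sklearn.feature_extraction.text import CountVectorizer
# CountVectorizer expects strings, so flatten each parsed message to text and preprocess it.
# Note: as_string() includes the headers as well as the body.
texts = [preprocess(msg.as_string()) for msg in emails]
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)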
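To make the bag-of-words idea concrete, the sketch below builds the same kind of count matrix by hand with the standard library. The three toy documents and the helper name to_count_vector are assumptions made for illustration; CountVectorizer does the equivalent work, plus its own tokenization, and stores the result as a sparse matrix.

```python
from collections import Counter

# Three toy documents, assumed to be preprocessed already.
docs = ["free lunch friday", "meeting friday friday", "lunch meeting"]

# Vocabulary: every distinct word, in sorted order. Each word is one column.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def to_count_vector(doc: str) -> list:
    """Count how often each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    # Words outside the vocabulary are ignored; missing words count as 0.
    return [counts[word] for word in vocabulary]


matrix = [to_count_vector(doc) for doc in docs]
print(vocabulary)  # ['free', 'friday', 'lunch', 'meeting']
for row in matrix:
    print(row)
# [1, 1, 1, 0]
# [0, 2, 0, 1]
# [0, 0, 1, 1]
```

Each row is one document and each column is one vocabulary word, which is the shape of input the classifier expects.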
With the features extracted, we can now train the machine learning model. We will be using a Naive Bayes algorithm for classification.
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
# Train the Naive Bayes model
clf = MultinomialNB()
clf.fit(X_train, y_train)
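For readers who want to see what a multinomial Naive Bayes classifier computes, here is a small from-scratch sketch. It is not scikit-learn's implementation: the function names train_nb and predict_nb and the toy training data are assumptions for illustration. It uses add-one (Laplace) smoothing with alpha=1.0, which matches the default alpha of MultinomialNB, and works in log space to avoid multiplying many small probabilities.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = {w for doc in docs for w in doc}
    word_counts = defaultdict(Counter)  # label -> word -> count
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    label_counts = Counter(labels)
    log_prior = {c: math.log(n / len(labels)) for c, n in label_counts.items()}
    log_likelihood = {}
    for c in label_counts:
        # Smoothing adds alpha to every word count, so no probability is zero.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict_nb(doc, log_prior, log_likelihood):
    """Return the label with the highest posterior log score."""
    scores = {}
    for c in log_prior:
        # Words never seen during training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical, already tokenized training emails.
train_docs = [
    ["lunch", "friday", "pizza"],
    ["weekend", "trip", "lunch"],
    ["meeting", "report", "deadline"],
    ["report", "budget", "meeting"],
]
train_labels = ["personal", "personal", "professional", "professional"]

log_prior, log_likelihood = train_nb(train_docs, train_labels)
prediction = predict_nb(["budget", "meeting", "friday"], log_prior, log_likelihood)
print(prediction)  # professional
```

The word "friday" pulls the score toward 'personal', but "budget" and "meeting" together outweigh it, so the email is labeled 'professional'.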
With the machine learning model trained, we can now evaluate its performance on our testing data.
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Make predictions on the test data
y_pred = clf.predict(X_test)
# Evaluate the performance of the classifier
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\n\nClassification Report:\n", classification_report(y_test, y_pred))
print("\n\nAccuracy Score:", accuracy_score(y_test, y_pred))
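To clarify what these metrics measure, the sketch below computes accuracy, a confusion count, and per-class precision and recall by hand on a handful of made-up labels. The label lists and the helper name precision_recall are assumptions for illustration only; they are not results from the classifier above.

```python
from collections import Counter

# Hypothetical true and predicted labels for five emails.
y_true = ["personal", "personal", "professional", "professional", "professional"]
y_hat = ["personal", "professional", "professional", "professional", "personal"]

# Accuracy: the fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

# Confusion counts keyed by (true label, predicted label).
confusion = Counter(zip(y_true, y_hat))


def precision_recall(label):
    """Return (precision, recall) for one class, or 0.0 where undefined."""
    tp = confusion[(label, label)]
    predicted = sum(n for (t, p), n in confusion.items() if p == label)
    actual = sum(n for (t, p), n in confusion.items() if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


print(accuracy)  # 0.6
print(confusion[("personal", "professional")])  # 1
print(precision_recall("personal"))  # (0.5, 0.5)
```

Precision is the share of emails predicted as a class that really belong to it, and recall is the share of emails in a class that the classifier found. The classification report lists both for every class.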
With the machine learning model trained and evaluated, we can now use it to classify new emails that we receive.
# Extract features from the new email
new_email = preprocess("This is a new email")
new_email_features = vectorizer.transform([new_email])
# Use the trained model to classify the new email
predicted_label = clf.predict(new_email_features)[0]
print("Predicted label:", predicted_label)
In this tutorial, you learned how to build a machine learning-based email classifier with Python. We went through the steps of data preprocessing, feature extraction, training the model, evaluating its performance, and classifying new emails.
By automating the categorization of emails, you can save valuable time and work more efficiently. The techniques covered in this tutorial extend beyond just email classification and can be applied to other text classification problems as well.
2092 words authored by Gen-AI! So please do not take it seriously, it's just for fun!