Naïve Bayes: Fast, Interpretable Text Classification


SMS Spam Filter → Email Spam Detection

What You’ll Build

We’ll train a Naïve Bayes classifier that flags spam versus ham (legitimate) messages. You’ll see why it’s fast, robust on small data, and surprisingly strong for text tasks—then scale the same idea to email spam detection.

How Naïve Bayes Works

Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem and the naïve (conditional independence) assumption among features given the class.

Bayes’ Theorem

P(C \mid \mathbf{x})=\dfrac{P(\mathbf{x}\mid C)\,P(C)}{P(\mathbf{x})}

For classification we compare classes via the unnormalized posterior, since the evidence P(\mathbf{x}) is the same for every class:

P(C \mid \mathbf{x}) \propto P(\mathbf{x}\mid C)\,P(C)

Naïve (Conditional Independence) Assumption

If \mathbf{x}=(x_1,\ldots,x_n) are features (e.g., word counts), then:

P(\mathbf{x}\mid C)=\prod_{j=1}^{n} P(x_j \mid C)

Decision Rule (MAP)

Pick the class with the largest posterior:

\hat{C}=\arg\max_{C \in \mathcal{Y}} \; P(C)\,\prod_{j=1}^{n} P(x_j\mid C)

To avoid underflow and turn products into sums, we use logs:

\hat{C}=\arg\max_{C} \; \log P(C)+\sum_{j=1}^{n}\log P(x_j\mid C)
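
To make the log‑space rule concrete, here is a tiny sketch with made‑up numbers for two classes and three features:

import numpy as np

# Made-up numbers: class priors and per-feature likelihoods P(x_j | C)
log_prior = {"spam": np.log(0.4), "ham": np.log(0.6)}
log_lik = {
    "spam": np.log([0.05, 0.02, 0.01]),
    "ham":  np.log([0.01, 0.03, 0.04]),
}

# MAP in log-space: log P(C) + sum_j log P(x_j | C); the largest score wins
scores = {c: log_prior[c] + log_lik[c].sum() for c in ("spam", "ham")}
print(max(scores, key=scores.get), scores)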

Multinomial Naïve Bayes for Text

For bag‑of‑words counts, Multinomial NB models word occurrences. With Laplace (add‑\alpha) smoothing:

P(w\mid C)=\dfrac{N_{w,C}+\alpha}{\sum_{w'} (N_{w',C}+\alpha)}

  • N_{w,C} = total count of word w in class C across training docs
  • \alpha = smoothing parameter (common default: \alpha=1)
  • Class prior: P(C)=\frac{N_C}{N}
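
A quick numeric check of the smoothing formula, with made‑up counts for a three‑word vocabulary:

import numpy as np

# Made-up counts N_{w,spam} for a toy vocabulary ("win", "meeting", "free")
counts = np.array([10, 0, 5])
alpha = 1.0  # Laplace smoothing

# P(w | spam) = (N_{w,C} + alpha) / sum_{w'} (N_{w',C} + alpha)
probs = (counts + alpha) / (counts + alpha).sum()
print(probs)        # the unseen word "meeting" still gets nonzero probability
print(probs.sum())  # probabilities sum to 1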

Why it works well for text

  • With smoothing, it handles unseen words gracefully.
  • Word features are sparse and high‑dimensional, so the independence assumption is often “good enough.”
  • Training is one pass over counts, so it stays fast even on large corpora.

Toy Problem – SMS Spam Filter

We’ll train a model to classify SMS messages as spam or ham (legitimate). Using the SMS Spam Collection dataset (5,000+ labeled messages), we’ll turn text into features and apply Naïve Bayes to detect patterns like common spam words (“win”, “free”, “offer”).

This simple setup is a classic way to learn how text classification works in practice.

Step 0: Setup & Imports

Prepare the libraries; the splits below fix random_state=42 for reproducibility.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import re

Step 1: Load Data

Read the SMS dataset and do a quick sanity check.

df = pd.read_csv("spam.csv")  # expects columns like: label,text; some copies of this dataset need encoding="latin-1"
df = df.rename(columns={df.columns[0]: "label", df.columns[1]: "text"})  # if needed
df = df[["label", "text"]].dropna()
print(df.head())
print(df['label'].value_counts())

Step 2: Simple Text Cleaning (optional but helpful)

Normalize text (lowercase, strip URLs/punctuation) to reduce noise.

def clean_text(s):
    s = s.lower()
    s = re.sub(r"http\S+|www\.\S+", " URL ", s)
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

df["text_clean"] = df["text"].astype(str).apply(clean_text)

Step 3: Train/Test Split

Keep a hold‑out set to estimate generalization.

X_train, X_test, y_train, y_test = train_test_split(
    df["text_clean"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

Step 4: Build Pipeline (TF‑IDF → MultinomialNB)

End‑to‑end vectorization + model; easy to train/tune/deploy.

nb_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=2)),
    ("nb", MultinomialNB(alpha=1.0))
])

nb_clf.fit(X_train, y_train)

Step 5: Evaluate

Understand precision/recall trade‑offs; check for bias toward “ham”.

y_pred = nb_clf.predict(X_test)
print(classification_report(y_test, y_pred))

# Optional ROC-AUC (treat 'spam' as positive)
y_prob = nb_clf.predict_proba(X_test)[:, list(nb_clf.classes_).index("spam")]
print("ROC AUC:", roc_auc_score((y_test=="spam").astype(int), y_prob))

Step 6: Inspect Top Tokens (Model Intuition)

See which words the model weights most heavily within each class.

vec = nb_clf.named_steps["tfidf"]
clf = nb_clf.named_steps["nb"]

feature_names = np.array(vec.get_feature_names_out())
log_prob = clf.feature_log_prob_  # shape: [n_classes, n_features]
classes = clf.classes_

for ci, cname in enumerate(classes):
    top_i = np.argsort(log_prob[ci])[-15:]
    print(f"Top tokens for class '{cname}':", feature_names[top_i])

Step 7: Try ComplementNB (good for imbalanced text)

ComplementNB can be more stable when classes are skewed.

cnb_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=2)),
    ("cnb", ComplementNB(alpha=0.5))
])
cnb_clf.fit(X_train, y_train)
print(classification_report(y_test, cnb_clf.predict(X_test)))

Step 8: Threshold Tuning (optional)

If catching spam is critical, you can raise recall on spam by lowering the decision threshold.

spam_index = list(nb_clf.classes_).index("spam")
probs = nb_clf.predict_proba(X_test)[:, spam_index]

def predict_with_threshold(p, t=0.5):
    return np.where(p >= t, "spam", "ham")

for t in [0.3, 0.5, 0.7]:
    print(f"\n== Threshold {t} ==")
    y_pred_t = predict_with_threshold(probs, t)
    print(classification_report(y_test, y_pred_t))

Step 9: Save & Load (deployment-ready)

Persist the trained pipeline for reuse in apps.

import joblib
joblib.dump(nb_clf, "sms_spam_nb.joblib")

# later
# nb_clf = joblib.load("sms_spam_nb.joblib")
# nb_clf.predict(["free entry in 2 a wkly comp to win..."])

Quick Reference: SMS Spam Filter

# sms_nb_full.py

import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, roc_auc_score
import joblib

# 1) Load
df = pd.read_csv("spam.csv")  # columns: label,text; some copies need encoding="latin-1"
df = df.rename(columns={df.columns[0]: "label", df.columns[1]: "text"})
df = df[["label", "text"]].dropna()

# 2) Clean
def clean_text(s):
    s = s.lower()
    s = re.sub(r"http\S+|www\.\S+", " URL ", s)
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

df["text_clean"] = df["text"].astype(str).apply(clean_text)

# 3) Split
X_train, X_test, y_train, y_test = train_test_split(
    df["text_clean"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# 4) Model: TF-IDF + MultinomialNB
nb_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=2)),
    ("nb", MultinomialNB(alpha=1.0))
])
nb_clf.fit(X_train, y_train)

# 5) Evaluate
y_pred = nb_clf.predict(X_test)
print(classification_report(y_test, y_pred))

# AUC (spam as positive)
spam_index = list(nb_clf.classes_).index("spam")
y_prob = nb_clf.predict_proba(X_test)[:, spam_index]
print("ROC AUC:", roc_auc_score((y_test=="spam").astype(int), y_prob))

# 6) Inspect tokens
vec = nb_clf.named_steps["tfidf"]
clf = nb_clf.named_steps["nb"]
feature_names = np.array(vec.get_feature_names_out())
for ci, cname in enumerate(clf.classes_):
    top_i = np.argsort(clf.feature_log_prob_[ci])[-15:]
    print(f"Top tokens for class '{cname}':", feature_names[top_i])

# 7) Alternative: ComplementNB (often better on skew)
cnb_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=2)),
    ("cnb", ComplementNB(alpha=0.5))
])
cnb_clf.fit(X_train, y_train)
print("\nComplementNB:\n", classification_report(y_test, cnb_clf.predict(X_test)))

# 8) Save model
joblib.dump(nb_clf, "sms_spam_nb.joblib")

Real‑World Application — Email Spam Detection (Scaling Up)

Email spam is richer than SMS: subject, body (HTML/plain), headers (From/To/Return‑Path/Received), attachments, embedded links, and reputation signals. Below is a practical blueprint.

What changes from SMS?

  • Richer features: subject + body tokens, header anomalies (e.g., suspicious Return-Path), domain reputation, URL presence/count, HTML features (e.g., many <a> tags), character encodings, excessive punctuation/caps.
  • Class imbalance: Legitimate emails >> spam; handle with ComplementNB, threshold tuning, or sampling strategies.
  • Evasion tactics: Obfuscation (e.g., “v1@gr@”) and templated campaigns → keep tokenization and normalization robust; consider character n‑grams (see the sketch after this list).
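
A minimal character‑n‑gram variant, sketched with the same imports as the SMS example (the hyperparameters here are illustrative, not tuned):

char_clf = Pipeline([
    # analyzer="char_wb" builds n-grams inside word boundaries, so obfuscated
    # tokens that no longer match at the word level can still share character
    # fragments with known spam vocabulary
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), min_df=5)),
    ("cnb", ComplementNB(alpha=0.5))
])

Drop it in wherever the word‑level pipeline is used; everything downstream stays the same.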

Minimal End-to-End Pipeline (Template)

Step 1: Data schema

Organize inputs (subject, body, headers) for clean feature engineering.

emails = pd.read_json("emails.jsonl", lines=True)  # columns: label, subject, body, headers
emails = emails.dropna(subset=["label", "body"]).reset_index(drop=True)

Step 2: Feature function

Combine textual and simple structural signals.

def email_to_text(e):
    subj = e.get("subject", "")
    body = e.get("body", "")
    headers = e.get("headers", {}) or {}  # guard against null headers
    from_dom = headers.get("from_domain", "")
    url_count = headers.get("url_count", 0)
    meta = f" fromdomain_{from_dom} urlcount_{url_count} "
    return f"{subj} {body} {meta}"
    
emails["text_all"] = emails.apply(email_to_text, axis=1)

Step 3: Train/Test split and model

Same NB backbone, now on combined features; ComplementNB often shines here.

X_train, X_test, y_train, y_test = train_test_split(
    emails["text_all"], emails["label"], test_size=0.2, random_state=42, stratify=emails["label"]
)

email_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=5, max_df=0.9)),
    ("cnb", ComplementNB(alpha=0.5))
])

email_clf.fit(X_train, y_train)
print(classification_report(y_test, email_clf.predict(X_test)))

Step 4: Hardening & Ops

Make it production‑friendly.

  • Continuous learning: Periodically retrain with latest labeled emails (campaigns evolve).
  • Threshold per segment: Different thresholds for high‑risk sources or newly seen domains.
  • Quarantine + human‑in‑the‑loop: Route low‑confidence cases to review (see the sketch after this list).
  • Monitoring: Track precision/recall, drift in token distributions, and false‑positive rate on VIP mail.
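
A minimal routing sketch for the quarantine idea, assuming the email_clf pipeline above and purely illustrative thresholds:

import numpy as np

spam_i = list(email_clf.classes_).index("spam")
p_spam = email_clf.predict_proba(X_test)[:, spam_i]

# Illustrative cut-offs -- tune per segment on validation data
T_BLOCK, T_REVIEW = 0.9, 0.5
route = np.where(p_spam >= T_BLOCK, "block",
                 np.where(p_spam >= T_REVIEW, "quarantine", "inbox"))
print(dict(zip(*np.unique(route, return_counts=True))))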

Full Code Collection: Email Spam

# email_nb_template.py

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report

# 1) Load emails: columns: label, subject, body, headers (dict)
emails = pd.read_json("emails.jsonl", lines=True)
emails = emails.dropna(subset=["label", "body"]).reset_index(drop=True)

# 2) Feature builder: subject + body + simple header meta
def email_to_text(row):
    subject = row.get("subject", "")
    body = row.get("body", "")
    headers = row.get("headers", {}) or {}
    from_dom = headers.get("from_domain", "")
    url_count = headers.get("url_count", 0)
    meta = f" fromdomain_{from_dom} urlcount_{url_count} "
    return f"{subject} {body} {meta}"

emails["text_all"] = emails.apply(email_to_text, axis=1)

# 3) Split
X_train, X_test, y_train, y_test = train_test_split(
    emails["text_all"], emails["label"], test_size=0.2, random_state=42, stratify=emails["label"]
)

# 4) Model: TF-IDF + ComplementNB
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=5, max_df=0.9)),
    ("cnb", ComplementNB(alpha=0.5))
])

clf.fit(X_train, y_train)

# 5) Evaluate
print(classification_report(y_test, clf.predict(X_test)))

Strengths & Limitations

Strengths

  • Fast and lightweight: Trains in seconds; low memory footprint.
  • Surprisingly strong on text: Bag‑of‑words + NB is a tough baseline to beat.
  • Interpretable tokens: Easy to surface top indicative words for each class.

Limitations

  • Independence assumption: Ignores word dependencies/ordering.
  • Feature sensitivity: Performance depends on good tokenization/normalization.
  • Linear decision surfaces in log‑space: May underfit when complex interactions matter.

Final Notes

Start simple with MultinomialNB/ComplementNB and TF‑IDF. If you need more accuracy later, try character n‑grams, better normalization, or step up to linear SVM/LogReg—but keep NB as your fast baseline.

Next Steps for You:

  • Add character n‑grams and compare NB vs Linear SVM on the same TF‑IDF features (a comparison sketch follows below).
  • Implement continuous retraining with drift detection to keep up with new spam campaigns.
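
For the first next step, a minimal comparison sketch on the SMS data, reusing the training variables from above (LinearSVC is one reasonable SVM baseline; hyperparameters are defaults, not tuned):

from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Same TF-IDF features, two classifiers; NB stays the fast baseline
for name, model in [("nb", MultinomialNB(alpha=1.0)), ("svm", LinearSVC())]:
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
        (name, model),
    ])
    pipe.fit(X_train, y_train)
    print(name, "spam F1:", f1_score(y_test, pipe.predict(X_test), pos_label="spam"))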
