Iris Classification → Product Recommendation

Why k-NN?
Imagine moving into a new neighborhood. To understand what life is like, you look at your closest neighbors — if most of them garden, you might start gardening; if most of them enjoy sports, you’ll likely join.
That’s what k-Nearest Neighbors (k-NN) does: it assigns labels or makes predictions for new data by checking which nearby examples it resembles the most.
How k-NN Works
k-NN is a simple yet powerful instance-based learning algorithm. Instead of “training” a model with weights, it memorizes the dataset and makes predictions by comparing a new sample with the stored examples.
Key Concepts:
- Distance metric: How we define “closeness” (the minimal sketch after this list shows two of these in code). Common ones:
  - Euclidean distance → straight-line distance in feature space.
  - Manhattan distance → grid-like distance that sums absolute differences per feature (useful for sparse data).
  - Cosine similarity → angle-based similarity (common in text/recommendation).
- Number of neighbors (k): Too small → sensitive to noise; too large → oversmooths.
- Weights: Uniform (all neighbors equal) or distance-weighted (closer neighbors matter more).
- Decision boundaries: Flexible, can adapt to non-linear shapes.
- Complexity: Training is instant (just store data), but prediction can be costly for large datasets.
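To make these concepts concrete, here is a minimal from-scratch sketch of the prediction step: compute distances, take the k closest training points, and vote. The function name knn_predict and the tiny demo arrays are illustrative only; the walkthrough below uses scikit-learn's KNeighborsClassifier instead.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5, metric="euclidean"):
    # Illustrative from-scratch k-NN vote (not the scikit-learn API)
    diffs = X_train - x_new
    if metric == "euclidean":
        dists = np.sqrt((diffs ** 2).sum(axis=1))  # straight-line distance
    else:
        dists = np.abs(diffs).sum(axis=1)          # Manhattan (grid-like) distance
    nearest = np.argsort(dists)[:k]                # indices of the k closest training points
    votes = Counter(y_train[nearest])              # majority vote among the neighbors
    return votes.most_common(1)[0][0]

# Tiny 2-D demo: two clusters, one query point near the second cluster
X_demo = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_demo = np.array([0, 0, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([0.9, 1.0]), k=3))  # predicts class 1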
When to use k-NN:
- Great for small to medium datasets.
- Effective when data is fairly clean, features are meaningful, and relationships are local.
- Struggles with irrelevant features and high dimensions (curse of dimensionality); the short demo after this list illustrates this.
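To see why irrelevant features hurt, the snippet below is a quick illustration on synthetic data (not part of the Iris walkthrough): it compares cross-validated accuracy before and after padding a dataset with pure-noise features. Exact numbers will vary, but accuracy typically drops once noise dominates the distances.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# 5 informative features vs. the same data padded with 45 pure-noise features
X_good, y = make_classification(n_samples=300, n_features=5, n_informative=5,
                                n_redundant=0, random_state=0)
rng = np.random.RandomState(0)
X_noisy = np.hstack([X_good, rng.normal(size=(300, 45))])

for name, X in [("5 informative features", X_good), ("plus 45 noise features", X_noisy)]:
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean()
    print(name, "-> CV accuracy:", round(acc, 3))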
Toy Problem – Iris Flower Classification
We’ll train a k-NN classifier to recognize three flower species based on their measurements.
Step 0: Setup & Imports
We first import all libraries required for data handling, preprocessing, modeling, and visualization. This ensures we have the right tools in place for each stage.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
Step 1: Load Data & Snapshot
We load the Iris dataset, a balanced dataset with 150 samples. Inspecting the first few rows helps us verify structure and spot missing values.
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target
df = iris.frame
print("Shape:", df.shape)
df.head()
Step 2: Train/Test Split & Scaling
We split data into training and test sets to evaluate performance fairly. Scaling is applied because k-NN relies on distance, and unscaled features can distort results.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
Step 3: Baseline Model
We train a simple k-NN classifier with k=5 to set a baseline. This gives us a first look at the model’s performance before tuning.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)
y_pred = knn.predict(X_test_s)
print("Baseline accuracy:", accuracy_score(y_test, y_pred))
Step 4: Hyperparameter Tuning
We perform a grid search over the number of neighbors, distance metrics, and weighting strategies, using 5-fold cross-validation. This reduces the risk of under- or overfitting caused by a poor choice of k.
param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "minkowski"]
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="accuracy", cv=5, n_jobs=-1)
grid.fit(X_train_s, y_train)
print("Best params:", grid.best_params_)
print("Best CV score:", grid.best_score_)
best_knn = grid.best_estimator_
y_pred_best = best_knn.predict(X_test_s)
print("Test accuracy:", accuracy_score(y_test, y_pred_best))
Step 5: Evaluation & Diagnostics
We generate a classification report and confusion matrix to see how well the model distinguishes between classes. This provides detailed insights beyond accuracy.
print(classification_report(y_test, y_pred_best, target_names=iris.target_names))
cm = confusion_matrix(y_test, y_pred_best)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
disp.plot()
plt.title("k-NN Confusion Matrix (Iris)")
plt.show()
Step 6: Accuracy vs. k Visualization
We plot cross-validation accuracy across different k values. This helps us see where the model performs best and avoid poor k selections.
ks = range(1, 31)
cv_scores = []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k, weights="distance")
    scores = cross_val_score(model, X_train_s, y_train, cv=5, scoring="accuracy", n_jobs=-1)
    cv_scores.append(scores.mean())
plt.plot(ks, cv_scores, marker="o")
plt.xlabel("k (neighbors)")
plt.ylabel("CV accuracy")
plt.title("Choosing k via Cross-Validation (Iris)")
plt.grid(True, linestyle="--", alpha=0.5)
plt.show()
Quick Reference: Iris Toy Problem
# --- Full Iris k-NN Pipeline ---
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Load data
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
df = iris.frame
print(df.head())
# Split + scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_s, X_test_s = scaler.fit_transform(X_train), scaler.transform(X_test)
# Baseline model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)
print("Baseline accuracy:", accuracy_score(y_test, knn.predict(X_test_s)))
# Grid search
param_grid = {"n_neighbors": list(range(1, 31)), "weights": ["uniform", "distance"], "metric": ["euclidean", "manhattan", "minkowski"]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="accuracy", cv=5, n_jobs=-1)
grid.fit(X_train_s, y_train)
best_knn = grid.best_estimator_
print("Best params:", grid.best_params_)
print("Best test acc:", accuracy_score(y_test, best_knn.predict(X_test_s)))
# Evaluation
print(classification_report(y_test, best_knn.predict(X_test_s), target_names=iris.target_names))
ConfusionMatrixDisplay.from_estimator(best_knn, X_test_s, y_test, display_labels=iris.target_names)
plt.show()
# Accuracy vs k
ks = range(1, 31)
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k, weights="distance"), X_train_s, y_train, cv=5).mean() for k in ks]
plt.plot(ks, scores, marker="o")
plt.xlabel("k"); plt.ylabel("CV Accuracy"); plt.title("Accuracy vs k"); plt.show()
Real‑World Application — Product Recommendation with k-NN
We’ll simulate a basic recommender system using item-based k-NN and cosine similarity.
Step 1: Build a User–Item Matrix
We create a toy dataset where rows are users and columns are items. Each cell indicates whether the user interacted with the item.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
M = pd.DataFrame(
    [
        [1, 1, 0, 0, 0],  # User A
        [1, 0, 1, 0, 0],  # User B
        [0, 1, 1, 1, 0],  # User C
        [0, 0, 1, 1, 1],  # User D
    ],
    columns=["Item_Apple", "Item_Banana", "Item_Carrot", "Item_Donut", "Item_Eclair"],
    index=["User_A", "User_B", "User_C", "User_D"]
)
Step 2: Compute Item Similarities
We calculate cosine similarity between item vectors. This tells us which items are most alike based on user behavior.
item_sim = pd.DataFrame(
    cosine_similarity(M.T),
    index=M.columns,
    columns=M.columns
)
print(item_sim.round(2))
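As a sanity check (not part of the original pipeline), one entry of the matrix can be verified by hand: cosine similarity is the dot product of two item columns divided by the product of their norms.
import numpy as np

carrot = M["Item_Carrot"].to_numpy()
donut = M["Item_Donut"].to_numpy()
manual = carrot @ donut / (np.linalg.norm(carrot) * np.linalg.norm(donut))
print(round(manual, 2))  # ≈ 0.82, matching item_sim.loc["Item_Carrot", "Item_Donut"]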
Step 3: Recommend Similar Items
We define a function to recommend the top-k most similar items for a given product. This is the foundation of item-based collaborative filtering.
def recommend_similar_items(item_id, top_k=3):
    sims = item_sim.loc[item_id].drop(item_id).sort_values(ascending=False)
    return sims.head(top_k)
print("Similar to Item_Carrot:")
print(recommend_similar_items("Item_Carrot", top_k=3))
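The same neighbor lookup can also be expressed with scikit-learn's NearestNeighbors over the item vectors, which is how an item-based k-NN recommender is often implemented at scale. This sketch assumes the matrix M from Step 1 and simply mirrors the similarity-matrix approach above (cosine distance = 1 − cosine similarity).
from sklearn.neighbors import NearestNeighbors

# Fit a k-NN index over item vectors (the columns of M)
nn = NearestNeighbors(n_neighbors=4, metric="cosine")
nn.fit(M.T)

# Query the neighbors of Item_Carrot; the first hit is the item itself at distance 0
distances, indices = nn.kneighbors(M.T.loc[["Item_Carrot"]])
for dist, idx in zip(distances[0][1:], indices[0][1:]):
    print(M.columns[idx], "similarity ≈", round(1 - dist, 2))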
Full Code Collection: Product Recommendation
# --- Item-based k-NN Recommendation ---
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
# User-item matrix
M = pd.DataFrame(
    [
        [1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 1, 1, 1],
    ],
    columns=["Item_Apple", "Item_Banana", "Item_Carrot", "Item_Donut", "Item_Eclair"],
    index=["User_A", "User_B", "User_C", "User_D"]
)
# Compute similarities
item_sim = pd.DataFrame(cosine_similarity(M.T), index=M.columns, columns=M.columns)
# Recommend function
def recommend_similar_items(item_id, top_k=3):
    sims = item_sim.loc[item_id].drop(item_id).sort_values(ascending=False)
    return sims.head(top_k)
print("Similar to Item_Carrot:")
print(recommend_similar_items("Item_Carrot", top_k=3))
How you’d productionize:
- Choose user‑based or item‑based k‑NN.
- Use cosine similarity for sparse implicit data (clicks); Euclidean distance if features are dense and continuous.
- For each user, score candidate items by a similarity‑weighted sum from the k nearest items they interacted with (a minimal sketch of this scoring step follows this list).
- Add popularity priors, time decay, and diversity filters.
- Evaluate with offline metrics (Precision@K, Recall@K, MAP, NDCG) and A/B tests.
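Below is a rough sketch of that similarity-weighted scoring step on the toy matrix above. It reuses M and item_sim from the code block above; recommend_for_user is an illustrative helper, not a library function, and a production system would restrict the sum to the k most similar items and add the priors and filters listed above.
def recommend_for_user(user_id, M, item_sim, top_n=3):
    # Items the user has already interacted with (implicit 0/1 feedback)
    interactions = M.loc[user_id]
    seen = interactions[interactions > 0].index
    # Score every item by the sum of its similarities to the user's seen items
    scores = item_sim[seen].sum(axis=1)
    # Drop items the user already has, then return the highest-scoring ones
    return scores.drop(seen).sort_values(ascending=False).head(top_n)

print("Recommendations for User_A:")
print(recommend_for_user("User_A", M, item_sim))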
Strengths & Limitations
Strengths
- Simplicity → Easy to understand and implement; no complex training phase.
- Flexibility → Works for both classification and regression.
- Non-parametric → Can model complex, irregular decision boundaries.
- Adaptability → Naturally supports multi-class problems without modification.
- Transparency → Predictions can be explained by showing the nearest examples.
Limitations
- Scalability → Prediction time grows with dataset size, since distances must be computed to all points.
- Curse of dimensionality → Performance degrades as the number of features grows.
- Sensitivity to noise → Outliers or mislabeled data can mislead predictions.
- Feature scaling dependency → Requires normalization/standardization to work correctly.
- Imbalanced data issues → Majority classes can dominate neighbor votes.
Final Notes
k-NN is one of the most intuitive algorithms in machine learning. It shows how simple ideas like “check your neighbors” can achieve strong predictive power. While it is rarely the best choice for very large or high-dimensional datasets, k-NN is excellent as a baseline model, a teaching tool, or for applications where interpretability matters.
In practice, k-NN often serves as a benchmark before deploying more sophisticated algorithms. With the right scaling, distance metric, and hyperparameters, it can still perform competitively in real-world tasks like recommendation systems, image recognition, and anomaly detection.
Next Steps for You:
Experiment with distance metrics and hyperparameters → Try different values of k and distance measures (Euclidean, Manhattan, Cosine) to see how results change. This builds intuition on how k-NN adapts to different data.
Scale to real datasets → Apply k-NN to larger problems like image classification (MNIST) or recommender systems (MovieLens). This shows its strengths and limitations in more realistic settings.
