Image Color Quantization → Customer Segmentation

Why K-Means Clustering?
Have you ever tried to simplify a photo by reducing its colors into just a handful of shades, like a digital painting? Or wondered how businesses group customers into clusters like “budget shoppers,” “occasional buyers,” and “loyal customers” without asking them directly?
Both of these problems can be solved using K-means clustering. It’s an unsupervised learning algorithm that automatically finds structure in data by grouping similar points together.
In this tutorial:
- We’ll start with a toy problem: simplifying an image using color quantization.
- Then, we’ll scale up to a real-world application: segmenting customers based on their behavior.
How K-Means Clustering Works
K-means is one of the most widely used unsupervised machine learning algorithms for clustering. Unlike supervised learning (where data has labels), unsupervised learning tries to uncover hidden structure without predefined categories.
The K-Means Algorithm
The goal is to partition data points into $K$ clusters, where each data point belongs to the cluster with the nearest mean (centroid).
Steps of the algorithm:
- Initialization: Choose $K$ cluster centroids randomly.
- Assignment Step: Assign each data point to the closest centroid using Euclidean distance:

$$d(x_i, \mu_j) = \lVert x_i - \mu_j \rVert$$

where $x_i$ is a data point and $\mu_j$ is a centroid.
- Update Step: Recalculate each centroid as the mean of all points assigned to its cluster:

$$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$
- Repeat the assignment and update steps until assignments no longer change or a maximum number of iterations is reached (these steps are sketched in code right after this list).
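To make the loop concrete, here is a minimal from-scratch sketch of the algorithm in NumPy. It is illustrative only (it does not handle empty clusters, for instance); the tutorial itself uses scikit-learn's KMeans.

import numpy as np

def kmeans_naive(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: label each point with its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once centroids (and hence assignments) no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels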
The Cost Function (Objective Function)
K-means minimizes the sum of squared distances (SSD) between each data point and its cluster centroid:

$$J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

For a fixed $K$, the lower this value, the tighter the clusters. (Scikit-learn exposes this quantity as the fitted model's inertia_ attribute.)
Choosing the Right K
How do we pick the number of clusters $K$?
- Elbow Method: Plot the cost $J$ vs. $K$ and look for the "elbow" point, where adding more clusters stops paying off (see the sketch after this list).
- Silhouette Score: Measures how similar each point is to its own cluster compared with other clusters; values near 1 indicate well-separated clusters.
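A quick sketch of both heuristics on synthetic data (make_blobs is used here purely for illustration):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
ks = range(2, 9)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42).fit(X)
    inertias.append(km.inertia_)  # cost J for the elbow plot
    silhouettes.append(silhouette_score(X, km.labels_))

plt.plot(ks, inertias, marker='o')
plt.xlabel('K')
plt.ylabel('Cost J (inertia)')
plt.title('Elbow Method')
plt.show()

print(dict(zip(ks, silhouettes)))  # higher is better; expect a peak near K=4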
K-means works best when:
- Clusters are roughly spherical and similar in size (see the counter-example sketched after this list).
- Data is numeric and continuous.
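When these assumptions are violated, K-means can mislead. Here is a small illustration with two crescent-shaped clusters (scikit-learn's make_moons), which K-means splits incorrectly because the boundary between two centroids is always a straight line:

from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
labels = KMeans(n_clusters=2, random_state=42).fit_predict(X)
# The crescents are not spherical, so K-means cuts straight across them;
# density-based methods such as DBSCAN recover the true shapes here.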
Toy Problem – Image Color Quantization
In image processing, each pixel has a color represented by RGB values. A high-resolution photo may contain thousands of unique colors. K-means clustering can reduce this to just a few dominant colors while still preserving the essence of the image.
Dataset Snapshot
We’ll use a sample image with rich colors (e.g., a landscape photo). Each pixel is represented as:
| Pixel | R | G | B |
|---|---|---|---|
| 1 | 125 | 200 | 80 |
| 2 | 130 | 205 | 85 |
| … | … | … | … |
Our goal: cluster these RGB values into $K$ dominant colors.
Step 1: Import Libraries
We import the necessary libraries: NumPy for computation, Matplotlib for visualization, KMeans from Scikit-learn for clustering, and scikit-image's io module for loading the image.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from skimage import io
Step 2: Load the Image
We load and display the image. Each pixel will later be clustered based on its RGB values.
image = io.imread('sample_image.jpg')
plt.imshow(image)
plt.axis('off')
plt.show()
Step 3: Reshape Image Data
We convert the image into a 2D array where each row is a pixel’s RGB values.
pixels = image.reshape(-1, 3) # Flatten into (num_pixels, 3)
Step 4: Apply K-Means
We cluster pixels into 8 color groups and assign each pixel its closest centroid color.
k = 8  # number of dominant colors to keep
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(pixels)  # clusters every pixel; can take a while on large images
new_colors = kmeans.cluster_centers_[kmeans.labels_]  # map each pixel to its centroid color
Step 5: Reconstruct the Image
We replace each pixel with its cluster centroid, creating a simplified image.
quantized_img = new_colors.reshape(image.shape).astype(np.uint8)  # back to (H, W, 3)
plt.imshow(quantized_img)
plt.axis('off')
plt.show()
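For large, high-resolution images, fitting KMeans on every pixel can be slow. A common speedup, sketched here as an optional variant (it reuses the pixels and image arrays defined above), is to fit on a random subsample and use scikit-learn's MiniBatchKMeans:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(42)
idx = rng.choice(len(pixels), size=min(10_000, len(pixels)), replace=False)
mbk = MiniBatchKMeans(n_clusters=8, random_state=42).fit(pixels[idx])  # fit on a subsample
labels = mbk.predict(pixels)  # then assign every pixel to the learned palette
quantized = mbk.cluster_centers_[labels].reshape(image.shape).astype(np.uint8)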
Quick Reference: Image Color Quantization Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from skimage import io
image = io.imread('sample_image.jpg')
pixels = image.reshape(-1, 3)
k = 8
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(pixels)
new_colors = kmeans.cluster_centers_[kmeans.labels_]
quantized_img = new_colors.reshape(image.shape).astype(np.uint8)
plt.imshow(quantized_img)
plt.axis('off')
plt.show()
Real-World Application – Customer Segmentation
Now let’s apply K-means to a business scenario: grouping customers into meaningful segments based on purchasing behavior.
Dataset Snapshot
Suppose we have the following customer dataset:
| CustomerID | Age | Annual Income (k$) | Spending Score (1–100) |
|---|---|---|---|
| 1 | 19 | 15 | 39 |
| 2 | 21 | 15 | 81 |
| 3 | 20 | 16 | 6 |
| … | … | … | … |
Our goal: Find groups like “high income–low spenders” or “young high spenders.”
Step 1: Import Libraries and Load Data
We load the dataset containing customer demographics and spending patterns.
import pandas as pd
data = pd.read_csv("Mall_Customers.csv")
data.head()
Step 2: Select Relevant Features
We focus on income and spending score for clustering (2D for easy visualization).
X = data[['Annual Income (k$)', 'Spending Score (1-100)']]
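One caveat: K-means relies on Euclidean distance, so features on very different scales can dominate the clustering. These two features happen to have comparable ranges, but when yours do not, a standard precaution is to standardize them first, for example with scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each feature now has mean 0 and unit variance
# Cluster on X_scaled instead of X when feature scales differ widely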
Step 3: Apply K-Means
We cluster customers into 5 groups based on similarities in spending and income.
kmeans = KMeans(n_clusters=5, random_state=42)
data['Cluster'] = kmeans.fit_predict(X)
Step 4: Visualize Clusters
We visualize how K-means grouped customers into distinct segments.
plt.scatter(X['Annual Income (k$)'], X['Spending Score (1-100)'],
            c=data['Cluster'], cmap='rainbow')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segmentation')
plt.show()
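To turn the numeric cluster labels into business-friendly segments such as "high income–low spenders", inspect each cluster's average profile (continuing from the data frame above; the interpretation is illustrative, since cluster numbering varies with initialization):

# Average income and spending score per cluster
print(data.groupby('Cluster')[['Annual Income (k$)', 'Spending Score (1-100)']].mean())
print(data['Cluster'].value_counts())  # how many customers fall in each segment
# A cluster with high mean income but low mean spending score is the
# "high income-low spenders" segment; name the others analogously.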
Quick Reference: Customer Segmentation Code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
data = pd.read_csv("Mall_Customers.csv")
X = data[['Annual Income (k$)', 'Spending Score (1-100)']]
kmeans = KMeans(n_clusters=5, random_state=42)
data['Cluster'] = kmeans.fit_predict(X)
plt.scatter(X['Annual Income (k$)'], X['Spending Score (1-100)'],
            c=data['Cluster'], cmap='rainbow')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segmentation')
plt.show()
Strengths & Limitations
Strengths
- Simple and efficient for large datasets.
- Works well when clusters are spherical and well-separated.
- Easy to implement and interpret.
Limitations
- Requires predefining the number of clusters $K$.
- Sensitive to outliers and noise.
- Assumes clusters of roughly equal size and spherical shape, which may not fit real-world data.
Final Notes
In this tutorial, we explored K-means clustering from both a visual toy problem (image color quantization) and a business scenario (customer segmentation).
We learned:
- The theory behind K-means and its optimization objective.
- How to apply clustering step-by-step to datasets.
- The practical value of clustering in reducing complexity and uncovering hidden groups.
K-means is a fundamental building block in unsupervised learning, serving as the basis for more advanced clustering methods.
Next Steps for You:
- Try applying K-means to text clustering (e.g., grouping news articles).
- Explore advanced clustering algorithms like DBSCAN or Gaussian Mixture Models for non-spherical clusters.
