K-Means Clustering: Finding Hidden Patterns in Data

Image Color Quantization → Customer Segmentation

Why K-Means Clustering?

Have you ever tried to simplify a photo by reducing its colors to just a handful of shades, like a digital painting? Or wondered how businesses group customers into clusters like “budget shoppers,” “occasional buyers,” and “loyal customers” without asking them directly?

Both of these problems can be solved using K-means clustering. It’s an unsupervised learning algorithm that automatically finds structure in data by grouping similar points together.

In this tutorial:

  • We’ll start with a toy problem: simplifying an image using color quantization.
  • Then, we’ll scale up to a real-world application: segmenting customers based on their behavior.

How K-Means Clustering Works

K-means is one of the most widely used unsupervised machine learning algorithms for clustering. Unlike supervised learning (where data has labels), unsupervised learning tries to uncover hidden structure without predefined categories.

The K-Means Algorithm

The goal is to partition n data points into K clusters, where each data point belongs to the cluster with the nearest mean (centroid).

Steps of the algorithm:

  1. Initialization: Choose K cluster centroids randomly.
  2. Assignment Step: Assign each data point to the closest centroid using Euclidean distance: d(x, \mu) = \sqrt{ \sum_{i} (x_i - \mu_i)^2 }, where x is a data point, \mu is a centroid, and the sum runs over the feature dimensions.
  3. Update Step: Recalculate centroids as the mean of all points assigned to that cluster: \mu_j = \dfrac{1}{|C_j|} \sum_{x \in C_j} x
  4. Repeat the assignment and update steps until assignments no longer change or a maximum number of iterations is reached (a minimal sketch of this loop follows below).
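
To make the algorithm concrete, here is a minimal NumPy sketch of the loop, assuming a tiny 2D dataset invented purely for illustration (a production implementation would also need to handle clusters that end up empty):

import numpy as np

# Toy 2D data, illustrative only
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
K = 2

rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=K, replace=False)]  # Step 1: random initialization

for _ in range(100):  # Step 4: iterate until convergence (or max iterations)
    # Step 2: assign each point to the nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(K)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)  # final cluster centers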

The Cost Function (Objective Function)

K-means minimizes the sum of squared distances (SSD) between each data point and its cluster centroid:

J = \sum_{j=1}^{K} \sum_{x \in C_j} \| x - \mu_j \|^2

The lower this value, the better the clustering. Scikit-learn exposes it as the inertia_ attribute of a fitted KMeans model.

Choosing the Right K

How do we pick the number of clusters K?

  • Elbow Method: Plot the cost J against K and look for the “elbow” point where adding more clusters stops paying off.
  • Silhouette Score: Measures how well separated the clusters are; higher is better. (Both are sketched below.)
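
Both diagnostics are easy to compute with Scikit-learn; here is a rough sketch, assuming a prepared numeric feature matrix X (Scikit-learn exposes the cost J as the inertia_ attribute):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

inertias, silhouettes = [], []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                         # cost J, for the elbow plot
    silhouettes.append(silhouette_score(X, km.labels_))  # higher is better

plt.plot(ks, inertias, marker='o')
plt.xlabel('K')
plt.ylabel('Cost J (inertia)')
plt.show()  # pick K near the bend, cross-checking against the silhouette scores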

K-means works best when:

  • Clusters are roughly spherical and similar in size.
  • Data is numeric and continuous, with features on comparable scales (see the note below).
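
Because K-means relies on Euclidean distance, a feature measured on a much larger scale can dominate the clustering. A common precaution, shown here as a one-line sketch (the examples below skip it because their two features happen to have comparable ranges), is to standardize the features first:

from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature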

Toy Problem – Image Color Quantization

In image processing, each pixel has a color represented by RGB values. A high-resolution photo may contain thousands of unique colors. K-means clustering can reduce this to just a few dominant colors while still preserving the essence of the image.

Dataset Snapshot

We’ll use a sample image with rich colors (e.g., a landscape photo). Each pixel is represented as:

Pixel   R     G     B
1       125   200   80
2       130   205   85

Our goal: Cluster these RGB values into K dominant colors.

Step 1: Import Libraries

We import the necessary libraries: NumPy for computation, Matplotlib for visualization, KMeans from Scikit-learn for clustering, and scikit-image to load the image.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from skimage import io

Step 2: Load the Image

We load and display the image. Each pixel will later be clustered based on its RGB values.

image = io.imread('sample_image.jpg')
plt.imshow(image)
plt.axis('off')
plt.show()

Step 3: Reshape Image Data

We convert the image into a 2D array where each row is a pixel’s RGB values.

pixels = image.reshape(-1, 3)  # Flatten into (num_pixels, 3); assumes a 3-channel RGB image

Step 4: Apply K-Means

We cluster pixels into 8 color groups and assign each pixel its closest centroid color.

k = 8  # number of dominant colors to keep
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # fixed n_init for reproducible centroids
kmeans.fit(pixels)
new_colors = kmeans.cluster_centers_[kmeans.labels_]  # map each pixel to its centroid's color

Step 5: Reconstruct the Image

We replace each pixel with its cluster centroid, creating a simplified image.

quantized_img = new_colors.reshape(image.shape).astype(np.uint8)
plt.imshow(quantized_img)
plt.axis('off')
plt.show()

Quick Reference: Image Color Quantization Code

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from skimage import io

image = io.imread('sample_image.jpg')
pixels = image.reshape(-1, 3)

k = 8
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
kmeans.fit(pixels)
new_colors = kmeans.cluster_centers_[kmeans.labels_]

quantized_img = new_colors.reshape(image.shape).astype(np.uint8)
plt.imshow(quantized_img)
plt.axis('off')
plt.show()
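
To see how the choice of K trades color fidelity for simplicity, a small variation on the code above (reusing the pixels and image arrays; the K values are arbitrary) quantizes the same image at several levels:

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, k in zip(axes, [2, 4, 8, 16]):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(pixels)
    recolored = km.cluster_centers_[km.labels_].reshape(image.shape).astype(np.uint8)
    ax.imshow(recolored)
    ax.set_title(f'K = {k}')
    ax.axis('off')
plt.show()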

Real-World Application – Customer Segmentation

Now let’s apply K-means to a business scenario: grouping customers into meaningful segments based on purchasing behavior.

Dataset Snapshot

Suppose we have the following customer dataset:

CustomerID   Age   Annual Income (k$)   Spending Score (1–100)
1            19    15                   39
2            21    15                   81
3            20    16                   6

Our goal: Find groups like “high income–low spenders” or “young high spenders.”

Step 1: Import Libraries and Load Data

We load the dataset containing customer demographics and spending patterns.

import pandas as pd

data = pd.read_csv("Mall_Customers.csv")
data.head()

Step 2: Select Relevant Features

We focus on income and spending score for clustering (2D for easy visualization).

X = data[['Annual Income (k$)', 'Spending Score (1-100)']]

Step 3: Apply K-Means

We cluster customers into 5 groups based on similarities in spending and income.

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
data['Cluster'] = kmeans.fit_predict(X)
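
The cluster labels alone do not tell us what each segment means. One way to interpret them, sketched below, is to inspect the cluster centers (the segment names are our reading of the centroids, not something the algorithm outputs):

centers = pd.DataFrame(kmeans.cluster_centers_,
                       columns=['Annual Income (k$)', 'Spending Score (1-100)'])
print(centers)
# e.g., a center with high income but a low spending score corresponds
# to the "high income–low spenders" segment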

Step 4: Visualize Clusters

We visualize how K-means grouped customers into distinct segments.

plt.scatter(X['Annual Income (k$)'], X['Spending Score (1-100)'], 
            c=data['Cluster'], cmap='rainbow')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segmentation')
plt.show()

Quick Reference: Customer Segmentation Code

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = pd.read_csv("Mall_Customers.csv")
X = data[['Annual Income (k$)', 'Spending Score (1-100)']]

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
data['Cluster'] = kmeans.fit_predict(X)

plt.scatter(X['Annual Income (k$)'], X['Spending Score (1-100)'], 
            c=data['Cluster'], cmap='rainbow')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segmentation')
plt.show()

Strengths & Limitations

Strengths

  • Simple and efficient for large datasets.
  • Works well when clusters are spherical and well-separated.
  • Easy to implement and interpret.

Limitations

  • Requires choosing the number of clusters K in advance.
  • Sensitive to outliers, noise, and the random centroid initialization.
  • Implicitly assumes spherical, similarly sized clusters, which may not fit real-world data.

Final Notes

In this tutorial, we explored K-means clustering from both a visual toy problem (image color quantization) and a business scenario (customer segmentation).

We learned:

  • The theory behind K-means and its optimization objective.
  • How to apply clustering step-by-step to datasets.
  • The practical value of clustering in reducing complexity and uncovering hidden groups.

K-means is a fundamental building block in unsupervised learning, serving as the basis for more advanced clustering methods.

Next Steps for You:

  • Try applying K-means to text clustering (e.g., grouping news articles).
  • Explore advanced clustering algorithms like DBSCAN or Gaussian Mixture Models for non-spherical clusters (a short sketch follows).
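
As a starting point for the second suggestion, here is a minimal sketch that swaps in Scikit-learn's Gaussian Mixture Model on the same customer features X from above; unlike K-means, it fits elliptical clusters and produces soft (probabilistic) assignments:

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=5, random_state=42).fit(X)
data['GMM Cluster'] = gmm.predict(X)  # hard labels, comparable to the K-means clusters
probs = gmm.predict_proba(X)          # soft membership probabilities per cluster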

