Everything we've built so far has depended on pre-existing labels: supervised models learn from labeled examples in order to predict labels for new ones. Unfortunately, in the real world, labels are expensive to collect and often simply unavailable. What if you don't have labels? What if you just have data and want to find natural groupings or patterns within it?
This is where clustering comes in. Unlike the supervised models we've covered so far, clustering is an unsupervised learning approach that finds structure in unlabeled data. Today I'm building three fundamentally different clustering algorithms from scratch: K-Means (the fast, simple standard), DBSCAN (for oddly shaped clusters and noisy data), and Gaussian Mixture Models (a probabilistic alternative).
The Clustering Problem
At its core, clustering is about grouping similar things together. Given a dataset with no labels, we want to partition it into groups (clusters) such that points within a group are more similar to each other than to points in other groups.
The General Problem
Given n data points in d-dimensional space, assign each point to one of k groups (or mark it as noise) such that some objective is optimized.
But what does “similar” mean? And what makes a good grouping? That's where different clustering algorithms diverge—they each have different notions of similarity and cluster quality.
Customer Segmentation
Group customers by purchasing behavior to target marketing campaigns. E-commerce companies cluster users into “bargain hunters,” “brand loyal,” “impulse buyers,” etc.
Image Segmentation
Partition an image into regions. Medical imaging uses clustering to separate tissues, tumors, or organs in scans.
Document Organization
Group similar documents together. News aggregators cluster articles about the same story from different sources.
Anomaly Detection
Identify outliers that don't fit any cluster. Fraud detection systems use this to find unusual transactions.
K-Means Clustering
The Classic Centroid-Based Approach
K-Means is probably the most widely used clustering algorithm. The idea is beautifully simple: a cluster is defined by its center point (centroid), and each data point belongs to whichever centroid it's closest to.
Interactive K-Means Clustering
Watch centroids move and points reassign until convergence
The Algorithm
```python
import numpy as np
from scipy.spatial.distance import cdist

def _initialize_centroids(self, X):
    # Pick n_clusters distinct data points at random as the starting centroids
    n_samples, n_features = X.shape
    indices = np.random.choice(n_samples, self.n_clusters, replace=False)
    return X[indices]

def _assign_clusters(self, X, centroids):
    # Distance from every point to every centroid; each point joins the nearest
    distances = cdist(X, centroids, metric='euclidean')
    return np.argmin(distances, axis=1)

def _update_centroids(self, X, labels):
    # Move each centroid to the mean of the points currently assigned to it
    centroids = np.zeros((self.n_clusters, X.shape[1]))
    for k in range(self.n_clusters):
        cluster_points = X[labels == k]
        if len(cluster_points) > 0:  # guard against empty clusters
            centroids[k] = cluster_points.mean(axis=0)
    return centroids
```

The cdist function computes the pairwise Euclidean distances between all points and all centroids. For each point, argmin gives us the index of the closest centroid.
```python
def fit(self, X):
    self.centroids_ = self._initialize_centroids(X)

    for iteration in range(self.max_iters):
        labels = self._assign_clusters(X, self.centroids_)
        new_centroids = self._update_centroids(X, labels)

        # Check for convergence
        centroid_shift = np.sum((new_centroids - self.centroids_) ** 2)
        if centroid_shift < self.tol:
            self.converged_ = True
            break

        self.centroids_ = new_centroids

    self.labels_ = self._assign_clusters(X, self.centroids_)
    return self
```
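To see it in action, here's a minimal usage sketch. It assumes the methods above are bundled into a KMeans class whose constructor takes n_clusters, max_iters, and tol (names inferred from the attributes used in the code), and it generates toy data with scikit-learn's make_blobs:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Three well-separated blobs: the kind of data K-Means handles best
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# KMeans is assumed to wrap the methods shown above
km = KMeans(n_clusters=3, max_iters=100, tol=1e-4)
km.fit(X)

print(km.centroids_)             # (3, 2) array of cluster centers
print(np.bincount(km.labels_))   # number of points assigned to each cluster
```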
Why K-Means Works

K-Means is actually optimizing an objective function called inertia (the within-cluster sum of squares):
J = Σᵢ Σ_{x ∈ Cᵢ} ‖x − μᵢ‖²  (the sum of squared distances from each point to its cluster centroid)

where μᵢ is the centroid of cluster Cᵢ. The assignment and update steps perform coordinate descent on this objective: each step can only decrease it (or leave it unchanged), so K-Means always converges, though possibly to a local minimum rather than the global one.
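If you want to track this number yourself, inertia is easy to compute from the fitted labels and centroids. A small helper sketch (compute_inertia is my own name, not part of the class above):

```python
import numpy as np

def compute_inertia(X, labels, centroids):
    """Within-cluster sum of squared distances to each point's assigned centroid."""
    # centroids[labels] broadcasts each point's own centroid to shape (n_samples, n_features)
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))
```

We'll reuse this helper for the elbow method below.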
K-Means Limitations

For all its popularity, K-Means has well-known weaknesses: you have to choose k up front, the random initialization can land it in a poor local minimum, it assumes roughly spherical clusters of similar size, and outliers can drag centroids away from where they belong.
Choosing K: How Many Clusters?
The Elbow Method
Choosing k is critical. If it's too low, you'll lump distinct groups together. If it's too high, you'll split natural clusters into artificial subgroups. The elbow method helps you find the sweet spot.
The Elbow Method
Finding the optimal number of clusters
The idea: run K-Means for k = 1, 2, 3, … and record the inertia at each value. Plot inertia against k; it always decreases as k grows, but look for the “elbow” where adding another cluster stops reducing inertia by much. That's your k.
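A sketch of that loop, reusing the compute_inertia helper from earlier and the assumed KMeans class (the constructor arguments are again my guess at how the code above is packaged):

```python
import matplotlib.pyplot as plt

ks = range(1, 11)
inertias = []

for k in ks:
    km = KMeans(n_clusters=k, max_iters=100, tol=1e-4)
    km.fit(X)
    inertias.append(compute_inertia(X, km.labels_, km.centroids_))

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('k (number of clusters)')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()
```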
DBSCAN: Finding Clusters by Density
Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) takes a completely different approach. Instead of defining clusters by centroids, it defines them by density—a cluster is a region where points are densely packed together, separated by regions of low density.
Core Points
Points with at least minPts neighbors within radius ε. They form the “bulk” of clusters.
Border Points
Within ε of a core point but don't have enough neighbors themselves. They're on the edges.
Noise Points
Not core points and not within ε of any core point. These are outliers.
Interactive DBSCAN Clustering
Density-based clustering with core, border, and noise points
```python
def fit(self, X):
    n_samples = X.shape[0]
    self.labels_ = np.full(n_samples, -1, dtype=int)  # -1 = noise (or not yet assigned)
    cluster_id = 0

    for point_idx in range(n_samples):
        if self.labels_[point_idx] != -1:  # Already assigned to a cluster
            continue

        neighbors = self._get_neighbors(X, point_idx)

        if len(neighbors) < self.min_samples:  # Not a core point
            continue  # Stays as noise (may become a border point later)

        # Start a new cluster from this core point
        self.labels_[point_idx] = cluster_id
        seed_set = list(neighbors)
        i = 0

        while i < len(seed_set):
            neighbor_idx = seed_set[i]

            if self.labels_[neighbor_idx] == -1:
                self.labels_[neighbor_idx] = cluster_id

            # If this neighbor is itself a core point, its neighbors join the cluster too
            neighbor_neighbors = self._get_neighbors(X, neighbor_idx)
            if len(neighbor_neighbors) >= self.min_samples:
                for new_neighbor in neighbor_neighbors:
                    if new_neighbor not in seed_set:
                        seed_set.append(new_neighbor)

            i += 1

        cluster_id += 1

    return self
```

DBSCAN's genius is that it defines clusters by connectivity through density. Two points are in the same cluster if you can “walk” from one to the other through a chain of core points, where each step is at most ε distance.
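The fit method above leans on a _get_neighbors helper that isn't shown. A minimal sketch, assuming the class stores the radius as self.eps and using a brute-force cdist query (consistent with the K-Means code):

```python
import numpy as np
from scipy.spatial.distance import cdist

def _get_neighbors(self, X, point_idx):
    """Indices of every point within eps of X[point_idx] (the point itself included)."""
    # Brute-force distance query; a KD-tree or ball tree scales better for large datasets
    distances = cdist(X[point_idx:point_idx + 1], X, metric='euclidean')[0]
    return np.where(distances <= self.eps)[0]
```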
| Parameter | Too Small | Too Large |
|---|---|---|
| ε (epsilon) | Everything is noise | Everything is one cluster |
| minPts | Noise gets absorbed into clusters; spurious small clusters appear | Sparse clusters dissolve into noise; more points marked as noise |
DBSCAN Strengths

DBSCAN's big wins: it discovers the number of clusters on its own, it handles arbitrarily shaped clusters, and it explicitly labels outliers as noise instead of forcing every point into a cluster.
Gaussian Mixture Models
Probabilistic Clustering
Both K-Means and DBSCAN make hard assignments—each point belongs to exactly one cluster (or is noise). Gaussian Mixture Models (GMMs) take a probabilistic approach: each point has a probability of belonging to each cluster.
The Generative Model
GMM assumes your data was generated by sampling from a mixture of k Gaussian (normal) distributions, each representing a different cluster:

p(x) = Σₖ πₖ · N(x | μₖ, Σₖ)

where the mixture weights πₖ sum to 1, μₖ is the mean of component k, and Σₖ is its covariance matrix.
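To make the generative story concrete, here's a small sketch that samples a dataset from a two-component mixture; the weights, means, and covariances are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture parameters (made up, not fitted to anything)
weights = np.array([0.6, 0.4])               # mixture weights pi_k, summing to 1
means = np.array([[0.0, 0.0], [4.0, 4.0]])   # component means mu_k
covs = np.array([
    [[1.0, 0.3], [0.3, 1.0]],                # component covariances Sigma_k
    [[0.5, 0.0], [0.0, 2.0]],
])

n_samples = 500
# Step 1: choose a component for each sample according to the weights
components = rng.choice(len(weights), size=n_samples, p=weights)
# Step 2: draw each sample from its chosen component's Gaussian
X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in components])
```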
Interactive Gaussian Mixture Model
Probabilistic clustering with soft assignments
GMMs are trained using the Expectation-Maximization (EM) algorithm, which alternates between two steps until convergence:
E-Step (Expectation)
Given current parameters, compute the responsibility: the probability that each point was generated by each Gaussian.
M-Step (Maximization)
Given these probabilities, update μ, Σ, and π to maximize the likelihood of the data.
γᵢₖ = πₖ · N(xᵢ | μₖ, Σₖ) / Σⱼ πⱼ · N(xᵢ | μⱼ, Σⱼ)

This is the posterior probability of cluster k given point xᵢ (Bayes' theorem): each component's prior weight times its likelihood for the point, normalized over all components.
```python
import numpy as np
from scipy.stats import multivariate_normal

def _e_step(self, X):
    n_samples = X.shape[0]
    responsibilities = np.zeros((n_samples, self.n_components))

    for k in range(self.n_components):
        # Weighted density of every point under component k
        dist = multivariate_normal(mean=self.means_[k], cov=self.covariances_[k])
        responsibilities[:, k] = self.weights_[k] * dist.pdf(X)

    # Normalize so each row sums to 1 (probabilities across clusters)
    responsibilities /= responsibilities.sum(axis=1, keepdims=True)
    return responsibilities

def _m_step(self, X, responsibilities):
    n_samples, n_features = X.shape
    Nk = responsibilities.sum(axis=0)  # Effective number of points per cluster

    # Update mixture weights
    self.weights_ = Nk / n_samples

    # Update means (responsibility-weighted average of the points)
    self.means_ = (responsibilities.T @ X) / Nk[:, np.newaxis]

    # Update covariances (responsibility-weighted scatter around each mean)
    for k in range(self.n_components):
        diff = X - self.means_[k]
        self.covariances_[k] = (responsibilities[:, k][:, np.newaxis] * diff).T @ diff / Nk[k]
        self.covariances_[k] += np.eye(n_features) * 1e-6  # Regularization keeps Sigma invertible
```
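The _e_step and _m_step above still need an outer loop. A minimal fit sketch, assuming weights_, means_, and covariances_ were already initialized somewhere (for example from a quick K-Means run) and that max_iters and tol are constructor arguments:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit(self, X):
    prev_log_likelihood = -np.inf

    for iteration in range(self.max_iters):
        responsibilities = self._e_step(X)
        self._m_step(X, responsibilities)

        # Log-likelihood of the data under the current mixture parameters
        likelihood = np.zeros(X.shape[0])
        for k in range(self.n_components):
            dist = multivariate_normal(mean=self.means_[k], cov=self.covariances_[k])
            likelihood += self.weights_[k] * dist.pdf(X)
        log_likelihood = np.sum(np.log(likelihood))

        if log_likelihood - prev_log_likelihood < self.tol:
            break
        prev_log_likelihood = log_likelihood

    # Hard labels for convenience: the most responsible component per point
    self.labels_ = np.argmax(self._e_step(X), axis=1)
    return self
```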
Comparing the Three

Each algorithm has different strengths and assumptions. The best choice depends entirely on your data's structure and your requirements.
Clustering Algorithm Comparison
See how different algorithms handle various data shapes
| Algorithm | Cluster Shape | # Clusters | Noise Handling | Speed |
|---|---|---|---|---|
| K-Means | Spherical | Must specify | Poor (outliers skew centroids) | Very fast |
| DBSCAN | Arbitrary | Automatic | Excellent (explicit noise detection) | Medium |
| GMM | Elliptical | Must specify | Moderate (outliers just get low probabilities) | Slower |
K-Means
Best for: Well-separated, spherical clusters of similar size. When speed matters and you know k.
DBSCAN
Best for: Arbitrary-shaped clusters, noisy data, unknown number of clusters. When outlier detection matters.
GMM
Best for: Overlapping clusters, soft assignments needed, elliptical cluster shapes. Probabilistic interpretation.
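A rough side-by-side sketch makes the contrast concrete. The class names (KMeans, DBSCAN, GMM) and constructor arguments are assumptions about how you've packaged the code in this post; the data is scikit-learn's two-moons toy set, which is deliberately non-spherical:

```python
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-spherical, so centroid-based methods struggle
X, _ = make_moons(n_samples=400, noise=0.07, random_state=0)

km = KMeans(n_clusters=2, max_iters=100, tol=1e-4)
km.fit(X)

db = DBSCAN(eps=0.2, min_samples=5)
db.fit(X)

gmm = GMM(n_components=2, max_iters=100, tol=1e-4)
gmm.fit(X)

# Expect K-Means and GMM to slice each moon roughly in half,
# while DBSCAN recovers the two moons and labels stray points as -1.
```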
Evaluating Clusters
Unlike supervised learning, we don't have ground truth labels to measure accuracy. How do we know if our clustering is good? Several metrics help evaluate cluster quality:
Silhouette Score
For each point, the silhouette score measures how similar it is to its own cluster compared to other clusters:
s(i) = (b(i) − a(i)) / max(a(i), b(i))

where a(i) is the mean distance from point i to the other points in its own cluster and b(i) is the mean distance to the points in the nearest other cluster. Scores near +1 mean the point sits firmly inside its cluster, scores near 0 mean it lies on a boundary, and negative scores suggest it was probably assigned to the wrong cluster.
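A from-scratch sketch of the silhouette score (silhouette_score_scratch is my own helper name; in practice sklearn.metrics.silhouette_score does the same job):

```python
import numpy as np
from scipy.spatial.distance import cdist

def silhouette_score_scratch(X, labels):
    """Mean silhouette over all clustered points; noise labeled -1 is skipped."""
    labels = np.asarray(labels)
    distances = cdist(X, X)
    cluster_ids = [c for c in set(labels.tolist()) if c != -1]
    scores = []

    for i in range(len(X)):
        if labels[i] == -1:
            continue
        same = labels == labels[i]
        same[i] = False  # a(i) excludes the point itself
        others = [c for c in cluster_ids if c != labels[i]]
        if not same.any() or not others:
            continue  # undefined for singleton clusters or a single cluster overall
        a = distances[i, same].mean()
        b = min(distances[i, labels == c].mean() for c in others)
        scores.append((b - a) / max(a, b))

    return float(np.mean(scores))
```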
Calinski-Harabasz Index
Ratio of between-cluster variance to within-cluster variance. Higher is better—favors clusters that are dense and well-separated.
Davies-Bouldin Index
Average similarity between each cluster and its most similar cluster. Lower is better—minimize within-cluster scatter, maximize between-cluster separation.
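In practice all three metrics are one-liners in scikit-learn. A quick sketch, where X and labels are whatever dataset and cluster assignment you're evaluating:

```python
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

# Filter out DBSCAN's -1 noise labels first, since they aren't a real cluster
print(silhouette_score(X, labels))          # higher is better, range [-1, 1]
print(calinski_harabasz_score(X, labels))   # higher is better
print(davies_bouldin_score(X, labels))      # lower is better
```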
When to Use What
✓ Choose K-Means When
- • Clusters are roughly spherical
- • You know the number of clusters
- • Speed is critical (large datasets)
- • Clusters are well-separated
- • You need a simple baseline
✓ Choose DBSCAN When
- • Clusters have arbitrary shapes
- • You don't know the number of clusters
- • Data contains noise and outliers
- • Clusters have varying densities
- • You need outlier detection
✓ Choose GMM When
- • Clusters are elliptical
- • You need soft/probabilistic assignments
- • Clusters overlap significantly
- • Data is Gaussian-ish
- • You need uncertainty estimates
⚠ Common Pitfalls
- • Not scaling features before clustering (see the sketch after this list)
- • Using K-Means for non-spherical data
- • Wrong ε/minPts for DBSCAN
- • Too many components in GMM
- • Ignoring the elbow method
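On that first pitfall: every algorithm here works in raw feature units, so a feature with a large scale dominates every distance and covariance. A minimal fix, sketched with scikit-learn's StandardScaler and the assumed KMeans class from earlier:

```python
from sklearn.preprocessing import StandardScaler

# Without scaling, a feature measured in the thousands (e.g. income) drowns out
# one measured in single digits (e.g. purchase count) in every distance calculation.
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=3, max_iters=100, tol=1e-4)
km.fit(X_scaled)
```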
Final Thoughts
Happy coding, and don't get too clustered.