
Dimensionality Reduction

PCA, LDA, t-SNE, UMAP. Compressing high-dimensional worlds.

Written by Omansh
14 min read

By now, you've seen datasets with 10, 50, maybe even 100 features. But what does that look like? How do you visualize a point in 100-dimensional space? How do you understand the structure of data when you can't even plot it?

The answer is that you can't. At least not directly. Humans are stuck in 3D, and even 3D is pushing it for most visualizations. This is where dimensionality reduction comes in—taking high-dimensional data and finding a lower-dimensional representation that preserves what's important.

Today I'm building four fundamental approaches from scratch: PCA (variance-based), LDA (class-separation-based), t-SNE (neighborhood-based), and UMAP (topology-based). Each takes a different philosophy on what's important to preserve when squashing down dimensions.

The Dimensionality Reduction Problem

At its core, dimensionality reduction is a mapping problem. You have data in d dimensions, and you want to find a representation in k dimensions (k < d) that preserves some important property of the data.

The Formal Problem

Given X ∈ ℝⁿˣᵈ (n samples, d features), find a transformation f such that Y = f(X) where Y ∈ ℝⁿˣᵏ, and Y preserves some property of X.

The challenge is that you're throwing away information—going from d dimensions to k dimensions means losing (d−k) dimensions worth of data. The art is in losing the right information.
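To make the shapes concrete, here is a minimal sketch (a toy example of my own, where a random matrix stands in for whatever transformation f a real method would learn):

import numpy as np

# Toy illustration of the mapping problem: n=500 samples, d=100 features,
# reduced to k=2. The random matrix W is just a placeholder for f.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))   # X ∈ ℝⁿˣᵈ
W = rng.normal(size=(100, 2))     # a deliberately arbitrary linear map
Y = X @ W                         # Y ∈ ℝⁿˣᵏ
print(X.shape, Y.shape)           # (500, 100) (500, 2)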

Why Reduce Dimensions?

More than just visualization

Visualization

Humans can't visualize beyond 3D. Reducing to 2D or 3D lets us see patterns, clusters, and outliers that are invisible in high dimensions.

Computational Efficiency

Algorithms scale badly with dimensions. Training models, computing distances, and running clustering all speed up dramatically with fewer features.

Curse of Dimensionality

In high dimensions, distances concentrate: every point ends up roughly the same distance from every other point, so "nearest" and "farthest" neighbors barely differ. A 100D unit hypercube has more than 99.99% of its volume out in its corners, far from the center. Distance-based methods like KNN stop working.

Noise Reduction

Often, high-dimensional data lives on a lower-dimensional manifold. The extra dimensions are just noise. Finding the underlying manifold removes noise.
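To see the curse of dimensionality numerically, here is a quick sanity check (a toy experiment of my own, not from the original source): sample uniform points and compare the nearest and farthest distances from a single point as d grows.

import numpy as np

# Distance concentration: the relative gap between nearest and farthest
# neighbor shrinks as dimensionality grows.
rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(1000, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)     # distances from point 0
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  (max - min) / min ≈ {contrast:.2f}")
# The contrast is large in low dimensions and collapses toward zero in high
# dimensions: nearest and farthest neighbors become nearly indistinguishable.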

Linear vs Nonlinear Methods

Linear methods (PCA, LDA) find linear projections—they multiply your data by a matrix. They're fast and well-understood but can only capture linear relationships.

Nonlinear methods (t-SNE, UMAP) can capture curved, twisted manifolds. They're more powerful but slower and harder to interpret. The right choice depends on your data structure and what you're optimizing for.

PCA: Maximizing Variance

Finding directions of maximum variance

Principal Component Analysis is the workhorse of dimensionality reduction. The idea is beautifully simple: find directions in feature space where the data varies the most, and project onto those directions.

The Intuition

Imagine data on people's heights and weights. These are correlated—tall people tend to be heavier. If you plot them, you get an elongated ellipse. The first principal component points along the long axis (maximum variance). The second points along the short axis (perpendicular, second-most variance).

Interactive: PCA Visualization

Drag the slider to rotate the projection axis. PC1 captures maximum variance.


PCA finds the directions of maximum variance by analyzing the covariance matrix. The covariance matrix captures how features vary together. If two features tend to increase together, they have positive covariance. If one increases when the other decreases, negative covariance. If they vary independently, zero covariance.

What most people miss is that the covariance matrix encodes the shape of your data. If your data forms an elongated ellipse, the covariance matrix captures the direction of elongation. The eigenvectors point along the axes of the ellipse; the eigenvalues tell you how stretched it is in each direction.
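Here is a quick synthetic check of that claim, using made-up height/weight data purely for illustration:

import numpy as np

# Correlated 2D data: the covariance matrix's eigenvectors point along the
# axes of the data ellipse, and the eigenvalues measure how stretched it is.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=500)
weight = 70 + 0.9 * (height - 170) + rng.normal(0, 5, size=500)
X = np.column_stack([height, weight])

C = np.cov(X, rowvar=False)                    # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)  # eigh: C is symmetric
print(C)             # positive off-diagonal: height and weight co-vary
print(eigenvalues)   # one small, one large: short axis vs long axis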

The PCA Algorithm

1. Center: Subtract the mean from each feature so the data is centered at the origin.
2. Covariance: Compute the covariance matrix C = (1/n)X̄ᵀX̄.
3. Eigendecomposition: Find the eigenvectors and eigenvalues of C.
4. Sort: Sort the eigenvectors by eigenvalue (descending).
5. Select: Keep the top k eigenvectors as principal components.
6. Project: Multiply the centered data by the component matrix.
pca.py
import numpy as np

class PCA:
    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        # Step 1: Center the data
        self.mean_ = np.mean(X, axis=0)
        X_centered = X - self.mean_

        # Step 2: Compute covariance matrix
        cov_matrix = np.cov(X_centered.T)

        # Step 3: Eigendecomposition (eigh, since the covariance matrix is symmetric)
        eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

        # Step 4: Sort by eigenvalue (descending)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]

        # Step 5: Keep top k components (stored as rows)
        self.components_ = eigenvectors[:, :self.n_components].T
        self.explained_variance_ = eigenvalues[:self.n_components]

        return self

    def transform(self, X):
        # Step 6: Project onto principal components
        X_centered = X - self.mean_
        return X_centered @ self.components_.T
Maximize: wᵀCw subject to ||w|| = 1

The direction w that maximizes this is the eigenvector of C with the largest eigenvalue.
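A short usage sketch of the class above on synthetic data (the redundant-feature setup is mine, just for illustration):

import numpy as np

# Five features, two of which are near-copies of others, so most of the
# variance lives in a lower-dimensional subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)
X[:, 4] = X[:, 1] + 0.1 * rng.normal(size=200)

pca = PCA(n_components=2)          # the PCA class defined above
Y = pca.fit(X).transform(X)
print(Y.shape)                     # (200, 2)
print(pca.explained_variance_)     # variance captured by each component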

Interactive: Explained Variance (Scree Plot)

Choose enough components to capture ~90-95% of variance

The scree plot shows the variance explained by each principal component, both individually and cumulatively. Look for the elbow where additional components stop adding much, or keep components until the cumulative curve crosses your threshold.
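In code, that rule is just a cumulative sum over the sorted eigenvalues; the eigenvalues below are made up for illustration:

import numpy as np

# Pick the smallest k whose cumulative explained variance crosses a threshold.
eigenvalues = np.array([4.2, 2.1, 1.1, 0.6, 0.4, 0.3, 0.2, 0.1])  # example values
ratios = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratios)
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(np.round(cumulative, 2))   # [0.47 0.7  0.82 0.89 0.93 0.97 0.99 1.  ]
print(k)                         # 6 components to reach 95% variance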

PCA Limitations

PCA is unsupervised—it doesn't know about class labels. If variance isn't aligned with what you care about (like class separation), PCA might not help. It also assumes linear relationships and can't capture curved manifolds.

LDA: Maximizing Class Separation

Supervised dimensionality reduction

Linear Discriminant Analysis is like PCA's supervised cousin. Instead of maximizing variance, it maximizes the separation between classes while minimizing variance within classes. This makes it perfect for preprocessing before classification.

The Intuition

Imagine two classes of points. PCA might find that the direction of maximum variance goes right through both classes, mixing them together. LDA finds the direction that best separates the classes—pulling class means apart while keeping each class tight.

Interactive: LDA vs PCA Class Separation

LDA maximizes class separation while PCA just finds maximum variance

Notice how LDA separates the classes better, while PCA might mix them.

LDA optimizes the ratio of between-class scatter to within-class scatter. Between-class scatter measures how far apart the class means are. Within-class scatter measures how spread out points are within each class. We want high between-class scatter (classes separated) and low within-class scatter (classes compact).

Maximize: J(w) = (wᵀSᴮw) / (wᵀSᵂw)

Sᴮ = between-class scatter, Sᵂ = within-class scatter

lda.py
import numpy as np
from scipy.linalg import eigh   # generalized eigenvalue solver

class LDA:
    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X, y):
        classes = np.unique(y)

        # Compute class means and global mean
        self.mean_ = np.mean(X, axis=0)
        self.class_means_ = {}
        for c in classes:
            self.class_means_[c] = np.mean(X[y == c], axis=0)

        # Within-class scatter matrix
        within_class_scatter = np.zeros((X.shape[1], X.shape[1]))
        for c in classes:
            X_c = X[y == c]
            within_class_scatter += np.cov(X_c.T) * (len(X_c) - 1)

        # Between-class scatter matrix
        between_class_scatter = np.zeros((X.shape[1], X.shape[1]))
        for c in classes:
            n_c = np.sum(y == c)
            mean_diff = (self.class_means_[c] - self.mean_).reshape(-1, 1)
            between_class_scatter += n_c * (mean_diff @ mean_diff.T)

        # Solve generalized eigenvalue problem: Sb v = λ Sw v
        eigenvalues, eigenvectors = eigh(between_class_scatter, within_class_scatter)

        # Take top components (at most n_classes - 1 are meaningful)
        idx = np.argsort(eigenvalues)[::-1]
        self.components_ = eigenvectors[:, idx[:self.n_components]].T

        return self
Dimensionality Constraint: LDA can find at most (k−1) discriminants for k classes. For binary classification, you get exactly one dimension—the line that best separates the two classes.
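Here is a usage sketch of the class above on two synthetic Gaussian blobs (my own toy data); with two classes, n_components can only be 1:

import numpy as np

# Two 3D classes separated along a diagonal direction.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0, 0], scale=1.0, size=(100, 3)),
    rng.normal(loc=[3, 3, 0], scale=1.0, size=(100, 3)),
])
y = np.array([0] * 100 + [1] * 100)

lda = LDA(n_components=1)      # two classes → at most one discriminant
lda.fit(X, y)
Y = X @ lda.components_.T      # project onto the discriminant direction
print(Y.shape)                 # (200, 1)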

t-SNE: Preserving Local Structure

Neighborhood preservation through probability

t-distributed Stochastic Neighbor Embedding takes a completely different approach. Instead of linear projections, it uses gradient descent to position points in low dimensions such that nearby points in high dimensions stay nearby, and far points stay far.

The Intuition

t-SNE models pairwise similarities. In high-D space, it computes “how similar is point i to point j?” as a probability. In low-D space, it tries to match these probabilities. The algorithm iteratively adjusts positions to minimize the difference between high-D and low-D similarity distributions.

Interactive: t-SNE Unrolling Manifolds

Compare how PCA vs t-SNE handle non-linear structure


PCA (linear) projects to the plane of maximum variance, but this mixes up points that are far apart along the manifold. Notice how different colors overlap—they shouldn't!

High-Dimensional Similarities: For each point i, we compute a probability distribution over all other points based on Gaussian similarity. The bandwidth σᵢ is chosen for each point to match a target perplexity—roughly the “effective number of neighbors.”

pⱼ|ᵢ = exp(−||xᵢ − xⱼ||² / 2σᵢ²) / Σₖ≠ᵢ exp(−||xᵢ − xₖ||² / 2σᵢ²)

Conditional probability that point i would pick point j as a neighbor
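The σᵢ search isn't shown in the code below, so here is one common way to do it (my own sketch, with a hypothetical helper name): binary-search σ until the entropy of pⱼ|ᵢ matches log₂(perplexity).

import numpy as np

def find_sigma(dist_sq_i, target_perplexity, n_steps=50):
    """Binary-search σᵢ so that 2^entropy(p_{j|i}) ≈ target perplexity.

    dist_sq_i: squared distances from point i to every other point (self excluded).
    """
    lo, hi = 1e-10, 1e4
    target_entropy = np.log2(target_perplexity)
    for _ in range(n_steps):
        sigma = (lo + hi) / 2
        p = np.exp(-dist_sq_i / (2 * sigma ** 2))
        p = p / (p.sum() + 1e-12)
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if entropy > target_entropy:
            hi = sigma   # too flat (too many effective neighbors): shrink σ
        else:
            lo = sigma   # too peaked (too few neighbors): grow σ
    return sigma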

Low-Dimensional Similarities: In the embedding, we use a t-distribution (Student's t with 1 degree of freedom) instead of Gaussian. The heavier tails help avoid the “crowding problem” where dissimilar points get squeezed together.

qᵢⱼ = (1 + ||yᵢ − yⱼ||²)⁻¹ / Σₖ≠ₗ (1 + ||yₖ − yₗ||²)⁻¹

t-distribution has heavier tails than Gaussian—allows distant points to be far
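For reference, the Q matrix that the code below gets from _compute_q_distribution can be written in a few lines (this is my reconstruction of that helper, not the post's exact source):

import numpy as np

def compute_q_distribution(Y):
    # Pairwise Student-t similarities in the low-D embedding.
    diff = Y[:, np.newaxis, :] - Y[np.newaxis, :, :]   # (n, n, k) differences
    dist_sq = np.sum(diff ** 2, axis=2)                # squared distances
    inv = 1.0 / (1.0 + dist_sq)                        # heavy-tailed kernel
    np.fill_diagonal(inv, 0.0)                         # exclude i == j
    return inv / inv.sum()                             # normalize over all pairs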

The objective is to minimize the KL divergence between P (high-D) and Q (low-D): KL(P‖Q) = Σᵢ Σⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ).

tsne.py
def fit_transform(self, X):
    n = len(X)

    # Compute high-D pairwise similarities (with perplexity)
    P = self._compute_pairwise_affinities(X, self.perplexity)
    P = (P + P.T) / (2 * n)  # Symmetrize

    # Initialize embedding randomly
    Y = np.random.randn(n, self.n_components) * 0.01

    # Gradient descent
    for iteration in range(self.n_iter):
        # Compute low-D similarities (t-distribution)
        Q = self._compute_q_distribution(Y)

        # Compute gradient: ∂KL/∂yᵢ = 4 Σⱼ (pᵢⱼ - qᵢⱼ)(yᵢ - yⱼ)(1 + ||yᵢ-yⱼ||²)⁻¹
        gradients = np.zeros_like(Y)
        for i in range(n):
            diff = Y[i] - Y                      # (n, n_components)
            dist_sq = np.sum(diff ** 2, axis=1)
            q_factor = 1 / (1 + dist_sq)
            pq_diff = (P[i] - Q[i]) * q_factor
            gradients[i] = 4 * np.sum(pq_diff[:, np.newaxis] * diff, axis=0)

        # Update positions
        Y = Y - self.learning_rate * gradients

    return Y

t-SNE Quirks

t-SNE is powerful but has caveats: distances between clusters are meaningless—cluster A being close to B doesn't mean they're similar. It's not deterministic—different runs give different embeddings. And it's slow—O(n²) for pairwise similarities.

UMAP: Topology-Preserving Embedding

Faster, better global structure

UMAP (Uniform Manifold Approximation and Projection) arrived in 2018. Like t-SNE, it preserves local structure, but it's faster, preserves more global structure, and has better theoretical foundations in topology and manifold theory.

The Intuition

UMAP assumes your data lies on a manifold embedded in high-dimensional space. It constructs a fuzzy k-nearest-neighbor graph representing the manifold topology, then finds a low-dimensional layout of that graph. Edge weights represent “fuzzy membership”—not binary in/out, but partial membership strength.

Interactive: UMAP Cluster Separation

UMAP finds 2D layouts preserving cluster structure from 3D


Fuzzy Membership: Instead of “point j is a neighbor of i” (yes/no), UMAP says “point j has 0.8 membership in i's neighborhood.” Closer points get higher membership; distant points get membership near 0. The membership function adapts to each point's local density.

umap.py
def fit_transform(self, X):
    n = len(X)

    # Step 1: Build fuzzy k-NN graph.
    # For each point: find its k nearest neighbors, a local radius ρ
    # (distance to the nearest neighbor), and a local scale σ
    # (from a perplexity-like target).
    graph = np.zeros((n, n))

    for i in range(n):
        distances = compute_distances(X[i], X)
        neighbors = np.argsort(distances)[1:self.n_neighbors + 1]

        rho = distances[neighbors[0]]  # Distance to nearest neighbor
        sigma = binary_search_for_sigma(distances, neighbors)

        # Fuzzy membership: exp(-max(0, d - ρ) / σ)
        for j in neighbors:
            weight = np.exp(-max(0, distances[j] - rho) / sigma)
            graph[i, j] = weight

    # Symmetrize via fuzzy union: a + b - a*b
    graph = graph + graph.T - graph * graph.T

    # Step 2: Optimize low-D layout (minimize cross-entropy)
    Y = initialize_embedding(n)

    for epoch in range(n_epochs):
        # Attraction: pull connected points together
        # Repulsion: push disconnected points apart
        Y = gradient_step(Y, graph, learning_rate)

    return Y
Parameter     Effect                           Default
n_neighbors   Local vs global structure        15
min_dist      How tightly points cluster       0.1
metric        Distance metric for neighbors    euclidean
Why UMAP is faster: O(n log n) with approximate nearest neighbors vs O(n²) for t-SNE. Why better global structure: Cross-entropy objective preserves more inter-cluster relationships than KL divergence. Distances between clusters are more meaningful.
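If you reach for the off-the-shelf implementation instead of the from-scratch sketch, the same three parameters appear directly in the umap-learn package (assuming it is installed; the data here is random and only illustrates the call):

import numpy as np
import umap   # the umap-learn package

# The table's parameters map one-to-one onto the library's constructor.
X = np.random.default_rng(0).normal(size=(1000, 50))
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="euclidean")
embedding = reducer.fit_transform(X)   # 2D by default
print(embedding.shape)                 # (1000, 2)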

Comparing the Four

Each algorithm has different strengths and assumptions. The best choice depends on your data structure, whether you have labels, and what you want to preserve.

Method   Type       Preserves             Speed      Labels?
PCA      Linear     Variance              Very Fast  No
LDA      Linear     Class Separation      Very Fast  Yes
t-SNE    Nonlinear  Local Neighborhoods   Slow       No
UMAP     Nonlinear  Topology              Fast       No*

*UMAP also has an optional supervised mode that can use labels.

PCA

  • ✓ Fast, deterministic, interpretable
  • ✓ Components have meaning (variance directions)
  • ✓ Inverse transform exists
  • ✗ Only captures linear structure
  • ✗ Ignores class labels

LDA

  • ✓ Uses class labels for better separation
  • ✓ Great preprocessing for classification
  • ✓ Fast and deterministic
  • ✗ Assumes Gaussian classes
  • ✗ Limited to (k−1) dimensions for k classes

t-SNE

  • ✓ Excellent cluster visualization
  • ✓ Captures nonlinear structure
  • ✓ Reveals hidden clusters
  • ✗ Slow (O(n²))
  • ✗ Non-deterministic
  • ✗ Global distances meaningless

UMAP

  • ✓ Fast (O(n log n))
  • ✓ Better global structure than t-SNE
  • ✓ Scales to millions of points
  • ✓ Theoretical foundation
  • ✗ Hyperparameter sensitive
  • ✗ Can create false structure

When to Use What

Use PCA when...

  • You need a fast, interpretable baseline
  • Data is roughly linear
  • You want to preserve global variance
  • You need inverse transform (reconstruction)
  • Preprocessing for another algorithm

Use LDA when...

  • You have class labels and care about classification
  • You want maximum class separation
  • Data is roughly Gaussian per class
  • Preprocessing for a classifier

Use t-SNE when...

  • Visualization is the primary goal
  • You want to reveal cluster structure
  • Dataset is < 10,000 points
  • You're okay with non-deterministic results

Use UMAP when...

  • Large dataset (t-SNE would be too slow)
  • You need some global structure preservation
  • Preparing for downstream clustering
  • You want faster iterations with hyperparameters
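One pattern that combines several of these recommendations (a hedged sketch using scikit-learn, not something from the original post): use PCA as a fast preprocessing step, then run a nonlinear method on the reduced data.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA first to strip noisy dimensions and speed things up, then t-SNE for the
# final 2D visualization. The data here is random, purely for illustration.
X = np.random.default_rng(0).normal(size=(2000, 200))
X_reduced = PCA(n_components=50).fit_transform(X)
embedding = TSNE(n_components=2, perplexity=30).fit_transform(X_reduced)
print(embedding.shape)   # (2000, 2)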

Final Thoughts

People might say UMAP is the “best” dimensionality reduction method, but PCA is faster, more interpretable, and deterministic. The right choice depends on your specific use case: data size, whether you have labels, what structure you need to preserve, and whether you need reproducibility. Start simple with PCA, then try nonlinear methods if needed.

Happy coding, and I hope you UMAP your dimensions correctly.