
Dimensionality Reduction

PCA, LDA, t-SNE, UMAP. Compressing high-dimensional worlds.

Written by Omansh
14 min read

By now, you've seen datasets with 10, 50, maybe even 100 features. But what does that look like? How do you visualize a point in 100-dimensional space? How do you understand the structure of data when you can't even plot it?

The answer is that you can't. At least not directly. Humans are stuck in 3D, and even 3D is pushing it for most visualizations. This is where dimensionality reduction comes in—taking high-dimensional data and finding a lower-dimensional representation that preserves what's important.

Today I'm building four fundamental approaches from scratch: PCA (variance-based), LDA (class-separation-based), t-SNE (neighborhood-based), and UMAP (topology-based). Each takes a different philosophy on what's important to preserve when squashing down dimensions.

The Dimensionality Reduction Problem

At its core, dimensionality reduction is a mapping problem. You have data in d dimensions, and you want to find a representation in k dimensions (k < d) that preserves some important property of the data.

The Formal Problem

Given X ∈ ℝⁿˣᵈ (n samples, d features), find a transformation f such that Y = f(X) where Y ∈ ℝⁿˣᵏ, and Y preserves some property of X.

The challenge is that you're throwing away information—going from d dimensions to k dimensions means losing (d−k) dimensions worth of data. The art is in losing the right information.
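To make the shapes concrete, here is a minimal sketch (a toy example of my own, where a random matrix stands in for whatever transformation f a real method would learn):

import numpy as np

# Toy illustration of the mapping problem: n=500 samples, d=100 features,
# reduced to k=2. The random matrix W is just a placeholder for f.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))   # X ∈ ℝⁿˣᵈ
W = rng.normal(size=(100, 2))     # a deliberately arbitrary linear map
Y = X @ W                         # Y ∈ ℝⁿˣᵏ
print(X.shape, Y.shape)           # (500, 100) (500, 2)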

Why Reduce Dimensions?

More than just visualization

Visualization

Humans can't visualize beyond 3D. Reducing to 2D or 3D lets us see patterns, clusters, and outliers that are invisible in high dimensions.

Computational Efficiency

Algorithms scale badly with dimensions. Training models, computing distances, and running clustering all speed up dramatically with fewer features.

Curse of Dimensionality

In high dimensions, distances concentrate: every point ends up roughly the same distance from every other point, so "nearest" and "farthest" neighbors barely differ. A 100D unit hypercube has more than 99.99% of its volume out in its corners, far from the center. Distance-based methods like KNN stop working.

Noise Reduction

Often, high-dimensional data lives on a lower-dimensional manifold. The extra dimensions are just noise. Finding the underlying manifold removes noise.
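To see the curse of dimensionality numerically, here is a quick sanity check (a toy experiment of my own, not from the original source): sample uniform points and compare the nearest and farthest distances from a single point as d grows.

import numpy as np

# Distance concentration: the relative gap between nearest and farthest
# neighbor shrinks as dimensionality grows.
rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(1000, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)     # distances from point 0
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  (max - min) / min ≈ {contrast:.2f}")
# The contrast is large in low dimensions and collapses toward zero in high
# dimensions: nearest and farthest neighbors become nearly indistinguishable.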

Linear vs Nonlinear Methods

Linear methods (PCA, LDA) find linear projections—they multiply your data by a matrix. They're fast and well-understood but can only capture linear relationships.

Nonlinear methods (t-SNE, UMAP) can capture curved, twisted manifolds. They're more powerful but slower and harder to interpret. The right choice depends on your data structure and what you're optimizing for.

PCA: Maximizing Variance

Finding directions of maximum variance

Principal Component Analysis is the workhorse of dimensionality reduction. The idea is beautifully simple: find directions in feature space where the data varies the most, and project onto those directions.

The Intuition

Imagine data on people's heights and weights. These are correlated—tall people tend to be heavier. If you plot them, you get an elongated ellipse. The first principal component points along the long axis (maximum variance). The second points along the short axis (perpendicular, second-most variance).

Interactive: PCA Visualization

Drag the slider to rotate the projection axis. PC1 captures maximum variance.


PCA finds the directions of maximum variance by analyzing the covariance matrix. The covariance matrix captures how features vary together. If two features tend to increase together, they have positive covariance. If one increases when the other decreases, negative covariance. If they vary independently, zero covariance.

What most people miss is that the covariance matrix encodes the shape of your data. If your data forms an elongated ellipse, the covariance matrix captures the direction of elongation. The eigenvectors point along the axes of the ellipse; the eigenvalues tell you how stretched it is in each direction.
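Here is a quick synthetic check of that claim, using made-up height/weight data purely for illustration:

import numpy as np

# Correlated 2D data: the covariance matrix's eigenvectors point along the
# axes of the data ellipse, and the eigenvalues measure how stretched it is.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=500)
weight = 70 + 0.9 * (height - 170) + rng.normal(0, 5, size=500)
X = np.column_stack([height, weight])

C = np.cov(X, rowvar=False)                    # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)  # eigh: C is symmetric
print(C)             # positive off-diagonal: height and weight co-vary
print(eigenvalues)   # one small, one large: short axis vs long axis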

The PCA Algorithm

1. Center: Subtract the mean from each feature so the data is centered at the origin.
2. Covariance: Compute the covariance matrix C = (1/n)X̄ᵀX̄.
3. Eigendecomposition: Find the eigenvectors and eigenvalues of C.
4. Sort: Sort the eigenvectors by eigenvalue (descending).
5. Select: Keep the top k eigenvectors as principal components.
6. Project: Multiply the centered data by the component matrix.
pca.py
import numpy as np

class PCA:
    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        # Step 1: Center the data
        self.mean_ = np.mean(X, axis=0)
        X_centered = X - self.mean_

        # Step 2: Compute covariance matrix
        cov_matrix = np.cov(X_centered.T)

        # Step 3: Eigendecomposition (eigh, since the covariance matrix is symmetric)
        eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

        # Step 4: Sort by eigenvalue (descending)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]

        # Step 5: Keep top k components (stored as rows)
        self.components_ = eigenvectors[:, :self.n_components].T
        self.explained_variance_ = eigenvalues[:self.n_components]

        return self

    def transform(self, X):
        # Step 6: Project onto principal components
        X_centered = X - self.mean_
        return X_centered @ self.components_.T
Maximize: wᵀCw subject to ||w|| = 1

The direction w that maximizes this is the eigenvector of C with the largest eigenvalue.
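A short usage sketch of the class above on synthetic data (the redundant-feature setup is mine, just for illustration):

import numpy as np

# Five features, two of which are near-copies of others, so most of the
# variance lives in a lower-dimensional subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)
X[:, 4] = X[:, 1] + 0.1 * rng.normal(size=200)

pca = PCA(n_components=2)          # the PCA class defined above
Y = pca.fit(X).transform(X)
print(Y.shape)                     # (200, 2)
print(pca.explained_variance_)     # variance captured by each component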

Interactive: Explained Variance (Scree Plot)

Choose enough components to capture ~90-95% of variance

The scree plot shows the variance explained by each principal component, both individually and cumulatively. Look for the elbow where additional components stop adding much, or keep components until the cumulative curve crosses your threshold.
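In code, that rule is just a cumulative sum over the sorted eigenvalues; the eigenvalues below are made up for illustration:

import numpy as np

# Pick the smallest k whose cumulative explained variance crosses a threshold.
eigenvalues = np.array([4.2, 2.1, 1.1, 0.6, 0.4, 0.3, 0.2, 0.1])  # example values
ratios = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratios)
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(np.round(cumulative, 2))   # [0.47 0.7  0.82 0.89 0.93 0.97 0.99 1.  ]
print(k)                         # 6 components to reach 95% variance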

PCA Limitations

PCA is unsupervised—it doesn't know about class labels. If variance isn't aligned with what you care about (like class separation), PCA might not help. It also assumes linear relationships and can't capture curved manifolds.

LDA: Maximizing Class Separation

Supervised dimensionality reduction

Linear Discriminant Analysis is like PCA's supervised cousin. Instead of maximizing variance, it maximizes the separation between classes while minimizing variance within classes. This makes it perfect for preprocessing before classification.

The Intuition

Imagine two classes of points. PCA might find that the direction of maximum variance goes right through both classes, mixing them together. LDA finds the direction that best separates the classes—pulling class means apart while keeping each class tight.

Interactive: LDA vs PCA Class Separation

LDA maximizes class separation while PCA just finds maximum variance

Notice how LDA separates the classes better, while PCA might mix them.

LDA optimizes the ratio of between-class scatter to within-class scatter. Between-class scatter measures how far apart the class means are. Within-class scatter measures how spread out points are within each class. We want high between-class scatter (classes separated) and low within-class scatter (classes compact).

Maximize: J(w) = (wᵀSᴮw) / (wᵀSᵂw)

Sᴮ = between-class scatter, Sᵂ = within-class scatter

lda.py
import numpy as np
from scipy.linalg import eigh   # generalized eigenvalue solver

class LDA:
    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X, y):
        classes = np.unique(y)

        # Compute class means and global mean
        self.mean_ = np.mean(X, axis=0)
        self.class_means_ = {}
        for c in classes:
            self.class_means_[c] = np.mean(X[y == c], axis=0)

        # Within-class scatter matrix
        within_class_scatter = np.zeros((X.shape[1], X.shape[1]))
        for c in classes:
            X_c = X[y == c]
            within_class_scatter += np.cov(X_c.T) * (len(X_c) - 1)

        # Between-class scatter matrix
        between_class_scatter = np.zeros((X.shape[1], X.shape[1]))
        for c in classes:
            n_c = np.sum(y == c)
            mean_diff = (self.class_means_[c] - self.mean_).reshape(-1, 1)
            between_class_scatter += n_c * (mean_diff @ mean_diff.T)

        # Solve generalized eigenvalue problem: Sb v = λ Sw v
        eigenvalues, eigenvectors = eigh(between_class_scatter, within_class_scatter)

        # Take top components (at most n_classes - 1 are meaningful)
        idx = np.argsort(eigenvalues)[::-1]
        self.components_ = eigenvectors[:, idx[:self.n_components]].T

        return self
Dimensionality Constraint: LDA can find at most (k−1) discriminants for k classes. For binary classification, you get exactly one dimension—the line that best separates the two classes.
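Here is a usage sketch of the class above on two synthetic Gaussian blobs (my own toy data); with two classes, n_components can only be 1:

import numpy as np

# Two 3D classes separated along a diagonal direction.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0, 0], scale=1.0, size=(100, 3)),
    rng.normal(loc=[3, 3, 0], scale=1.0, size=(100, 3)),
])
y = np.array([0] * 100 + [1] * 100)

lda = LDA(n_components=1)      # two classes → at most one discriminant
lda.fit(X, y)
Y = X @ lda.components_.T      # project onto the discriminant direction
print(Y.shape)                 # (200, 1)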

t-SNE: Preserving Local Structure

Neighborhood preservation through probability

t-distributed Stochastic Neighbor Embedding takes a completely different approach. Instead of linear projections, it uses gradient descent to position points in low dimensions such that nearby points in high dimensions stay nearby, and far points stay far.

The Intuition

t-SNE models pairwise similarities. In high-D space, it computes “how similar is point i to point j?” as a probability. In low-D space, it tries to match these probabilities. The algorithm iteratively adjusts positions to minimize the difference between high-D and low-D similarity distributions.

Interactive: t-SNE Unrolling Manifolds

Compare how PCA vs t-SNE handle non-linear structure


PCA (linear) projects to the plane of maximum variance, but this mixes up points that are far apart along the manifold. Notice how different colors overlap—they shouldn't!

High-Dimensional Similarities: For each point i, we compute a probability distribution over all other points based on Gaussian similarity. The bandwidth σᵢ is chosen for each point to match a target perplexity—roughly the “effective number of neighbors.”

pⱼ|ᵢ = exp(−||xᵢ − xⱼ||² / 2σᵢ²) / Σₖ≠ᵢ exp(−||xᵢ − xₖ||² / 2σᵢ²)

Conditional probability that point i would pick point j as a neighbor
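The σᵢ search isn't shown in the code below, so here is one common way to do it (my own sketch, with a hypothetical helper name): binary-search σ until the entropy of pⱼ|ᵢ matches log₂(perplexity).

import numpy as np

def find_sigma(dist_sq_i, target_perplexity, n_steps=50):
    """Binary-search σᵢ so that 2^entropy(p_{j|i}) ≈ target perplexity.

    dist_sq_i: squared distances from point i to every other point (self excluded).
    """
    lo, hi = 1e-10, 1e4
    target_entropy = np.log2(target_perplexity)
    for _ in range(n_steps):
        sigma = (lo + hi) / 2
        p = np.exp(-dist_sq_i / (2 * sigma ** 2))
        p = p / (p.sum() + 1e-12)
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if entropy > target_entropy:
            hi = sigma   # too flat (too many effective neighbors): shrink σ
        else:
            lo = sigma   # too peaked (too few neighbors): grow σ
    return sigma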

Low-Dimensional Similarities: In the embedding, we use a t-distribution (Student's t with 1 degree of freedom) instead of Gaussian. The heavier tails help avoid the “crowding problem” where dissimilar points get squeezed together.

qᵢⱼ = (1 + ||yᵢ − yⱼ||²)⁻¹ / Σₖ≠ₗ (1 + ||yₖ − yₗ||²)⁻¹

t-distribution has heavier tails than Gaussian—allows distant points to be far
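For reference, the Q matrix that the code below gets from _compute_q_distribution can be written in a few lines (this is my reconstruction of that helper, not the post's exact source):

import numpy as np

def compute_q_distribution(Y):
    # Pairwise Student-t similarities in the low-D embedding.
    diff = Y[:, np.newaxis, :] - Y[np.newaxis, :, :]   # (n, n, k) differences
    dist_sq = np.sum(diff ** 2, axis=2)                # squared distances
    inv = 1.0 / (1.0 + dist_sq)                        # heavy-tailed kernel
    np.fill_diagonal(inv, 0.0)                         # exclude i == j
    return inv / inv.sum()                             # normalize over all pairs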

The objective is to minimize the KL divergence between P (high-D) and Q (low-D): KL(P‖Q) = Σᵢ Σⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ).

tsne.py
def fit_transform(self, X):
    n = len(X)

    # Compute high-D pairwise similarities (with perplexity)
    P = self._compute_pairwise_affinities(X, self.perplexity)
    P = (P + P.T) / (2 * n)  # Symmetrize

    # Initialize embedding randomly
    Y = np.random.randn(n, self.n_components) * 0.01

    # Gradient descent
    for iteration in range(self.n_iter):
        # Compute low-D similarities (t-distribution)
        Q = self._compute_q_distribution(Y)

        # Compute gradient: ∂KL/∂yᵢ = 4 Σⱼ (pᵢⱼ - qᵢⱼ)(yᵢ - yⱼ)(1 + ||yᵢ-yⱼ||²)⁻¹
        gradients = np.zeros_like(Y)
        for i in range(n):
            diff = Y[i] - Y                      # (n, n_components)
            dist_sq = np.sum(diff ** 2, axis=1)
            q_factor = 1 / (1 + dist_sq)
            pq_diff = (P[i] - Q[i]) * q_factor
            gradients[i] = 4 * np.sum(pq_diff[:, np.newaxis] * diff, axis=0)

        # Update positions
        Y = Y - self.learning_rate * gradients

    return Y

t-SNE Quirks

t-SNE is powerful but has caveats: distances between clusters are meaningless—cluster A being close to B doesn't mean they're similar. It's not deterministic—different runs give different embeddings. And it's slow—O(n²) for pairwise similarities.

UMAP: Topology-Preserving Embedding

Faster, better global structure

UMAP (Uniform Manifold Approximation and Projection) arrived in 2018. Like t-SNE, it preserves local structure, but it's faster, preserves more global structure, and has better theoretical foundations in topology and manifold theory.

The Intuition

UMAP assumes your data lies on a manifold embedded in high-dimensional space. It constructs a fuzzy k-nearest-neighbor graph representing the manifold topology, then finds a low-dimensional layout of that graph. Edge weights represent “fuzzy membership”—not binary in/out, but partial membership strength.

Interactive: UMAP Cluster Separation

UMAP finds 2D layouts preserving cluster structure from 3D


Fuzzy Membership: Instead of “point j is a neighbor of i” (yes/no), UMAP says “point j has 0.8 membership in i's neighborhood.” Closer points get higher membership; distant points get membership near 0. The membership function adapts to each point's local density.

umap.py
def fit_transform(self, X):
    n = len(X)

    # Step 1: Build fuzzy k-NN graph.
    # For each point: find its k nearest neighbors, a local radius ρ
    # (distance to the nearest neighbor), and a local scale σ
    # (from a perplexity-like target).
    graph = np.zeros((n, n))

    for i in range(n):
        distances = compute_distances(X[i], X)
        neighbors = np.argsort(distances)[1:self.n_neighbors + 1]

        rho = distances[neighbors[0]]  # Distance to nearest neighbor
        sigma = binary_search_for_sigma(distances, neighbors)

        # Fuzzy membership: exp(-max(0, d - ρ) / σ)
        for j in neighbors:
            weight = np.exp(-max(0, distances[j] - rho) / sigma)
            graph[i, j] = weight

    # Symmetrize via fuzzy union: a + b - a*b
    graph = graph + graph.T - graph * graph.T

    # Step 2: Optimize low-D layout (minimize cross-entropy)
    Y = initialize_embedding(n)

    for epoch in range(n_epochs):
        # Attraction: pull connected points together
        # Repulsion: push disconnected points apart
        Y = gradient_step(Y, graph, learning_rate)

    return Y
Parameter     Effect                           Default
n_neighbors   Local vs global structure        15
min_dist      How tightly points cluster       0.1
metric        Distance metric for neighbors    euclidean
Why UMAP is faster: O(n log n) with approximate nearest neighbors vs O(n²) for t-SNE. Why better global structure: Cross-entropy objective preserves more inter-cluster relationships than KL divergence. Distances between clusters are more meaningful.
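If you reach for the off-the-shelf implementation instead of the from-scratch sketch, the same three parameters appear directly in the umap-learn package (assuming it is installed; the data here is random and only illustrates the call):

import numpy as np
import umap   # the umap-learn package

# The table's parameters map one-to-one onto the library's constructor.
X = np.random.default_rng(0).normal(size=(1000, 50))
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="euclidean")
embedding = reducer.fit_transform(X)   # 2D by default
print(embedding.shape)                 # (1000, 2)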

Comparing the Four

Each algorithm has different strengths and assumptions. The best choice depends on your data structure, whether you have labels, and what you want to preserve.

Method   Type       Preserves             Speed      Labels?
PCA      Linear     Variance              Very Fast  No
LDA      Linear     Class Separation      Very Fast  Yes
t-SNE    Nonlinear  Local Neighborhoods   Slow       No
UMAP     Nonlinear  Topology              Fast       No*

*UMAP also has an optional supervised mode that can use labels.

PCA

  • ✓ Fast, deterministic, interpretable
  • ✓ Components have meaning (variance directions)
  • ✓ Inverse transform exists
  • ✗ Only captures linear structure
  • ✗ Ignores class labels

LDA

  • ✓ Uses class labels for better separation
  • ✓ Great preprocessing for classification
  • ✓ Fast and deterministic
  • ✗ Assumes Gaussian classes
  • ✗ Limited to (k−1) dimensions for k classes

t-SNE

  • ✓ Excellent cluster visualization
  • ✓ Captures nonlinear structure
  • ✓ Reveals hidden clusters
  • ✗ Slow (O(n²))
  • ✗ Non-deterministic
  • ✗ Global distances meaningless

UMAP

  • ✓ Fast (O(n log n))
  • ✓ Better global structure than t-SNE
  • ✓ Scales to millions of points
  • ✓ Theoretical foundation
  • ✗ Hyperparameter sensitive
  • ✗ Can create false structure

When to Use What

Use PCA when...

  • You need a fast, interpretable baseline
  • Data is roughly linear
  • You want to preserve global variance
  • You need inverse transform (reconstruction)
  • Preprocessing for another algorithm

Use LDA when...

  • You have class labels and care about classification
  • You want maximum class separation
  • Data is roughly Gaussian per class
  • Preprocessing for a classifier

Use t-SNE when...

  • Visualization is the primary goal
  • You want to reveal cluster structure
  • Dataset is < 10,000 points
  • You're okay with non-deterministic results

Use UMAP when...

  • Large dataset (t-SNE would be too slow)
  • You need some global structure preservation
  • Preparing for downstream clustering
  • You want faster iterations with hyperparameters
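One pattern that combines several of these recommendations (a hedged sketch using scikit-learn, not something from the original post): use PCA as a fast preprocessing step, then run a nonlinear method on the reduced data.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA first to strip noisy dimensions and speed things up, then t-SNE for the
# final 2D visualization. The data here is random, purely for illustration.
X = np.random.default_rng(0).normal(size=(2000, 200))
X_reduced = PCA(n_components=50).fit_transform(X)
embedding = TSNE(n_components=2, perplexity=30).fit_transform(X_reduced)
print(embedding.shape)   # (2000, 2)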

Final Thoughts

People might say UMAP is the “best” dimensionality reduction method, but PCA is faster, more interpretable, and deterministic. The right choice depends on your specific use case: data size, whether you have labels, what structure you need to preserve, and whether you need reproducibility. Start simple with PCA, then try nonlinear methods if needed.

Happy coding, and I hope you UMAP your dimensions correctly.