Supervised · Classification · Regression

K-Nearest Neighbors

Proximity-based learning. You are the average of your neighbors.

Written by Omansh
7 min read

All of the models we've worked with involve a training phase where the algorithm learns parameters or builds a structure. But what if I told you there's an algorithm that doesn't really “learn” anything during training?

K-Nearest Neighbors (KNN) is beautifully simple. In order to classify a new point, just look at the k closest training examples and let them vote. No weights to optimize, no trees to grow, no parameters to tune beyond choosing k itself. It's one of the most intuitive algorithms in machine learning, and yet it's surprisingly powerful.

The Intuition

The assumption behind KNN is simple: similar things should be close together in feature space. If you want to know what a new data point is, look at its neighbors. Birds of a feather flock together.

For Classification

Find the k nearest neighbors, see which class is most common among them, and predict that class.

For Regression

Find the k nearest neighbors, average their values, and predict that average.
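
Before any formalism, here is a minimal numpy sketch of that idea for a single query point (the function name and defaults are illustrative; the full from-scratch implementation appears later in the lesson):

import numpy as np

def knn_predict_one(x, X_train, y_train, k=3, task="classification"):
    """Minimal sketch: classify or regress one point from its k nearest neighbors."""
    # Euclidean distance from x to every training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    if task == "classification":
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]         # majority vote
    return y_train[nearest].mean()               # average of neighbor values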

Interactive KNN Classification

(Interactive demo: click anywhere to place a test point and watch its k nearest neighbors vote between Class 0 and Class 1.)

Distance Metrics

Measuring “Nearness”

The entire algorithm hinges on one question: what does “nearest” mean? We need a way to measure distance between points in feature space.

Euclidean Distance

The most common metric is Euclidean distance, which is just the straight-line distance you learned in geometry:

d(x₁, x₂) = √( Σᵢ (x₁ᵢ − x₂ᵢ)² )

Square root of sum of squared differences

Manhattan Distance

Manhattan distance (also called L1 distance or taxicab distance) measures distance as if you're walking on a grid of city blocks:

d(x₁, x₂) = Σᵢ |x₁ᵢ − x₂ᵢ|

Sum of absolute differences

Minkowski Distance

Minkowski distance is a generalization that includes both Euclidean and Manhattan as special cases:

d(x₁, x₂) = ( Σᵢ |x₁ᵢ − x₂ᵢ|ᵖ )^(1/p)

Generalized distance with parameter p

  • p = 1: Manhattan distance
  • p = 2: Euclidean distance
  • p → ∞: Chebyshev distance (max absolute difference)
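
As a quick sanity check, here is a small numpy sketch computing each metric for one pair of points (the points themselves are made up for illustration):

import numpy as np

x1 = np.array([1.0, 2.0])
x2 = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((x1 - x2) ** 2))            # p = 2   -> 5.0
manhattan = np.sum(np.abs(x1 - x2))                    # p = 1   -> 7.0
p = 3
minkowski = np.sum(np.abs(x1 - x2) ** p) ** (1 / p)    # general p -> ~4.5
chebyshev = np.max(np.abs(x1 - x2))                    # p -> inf -> 4.0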

Distance Metrics Comparison

(Interactive demo: compare how the Euclidean, Manhattan, and Minkowski metrics measure the distance between two points x₁ and x₂, and how each metric's unit "circle" (d = 1) looks.)

The parameter p controls how much we penalize differences in individual dimensions. Higher p means we care more about the single largest difference.

Metric | When to Use | Properties
Euclidean (p = 2) | Default choice, continuous features | Sensitive to outliers, assumes comparable scales
Manhattan (p = 1) | High dimensions, different feature types | Less sensitive to outliers, good for grid-like data
Minkowski (p > 2) | Custom tuning needed | Flexible, approaches the max difference as p increases

Feature Scaling Matters

Distance-based algorithms are extremely sensitive to feature scales. If one feature ranges from 0–1000 and another from 0–1, the first will dominate distance calculations. Always standardize your features before applying KNN.
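
A minimal sketch of that standardization, assuming plain numpy arrays (scikit-learn's StandardScaler does the same thing with a nicer interface):

import numpy as np

def standardize(X_train, X_test):
    """Rescale each feature to zero mean and unit variance using training statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-12    # guard against zero-variance features
    return (X_train - mean) / std, (X_test - mean) / std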

The KNN Algorithm: Step by Step

Beautifully Lazy

The training phase is dead simple: we literally just store the training data. This is called “lazy learning” or “instance-based learning” because all the work happens at prediction time.

knn_fit.py
def fit(self, X, y):
    """Store training data. That's it."""
    self.X_train = np.array(X)
    self.y_train = np.array(y).flatten()
    self.n_classes_ = len(np.unique(self.y_train))
    return self

The prediction phase is where all the computation happens:

1. Calculate distances to all training points
2. Sort by distance
3. Select the k nearest neighbors
4. Vote (classification) or average (regression)

knn_neighbors.py
def _get_neighbors(self, x):
    """Get k nearest neighbors for a single sample."""
    distances = []
    for i, x_train in enumerate(self.X_train):
        dist = self._calculate_distance(x, x_train)
        distances.append((dist, i))

    # Sort by distance and get k nearest
    distances.sort(key=lambda item: item[0])
    k_nearest = distances[:self.n_neighbors]
    return k_nearest

def _calculate_distance(self, x1, x2):
    """Calculate distance based on chosen metric."""
    if self.metric == 'euclidean':
        return np.sqrt(np.sum((x1 - x2) ** 2))
    elif self.metric == 'manhattan':
        return np.sum(np.abs(x1 - x2))
    else:  # minkowski
        return np.power(np.sum(np.abs(x1 - x2) ** self.p), 1 / self.p)
knn_predict.py
from collections import Counter

def predict(self, X):
    """Predict class labels for samples in X."""
    X = np.array(X)
    predictions = []

    for x in X:
        neighbors = self._get_neighbors(x)
        neighbor_labels = [self.y_train[idx] for _, idx in neighbors]

        # Majority vote
        most_common = Counter(neighbor_labels).most_common(1)[0][0]
        predictions.append(most_common)

    return np.array(predictions)
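
For comparison, the same workflow with scikit-learn's KNeighborsClassifier, which implements the same algorithm and makes a handy sanity check for a from-scratch version (the toy data below is made up):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1, 2], [2, 1], [8, 9], [9, 8]])
y_train = np.array([0, 0, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
clf.fit(X_train, y_train)
print(clf.predict([[1.5, 1.5], [8.5, 8.5]]))   # expected: [0 1]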

Choosing K: The Bias-Variance Tradeoff

The Only Hyperparameter

The hyperparameter k controls the bias-variance tradeoff in KNN:

k = 1 (very small)

High variance, low bias. The model is extremely sensitive to noise and outliers. Decision boundaries are jagged and can overfit.

k = moderate (5-20)

Balanced variance and bias. Smooth enough to ignore noise, flexible enough to capture patterns.

k = n (very large)

Low variance, high bias. Predicts the majority class everywhere. Extremely smooth but loses all local information.

Bias-Variance Tradeoff in KNN

(Interactive demo: adjust k and watch the decision boundary change. k = 1 risks overfitting, k = 15 risks underfitting, and k = 5 is a balanced sweet spot: smooth enough to ignore noise, but flexible enough to capture the true pattern.)

A good starting point is k = √n, where n is the number of training samples; use cross-validation to tune from there. For binary classification, prefer odd values of k to avoid ties. In general, use a larger k for noisy data and a smaller k when you have plenty of clean data.
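
One way to do that tuning is to cross-validate over a handful of candidate k values. A short sketch using scikit-learn on synthetic data (the dataset and the candidate values here are arbitrary):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data purely for illustration
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

for k in [1, 3, 5, 9, 15, 25]:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"k={k:>2}  mean CV accuracy = {score:.3f}")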

Weighted Voting

Not All Neighbors Are Equal

The basic KNN treats all k neighbors equally. But intuitively, closer neighbors should matter more than farther ones.

Uniform Weighting (default)

Each of the k neighbors gets one vote, regardless of distance (e.g., weights 1, 1, 1, 1, 1).

Distance Weighting

Closer neighbors get more weight: weight = 1/distance (e.g., weights 5, 3, 2, 1, 0.5).
knn_weighted.py
def predict_weighted(self, X):
    """Predict using distance-weighted voting."""
    X = np.array(X)
    predictions = []

    for x in X:
        neighbors = self._get_neighbors(x)

        # Weight by inverse distance
        class_weights = {}
        for dist, idx in neighbors:
            label = self.y_train[idx]
            weight = 1 / (dist + 1e-8)  # Avoid division by zero
            class_weights[label] = class_weights.get(label, 0) + weight

        # Predict class with highest total weight
        prediction = max(class_weights, key=class_weights.get)
        predictions.append(prediction)

    return np.array(predictions)
Distance weighting helps when k is large but you still want nearby points to dominate, when the nearest neighbor is much closer than the rest, or when you want smoother probability estimates.
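
If you don't want to roll your own, scikit-learn exposes the same idea through its weights parameter:

from sklearn.neighbors import KNeighborsClassifier

# weights="uniform" is the default; "distance" weights each vote by 1/distance
clf = KNeighborsClassifier(n_neighbors=10, weights="distance")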

The Curse of Dimensionality

The Achilles' Heel

Here's where KNN runs into serious problems: high-dimensional spaces. As dimensions increase, distances become less meaningful.

The Problem

  • All points become roughly equidistant from each other
  • The concept of “nearest” neighbor breaks down
  • You need exponentially more data to maintain the same density

This is because in d dimensions, the volume of a hypersphere grows as rᵈ. To maintain the same density of points, you need exponentially more samples as d increases.
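
You can see the distance concentration directly with a quick simulation on random data (illustrative only): as the dimension grows, the farthest and nearest neighbors of a query point end up nearly the same distance away.

import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.random((1000, d))                 # 1,000 random points in [0, 1]^d
    query = rng.random(d)
    dists = np.linalg.norm(X - query, axis=1)
    print(f"d={d:>4}  farthest/nearest distance ratio = {dists.max() / dists.min():.2f}")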

Points Needed to Cover Space (spacing = 0.1)

Dimensions | Points Needed | Shape
1D | 10 | Line
2D | 100 | Square
3D | 1,000 | Cube
10D | 10¹⁰ | Hypercube

Solutions

Consider dimensionality reduction (PCA, t-SNE) before applying KNN, or use feature selection to remove irrelevant dimensions. Both approaches help make the “nearest” concept meaningful again.
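
One common recipe is a pipeline that standardizes, projects onto a few principal components, and then runs KNN. A sketch with scikit-learn, where the number of components is an assumption you would tune for your own data:

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

knn_pca = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),       # assumed value; tune for your dataset
    KNeighborsClassifier(n_neighbors=5),
)
# knn_pca.fit(X_train, y_train); knn_pca.predict(X_test)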

When to Use KNN

KNN is often used as a baseline because it's so simple to implement and understand. If a more complex model can't beat KNN, something's probably wrong.

KNN Shines When

  • Small to medium-sized datasets
  • Low-dimensional features (< 20)
  • Decision boundaries are irregular
  • You need interpretable predictions
  • Training time doesn't matter
  • Semi-supervised learning scenarios

Avoid KNN When

  • High-dimensional data
  • You need fast predictions
  • Your dataset is huge (millions+)
  • Features are on different scales
  • Your data is sparse
  • More noise than signal

Beyond the Basics

KNN's simplicity is both its strength and weakness. Here are some techniques to address its limitations and scale it to real-world problems:

1. Ball Trees & KD-Trees: Data structures that enable O(log n) neighbor searches instead of O(n). Essential for large datasets (see the sketch after this list).

2. Approximate Nearest Neighbors: Libraries like Annoy, Faiss, and ScaNN sacrifice some accuracy for massive speedups on million-scale datasets.

3. Locality Sensitive Hashing: Hash functions that map similar items to the same buckets, enabling sublinear search time.

4. Weighted Features: Not all features are equally important. Learn feature weights through cross-validation or gradient descent.

5. Kernel KNN: Use kernel functions to implicitly map data to higher dimensions where it might be more separable.
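
As a concrete example of the first item, scikit-learn lets you choose the neighbor-search structure via its algorithm parameter (approximate-search libraries like Annoy and Faiss have their own APIs):

from sklearn.neighbors import KNeighborsClassifier

# "auto" (the default) picks a structure based on the data; you can also force one
knn_kd = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn_ball = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree")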

Happy coding, and keep your neighbors close.