This one was a bit math heavy, but here we go. Support Vector Machines (SVMs) are one of the most elegant algorithms in machine learning. While we've built classifiers that find a decision boundary, SVMs find the best decision boundary—one with maximum margin from both classes.
The beauty of SVMs lies in their mathematical foundation. They transform a seemingly simple geometric idea into a sophisticated optimization problem, then use a clever trick (the kernel trick) to handle non-linear patterns. Understanding SVMs deeply means understanding convex optimization, Lagrange multipliers, and duality theory.
Let's build an SVM from scratch and work through all the math. It's going to get intense, but I'll try my best to make it as intuitive as possible.
The Maximum Margin Intuition
Finding the optimal decision boundary
Imagine you have two classes of data points that are linearly separable. There are infinitely many lines you could draw to separate them. Which one should you choose?
SVMs choose the line that's as far as possible from the nearest points of both classes. This distance is called the margin, and the line that maximizes it is the maximum margin hyperplane.
Why maximize the margin?
Intuitively, a larger margin means the classifier is more “confident” and should generalize better to new data. Points near the decision boundary are uncertain; staying far from them reduces the chance of misclassification due to noise or variation.
Interactive: Maximum Margin Hyperplane
Support Vectors Are Key
The points that lie exactly on the margin boundaries are called support vectors. These are the critical points that completely determine the decision boundary. You could remove all other points and get the same hyperplane. This is why the algorithm is called Support Vector Machine.
The Primal Optimization Problem
Formalizing the optimization
Let's formalize this. We have training data {(x₁, y₁), …, (xₙ, yₙ)} where xᵢ ∈ ℝᵈ and yᵢ ∈ {−1, +1}.
We want to find a hyperplane defined by:
w · x + b = 0, where w is the normal vector and b is the bias.
For a point xᵢ, the signed distance to the hyperplane is:
dᵢ = (w · xᵢ + b) / ||w||   (the signed distance from xᵢ to the hyperplane)
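As a quick sanity check, here's a minimal sketch that evaluates this distance for a handful of toy points against a made-up hyperplane (w, b, and the points are illustrative assumptions, not fitted values):

import numpy as np

# Illustrative hyperplane and toy points (not fitted values)
w = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [-1.0, 1.5]])
y = np.array([1, 1, -1, -1])

# Signed distance of each point to the hyperplane w·x + b = 0
dist = (X @ w + b) / np.linalg.norm(w)

# Multiplying by y_i flips the sign so correctly classified points come out positive
margin = y * dist
print(dist)
print(margin)
print("closest points:", np.argsort(margin)[:2])  # candidates for support vectors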
For the hyperplane to correctly classify all points with margin γ, we need yᵢ(w · xᵢ + b) / ||w|| ≥ γ for every i.
The yᵢ ensures we're measuring distance on the correct side (positive for +1 class, negative for -1 class). By convention, we scale w and b so that the closest points (support vectors) satisfy yᵢ(w · xᵢ + b) = 1.
This sets γ = 1/||w||, so maximizing the margin becomes:

minimize (1/2)||w||²   subject to   yᵢ(w · xᵢ + b) ≥ 1 for all i
The (1/2) is for mathematical convenience (makes the derivative cleaner). This is a convex quadratic programming problem with linear inequality constraints.
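To make that concrete, here's a hedged sketch that hands this exact program to a generic solver (scipy.optimize.minimize with SLSQP) on a four-point toy set. Real SVM libraries don't train this way; the point is only to show that the problem really is a small constrained minimization. The data and starting point are assumptions.

import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (assumed for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)            # (1/2)||w||^2

def margin_constraints(params):
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1           # each entry must be >= 0

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),
               constraints=[{"type": "ineq", "fun": margin_constraints}],
               method="SLSQP")
w_opt, b_opt = res.x[:2], res.x[2]
print("w:", w_opt, "b:", b_opt, "margin:", 1 / np.linalg.norm(w_opt))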
Soft Margin: Allowing Mistakes
Handling real-world data
Real data isn't always linearly separable. We need to allow some points to be on the wrong side of the margin or even misclassified. We introduce slack variables ξᵢ ≥ 0 for each point:
ξᵢ = 0: point correctly classified with full margin.
0 < ξᵢ ≤ 1: correctly classified but inside the margin.
ξᵢ > 1: misclassified (on the wrong side of the boundary).
Now we want to minimize margin violations while still maximizing the margin:

minimize (1/2)||w||² + C Σξᵢ   subject to   yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i
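A useful identity: for fixed w and b, the optimal slack is ξᵢ = max(0, 1 − yᵢ(w · xᵢ + b)), which is exactly the hinge loss. That means the soft-margin objective can be evaluated directly, as in this small sketch (the weights and data are made-up illustrative values):

import numpy as np

def soft_margin_objective(w, b, X, y, C):
    # Optimal slack for fixed (w, b) is the hinge loss max(0, 1 - y_i(w·x_i + b))
    slack = np.maximum(0, 1 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * slack.sum()

# Made-up values, purely for illustration
X = np.array([[2.0, 2.0], [0.5, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
print(soft_margin_objective(np.array([1.0, 1.0]), -1.5, X, y, C=1.0))  # 2.0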
The parameter C controls the tradeoff between margin size and violations:
Large C (> 100)
Few violations allowed, focus on classifying training data correctly. Narrow margin. May overfit.
Small C (< 1)
More violations allowed, focus on maximizing margin. Wider margin. May underfit.
Lagrangian and Duality
The mathematical machinery
Solving the primal problem directly is hard with inequality constraints. This is where Lagrange multipliers and duality theory come in. The Lagrangian function incorporates the constraints via multipliers αᵢ ≥ 0:
# Lagrangian for soft-margin SVM
L(w, b, ξ, α, β) = (1/2)||w||² + C * Σξᵢ
                   - Σαᵢ[yᵢ(w · xᵢ + b) - 1 + ξᵢ]
                   - Σβᵢξᵢ

To find the optimal point, we take derivatives with respect to the primal variables (w, b, ξ) and set them to zero. These are the KKT (Karush-Kuhn-Tucker) stationarity conditions:
∂L/∂w = 0  ⟹  w = Σαᵢyᵢxᵢ   (the weights are a linear combination of support vectors)
∂L/∂b = 0  ⟹  Σαᵢyᵢ = 0   (the sum of weighted labels must be zero)
∂L/∂ξᵢ = 0  ⟹  αᵢ = C − βᵢ, hence 0 ≤ αᵢ ≤ C   (the alphas are bounded by C)
Substituting these back into the Lagrangian gives us the dual problem:

maximize   Σαᵢ − (1/2) Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ (xᵢ · xⱼ)
subject to   0 ≤ αᵢ ≤ C   and   Σαᵢyᵢ = 0
This is beautiful because instead of optimizing over w (d-dimensional), we're optimizing over α (n-dimensional, but sparse). The dual problem only depends on inner products xᵢ·xⱼ—this will be crucial for the kernel trick.
Understanding the Dual Variables
• αᵢ = 0: Point is correctly classified outside the margin (doesn't affect boundary)
• 0 < αᵢ < C: Point is a support vector on the margin boundary
• αᵢ = C: Point violates the margin (bounded support vector)
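Once the dual is solved, α is all you need. Here's a minimal sketch of recovering w from the stationarity condition and predicting with the equivalent dual form. The two-point toy set below is chosen so that α = [0.25, 0.25], b = −1 happens to be the exact solution; treat the numbers as illustrative.

import numpy as np

def recover_weights(alpha, X, y):
    # KKT stationarity: w = sum_i alpha_i * y_i * x_i
    return (alpha * y) @ X

def decision_function(x_new, alpha, X, y, b):
    # Equivalent dual form: f(x) = sum_i alpha_i y_i (x_i · x) + b
    return np.sum(alpha * y * (X @ x_new)) + b

# Two-point toy problem with a known solution
X = np.array([[2.0, 2.0], [0.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])
b = -1.0

print(recover_weights(alpha, X, y))                             # [0.5 0.5]
print(decision_function(np.array([2.0, 2.0]), alpha, X, y, b))  # +1.0, on the margin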
Sequential Minimal Optimization (SMO)
Efficient optimization
Now we need to solve the dual problem. Traditional quadratic programming solvers are slow. The SMO algorithm, invented by John Platt, is much faster.
The key insight: if we fix all α values except two, the problem has an analytical closed-form solution. SMO repeatedly selects pairs of α values and optimizes them jointly.
Why Two Variables?
We can't optimize one α at a time because of the constraint Σαᵢyᵢ = 0; changing a single α would violate it. But with two variables, we can write:

α₁y₁ + α₂y₂ = constant (equal to minus the sum of αᵢyᵢ over all the αs we're holding fixed)
This means α₁ is determined by α₂, giving us a one-dimensional optimization problem.
def smo_algorithm(X, y, C, tol=1e-3, max_passes=5):
    """Sequential Minimal Optimization for SVM."""
    n_samples = len(y)
    alpha = np.zeros(n_samples)
    b = 0
    E = -y.copy()  # Error cache

    passes = 0
    while passes < max_passes:
        num_changed = 0

        for i in range(n_samples):
            # Check KKT conditions for alpha[i]
            Ei = E[i]

            if (y[i] * Ei < -tol and alpha[i] < C) or \
               (y[i] * Ei > tol and alpha[i] > 0):

                # Select j != i (maximize |Ei - Ej|)
                j = select_second_alpha(i, Ei, E)
                Ej = E[j]

                # Save old alphas
                alpha_i_old = alpha[i]
                alpha_j_old = alpha[j]

                # Compute bounds L and H
                if y[i] != y[j]:
                    L = max(0, alpha[j] - alpha[i])
                    H = min(C, C + alpha[j] - alpha[i])
                else:
                    L = max(0, alpha[i] + alpha[j] - C)
                    H = min(C, alpha[i] + alpha[j])

                if L == H:
                    continue

                # Compute eta (second derivative)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:
                    continue

                # Update alpha[j]
                alpha[j] = alpha_j_old - y[j] * (Ei - Ej) / eta
                alpha[j] = np.clip(alpha[j], L, H)

                # Update alpha[i]
                alpha[i] = alpha_i_old + y[i] * y[j] * (alpha_j_old - alpha[j])

                # Update bias and error cache
                b = update_bias(b, alpha, i, j, Ei, Ej, ...)
                E = update_errors(E, alpha, X, y, b)

                num_changed += 1

        if num_changed == 0:
            passes += 1
        else:
            passes = 0

    return alpha, b

The analytical solution for updating α₂ is:
α₂ ← α₂ − y₂(E₁ − E₂) / η, where η = 2K(x₁,x₂) − K(x₁,x₁) − K(x₂,x₂) and Eᵢ = f(xᵢ) − yᵢ is the prediction error.
After updating α₂ (clipped to bounds [L, H]), we compute α₁:
# Update alpha_1 from the constraint
alpha_1_new = alpha_1_old + y1 * y2 * (alpha_2_old - alpha_2_new)

# Update bias using KKT conditions
b1 = b - E1 - y1*(alpha_1_new - alpha_1_old)*K[i1,i1] \
       - y2*(alpha_2_new - alpha_2_old)*K[i1,i2]
b2 = b - E2 - y1*(alpha_1_new - alpha_1_old)*K[i1,i2] \
       - y2*(alpha_2_new - alpha_2_old)*K[i2,i2]

if 0 < alpha_1_new < C:
    b_new = b1
elif 0 < alpha_2_new < C:
    b_new = b2
else:
    b_new = (b1 + b2) / 2

The Kernel Trick
The magic of implicit feature maps
Here's where things get really cool. Look at the dual formulation—it only uses inner products xᵢ·xⱼ. What if we replace those inner products with a more general function K(xᵢ, xⱼ)?
If K is a valid kernel (satisfies Mercer's condition), it corresponds to an inner product in some (possibly infinite-dimensional) feature space:

K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ)
The magic: we never need to compute φ(x) explicitly! The kernel computes the inner product in the transformed space directly.
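Here's a tiny numerical check of that claim using my own toy example: the degree-2 polynomial kernel (x · z)² gives exactly the same number as explicitly mapping 2-D points through φ(x) = (x₁², √2·x₁x₂, x₂²) and taking an ordinary dot product.

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D inputs
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly2_kernel(x, z):
    # Same quantity, computed without ever building phi
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))   # 16.0
print(poly2_kernel(x, z))       # 16.0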
The Kernel Trick Visualized
Different kernels work best for different data distributions
Popular kernels include:
def linear_kernel(x1, x2):
    return np.dot(x1, x2)

No transformation. Standard linear SVM.

def poly_kernel(x1, x2, d=3, r=1):
    return (r + np.dot(x1, x2)) ** d

Maps to polynomial feature space. Captures curves and interactions.

def rbf_kernel(x1, x2, gamma=0.1):
    return np.exp(-gamma * np.linalg.norm(x1 - x2) ** 2)

Maps to infinite dimensions! Most versatile kernel.
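Whichever kernel you choose, training needs the Gram matrix K[i, j] = K(xᵢ, xⱼ) and prediction only needs kernel evaluations against the support vectors. Here's a hedged sketch, assuming X is an (n, d) NumPy array and that alpha and b come from a solver like the SMO routine above:

import numpy as np

def gram_matrix(X, kernel):
    # K[i, j] = kernel(x_i, x_j), the matrix the dual/SMO training step works with
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

def kernel_decision(x_new, alpha, X, y, b, kernel):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b; only support vectors (alpha_i > 0) contribute
    sv = alpha > 1e-8
    return sum(a * yi * kernel(xi, x_new)
               for a, yi, xi in zip(alpha[sv], y[sv], X[sv])) + b

# e.g. kernel_decision(x, alpha, X, y, b, rbf_kernel)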
Why Kernels Work
Mercer's condition says that if K produces a positive semidefinite kernel matrix for every finite set of points, then some feature map φ exists with K(x, z) = φ(x) · φ(z). The dual never asks for φ itself, only for those inner products, so swapping the dot product for K quietly moves the linear SVM into that richer feature space.
Hyperparameters and Tuning
Tuning for performance
SVMs have several key hyperparameters that can play a large role in how the model trains:
Interactive: Hyperparameter Effects
Adjust the parameters to see how they affect the decision boundary and model complexity
Small C: wide margin, more violations allowed. May underfit.
Moderate C: balanced tradeoff. Good default starting point.
Large C: narrow margin, fewer violations. May overfit.
| Parameter | Effect | Tuning Guide |
|---|---|---|
| C | Regularization strength | Start with 1, try {0.1, 1, 10, 100} |
| γ (gamma) | RBF kernel width | Start with 1/n_features, try {0.01, 0.1, 1, 10} |
| degree | Polynomial degree | Usually 2 or 3, higher prone to overfitting |
| kernel | Feature mapping | Start with RBF, try linear if data is high-dim |
A general tuning strategy:
1. Start with the RBF kernel (most versatile)
2. Grid search C ∈ {0.1, 1, 10, 100} and γ ∈ {0.01, 0.1, 1, 10}
3. Use cross-validation to find the best combination (see the sketch after this list)
4. If RBF doesn't work well, try polynomial or linear
5. Look at support vector count—too few = underfitting, too many = overfitting
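In scikit-learn terms, steps 1 through 3 might look like the sketch below; the dataset is a stand-in and the grid values simply mirror the table above.

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Stand-in dataset; swap in your own X, y
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.01, 0.1, 1, 10],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
print("support vectors:", search.best_estimator_.n_support_.sum())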
When to Use SVMs
SVMs are not the flashiest models in 2025, but they remain devastatingly useful. They shine in specific situations:
✓ SVMs Shine When
- Dataset isn't huge (100s to 10,000s of samples)
- Many more features than samples (high-dimensional)
- You need a strong baseline before going deep
- Clear margin of separation exists
- You want probabilistic outputs (with Platt scaling)
- Memory efficiency matters at prediction time
✗ Avoid SVMs When
- Dataset is massive (millions of samples)
- Training time matters (kernel SVMs scale O(n²) to O(n³))
- You need online learning (streaming data)
- Kernel selection is unclear
- Data is very noisy with lots of overlap
- Interpretability of features is critical
Beyond the Basics
Though we've built SVMs from scratch, here are various ways to extend and optimize:
Multi-class SVMs
SVMs are inherently binary classifiers. For multi-class, use one-vs-rest (OvR) or one-vs-one (OvO) strategies. Sklearn handles this automatically.
SVM Regression (SVR)
SVMs can do regression too! Instead of maximizing margin between classes, SVR fits a tube of width ε around the data and penalizes points outside it.
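For example, a minimal scikit-learn sketch (the noisy sine data is just a stand-in):

import numpy as np
from sklearn.svm import SVR

# Stand-in regression data: a noisy sine curve
X = np.linspace(0, 6, 100).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.RandomState(0).randn(100)

# epsilon sets the half-width of the penalty-free tube around the fit
model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(model.predict([[1.5]]))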
Online SVMs
Pegasos and other algorithms enable stochastic gradient descent for SVMs, making them feasible for larger datasets.
Kernel Approximation
Random Fourier Features and Nyström approximation let you approximate RBF kernels in linear time, enabling SVMs on millions of samples.
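Here's a sketch of that idea with scikit-learn's RBFSampler (random Fourier features) feeding a plain linear SVM; the dataset size and n_components are placeholder choices:

from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

# Placeholder dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Approximate the RBF feature map with random Fourier features,
# then train an ordinary linear SVM in that explicit space
rff = RBFSampler(gamma=0.1, n_components=300, random_state=0)
X_feat = rff.fit_transform(X)
clf = LinearSVC().fit(X_feat, y)
print(clf.score(X_feat, y))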
Platt Scaling
SVMs output distances, not probabilities. Platt scaling fits a sigmoid to convert SVM outputs to calibrated probability estimates.
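In scikit-learn this is a single flag: probability=True fits the Platt sigmoid internally via cross-validation (toy data assumed):

from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

# probability=True fits a sigmoid on the decision values (Platt scaling)
clf = SVC(kernel="rbf", probability=True).fit(X, y)
print(clf.predict_proba(X[:3]))   # calibrated class probabilities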
Happy coding, and I hope you got the support you needed.