In this post, I'm moving past just typing `from sklearn.linear_model import LinearRegression` and calling it a day. Sure, it works. Sure, it's convenient. But every time I use it, I feel like I'm just skimming past something fundamental. What's actually happening when I call .fit()? How does it find those coefficients? Why does it work?
I decided to stop relying on the black box and build it myself.
Why Linear Regression Still Matters
Before we dive into the math and code, let's talk about why linear regression is still relevant in 2025. Sure, we have transformers and diffusion models now, but linear regression is still the workhorse of ML for a ton of real-world problems: predicting house prices based on square footage, estimating sales from advertising spend, forecasting demand based on historical data.
Linear regression is fundamentally about finding relationships between variables. You have some input features (like square footage, number of bedrooms) and you want to predict a continuous output (like price). The beauty of linear regression is its simplicity and interpretability. When your model says “each additional bedroom adds $50k to the price,” you can actually understand and explain that to stakeholders.
The Foundation of ML
The Core Idea
Finding the best-fitting line
Linear regression assumes there's a linear relationship between your inputs and output. Mathematically, we're trying to find the best line (or hyperplane in multiple dimensions) that fits our data.
For a single feature, this looks like:
$$\hat{y} = wx + b$$
Here, ŷ (y-hat) is our prediction, w is the weight (slope), x is our input feature, and b is the bias (intercept).
For multiple features, it can be extended to:
$$\hat{y} = Xw + b$$
Our job is to find the weights w and bias b that make our predictions as close as possible to the actual values.
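To make the notation concrete, here's a minimal NumPy sketch of the prediction step. The feature matrix, weights, and bias below are made-up numbers purely for illustration:

```python
import numpy as np

# Toy data: 3 houses, 2 features (square footage in 1000s, bedrooms)
X = np.array([[1.2, 3],
              [2.0, 4],
              [0.8, 2]])

w = np.array([150.0, 50.0])  # one weight per feature (illustrative values)
b = 25.0                     # bias / intercept

# y_hat = Xw + b, computed for every row at once
y_hat = X @ w + b
print(y_hat)  # one prediction per house
```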
The Cost Function
Measuring how wrong we are
Before we can optimize anything, we need a way to measure how good our current predictions are. This is where the cost function comes in. For linear regression, we use Mean Squared Error (MSE):
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
This takes the difference between each actual value and predicted value, squares it (so negative errors don't cancel out positive ones), and averages them all.
```python
import numpy as np

def mse(y_true, y_pred):
    """Calculates the mean squared error between actual and predicted values."""
    return np.mean((y_true - y_pred) ** 2)

def r2_score(y_true, y_pred):
    """Calculates the R-squared coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - (ss_res / ss_tot)
```

R² Score Explained
R² measures the fraction of the variance in the target that the model explains: 1.0 is a perfect fit, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.
Approach 1: Gradient Descent
The most important optimization algorithm in ML
Now let's talk about how to actually find those optimal weights. The first approach is gradient descent, which is probably the most important optimization algorithm in all of ML.
The Intuition
Imagine you're standing on a mountain in thick fog and you want to get to the bottom. You can't see far, but you can feel which direction is downhill. So you take small steps downhill repeatedly until you reach the valley. That's gradient descent.
The mountain is our cost function, and we want to find the lowest point (minimum cost). The direction downhill is given by the gradient (partial derivatives of the cost with respect to each parameter).
The gradient tells us how to adjust our weights to reduce the cost. For MSE, the gradient with respect to our weights is:
$$\nabla_w \text{MSE} = \frac{2}{n}\, X^{T}\left(Xw - y\right)$$
Don't let this intimidate you. It's saying: “look at how wrong your predictions are (Xw − y), weight those errors by the input features (Xᵀ), and average them (2/n).” The gradient points in the direction of steepest increase of the cost, so to reduce the cost we'll step in the opposite direction.
Once we find this value, the update rule is simply:
$$w \leftarrow w - \alpha\, \nabla_w \text{MSE}$$
Where α (alpha) is the learning rate, controlling how big our steps are.
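To make the update rule concrete, here's what a single batch gradient step looks like in plain NumPy. This is just a sketch on randomly generated toy data (the bias term is left out for brevity), not the final implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # toy feature matrix
y = X @ np.array([3.0, -2.0]) + 1.0    # toy targets with known weights

w = np.zeros(2)      # start with zero weights
alpha = 0.1          # learning rate

# One gradient descent step on MSE
y_pred = X @ w
gradient = (2 / len(X)) * X.T @ (y_pred - y)
w = w - alpha * gradient
```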
Batch
Use all training data for each update. Most accurate gradient, but slow for large datasets.
Stochastic (SGD)
Use one random sample for each update. Fast but noisy.
Mini-Batch
Use a small batch of samples. Best of both worlds—faster than batch, more stable than SGD.
```python
class GradientDescentLinearRegression(LinearRegression):
    def __init__(self, learning_rate=0.01, max_iterations=1000,
                 tolerance=1e-6, fit_intercept=True, batch_size=None):
        super().__init__(fit_intercept)
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.tolerance = tolerance
        self.batch_size = batch_size
        self.cost_history_ = []

    def fit(self, X, y, verbose=False):
        X = np.array(X)
        y = np.array(y).flatten()

        # Add intercept term if needed
        if self.fit_intercept:
            X = np.column_stack([np.ones(X.shape[0]), X])

        # Initialize weights to zero
        self.coefficients_ = np.zeros(X.shape[1])
        n_samples = X.shape[0]
        batch_size = self.batch_size if self.batch_size else n_samples

        for iteration in range(self.max_iterations):
            prev_coef = self.coefficients_.copy()

            # Mini-batch updates
            for i in range(0, n_samples, batch_size):
                X_batch = X[i:i+batch_size]
                y_batch = y[i:i+batch_size]

                # Compute predictions and gradient
                y_pred = X_batch @ self.coefficients_
                gradient = (2 / len(X_batch)) * X_batch.T @ (y_pred - y_batch)

                # Update weights
                self.coefficients_ -= self.learning_rate * gradient

            # Track the full-dataset cost after each pass
            self.cost_history_.append(np.mean((X @ self.coefficients_ - y) ** 2))

            # Check for convergence
            if np.linalg.norm(self.coefficients_ - prev_coef) < self.tolerance:
                break

        # Split the intercept back out of the weight vector
        if self.fit_intercept:
            self.intercept_ = self.coefficients_[0]
            self.coefficients_ = self.coefficients_[1:]
        else:
            self.intercept_ = 0
```

Why Add a Column of Ones?
Prepending a column of ones to X folds the intercept into the weight vector: the first weight multiplies a constant 1 for every sample, so the single matrix product Xw handles both the feature weights and the bias.
Approach 2: The Normal Equation
The closed-form solution
Gradient descent is iterative, meaning we take many steps towards the ideal solution. But for linear regression, there's actually a closed-form solution. We can compute the optimal weights directly in one shot.
Remember we want to minimize MSE. In calculus, to find a minimum, we take the derivative and set it to zero. Working through the math, we get:
$$w = \left(X^{T}X\right)^{-1} X^{T} y$$
This gives us the optimal weights directly by solving a system of linear equations.
```python
class LinearRegressionFromScratch(LinearRegression):
    def fit(self, X, y):
        X = np.array(X)
        y = np.array(y).flatten()

        # Add intercept column
        if self.fit_intercept:
            X = np.column_stack([np.ones(X.shape[0]), X])

        # Apply normal equation
        try:
            self.coefficients_ = np.linalg.inv(X.T @ X) @ X.T @ y
        except np.linalg.LinAlgError:
            # If matrix is singular, use pseudoinverse
            self.coefficients_ = np.linalg.pinv(X.T @ X) @ X.T @ y

        # Extract intercept
        if self.fit_intercept:
            self.intercept_ = self.coefficients_[0]
            self.coefficients_ = self.coefficients_[1:]
        else:
            self.intercept_ = 0
```

Which to Use?
Choosing the right approach
The normal equation may look like the cleaner solution, but both approaches have drawbacks worth weighing, particularly as your training data grows.
| Feature | Gradient Descent | Normal Equation |
|---|---|---|
| Time Complexity | O(kn²) | O(n³) |
| Need to choose α | ||
| Need many iterations | ||
| Works well with large n | ||
| Feature scaling needed | ||
| Can get stuck in local minima | Possible (non-convex) |
n = number of features • k = number of iterations • α = learning rate
Rule of Thumb
If the number of features is modest (up to a few thousand), the normal equation is simple and exact. When features run into the tens of thousands, or the data is too large to process in one shot, reach for gradient descent (usually mini-batch).
Feature Scaling
Helping gradient descent converge faster
One thing you'll notice in the code is feature standardization:
```python
def standardize_features(X):
    """Transform features to have mean 0 and std 1."""
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    std[std == 0] = 1  # Avoid division by zero
    X_scaled = (X - mean) / std
    return X_scaled, mean, std
```

This transforms each feature to have mean 0 and standard deviation 1. For gradient descent, if your features are on different scales (like square footage in thousands vs number of bedrooms in single digits), the cost function becomes elongated. This makes gradient descent take a zig-zag path and converge slowly.
Without scaling: zig-zag convergence. With scaling: direct convergence.
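Putting the pieces together, a typical workflow is to scale the training features, fit, and reuse the same mean and std for any new inputs. This is a sketch using the classes defined above; X_train, y_train, and X_new are placeholder arrays standing in for your own data:

```python
# Placeholders: X_train, y_train, X_new are your own data arrays
X_scaled, mean, std = standardize_features(X_train)

model = GradientDescentLinearRegression(learning_rate=0.1, max_iterations=5000)
model.fit(X_scaled, y_train)

# New inputs must be scaled with the *training* mean and std before predicting
X_new_scaled = (X_new - mean) / std
predictions = X_new_scaled @ model.coefficients_ + model.intercept_
```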
Beyond the Basics
Where to go from here
Though we've built a fully functioning linear regression implementation, there's a whole world of optimizations and variations worth exploring:
Regularization
Add L1 (Lasso) or L2 (Ridge) penalties to the cost function to prevent overfitting. These techniques add a term that penalizes large weights, forcing the model to prefer simpler solutions.
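For a taste of how little changes, here's a minimal Ridge sketch: the closed-form solution just adds λ times the identity to XᵀX before solving (lambda_ is an illustrative name for the penalty strength, not part of the code above):

```python
import numpy as np

def ridge_normal_equation(X, y, lambda_=1.0):
    """Closed-form Ridge solution: w = (X^T X + lambda * I)^(-1) X^T y."""
    n_features = X.shape[1]
    # Adding lambda to the diagonal also makes X^T X invertible
    # (in practice you'd usually exclude the intercept column from the penalty)
    return np.linalg.solve(X.T @ X + lambda_ * np.eye(n_features), X.T @ y)
```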
Polynomial Regression
Transform your features into polynomial terms (x², x³, interactions like x₁x₂) to capture non-linear relationships while still using linear regression machinery.
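A minimal sketch of that transformation for a single feature (the helper name and degree are just for illustration; the expanded matrix then goes through the same linear regression code as before):

```python
import numpy as np

def polynomial_features(x, degree=3):
    """Expand a 1-D feature into columns [x, x^2, ..., x^degree]."""
    x = np.asarray(x).flatten()
    return np.column_stack([x ** d for d in range(1, degree + 1)])
```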
Learning Rate Schedules & Momentum
Instead of a fixed learning rate, decay it over time or use adaptive methods like momentum and Adam. Momentum accumulates gradients from previous steps to smooth out the path.
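As a rough sketch of the momentum idea on toy data (beta is the momentum coefficient; the values here are illustrative, not tuned):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))               # toy features
y = X @ np.array([1.0, -2.0, 0.5])          # toy targets with known weights

w = np.zeros(X.shape[1])
velocity = np.zeros_like(w)
learning_rate, beta = 0.05, 0.9             # beta = momentum coefficient

for _ in range(500):
    gradient = (2 / len(X)) * X.T @ (X @ w - y)
    velocity = beta * velocity + gradient    # carry forward part of the previous step
    w -= learning_rate * velocity
```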
Newton's Method
While gradient descent uses first derivatives, Newton's method uses second derivatives (the Hessian matrix) to make smarter steps. It converges much faster but is computationally expensive.
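For MSE specifically, the Hessian is just (2/n)·XᵀX, so a single Newton step from any starting point lands on the same answer as the normal equation. A small sketch:

```python
import numpy as np

def newton_step(X, y, w):
    """One Newton update: w - H^(-1) * gradient, with H = (2/n) X^T X."""
    n = len(X)
    gradient = (2 / n) * X.T @ (X @ w - y)
    hessian = (2 / n) * X.T @ X
    return w - np.linalg.solve(hessian, gradient)
```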
Coordinate Descent
Instead of updating all weights simultaneously, update one weight at a time while holding others fixed. This is the preferred optimization method for Lasso regression.
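Here's a sketch of the idea on the plain (unpenalized) least-squares objective; Lasso's version adds a soft-thresholding step to each coordinate update, which I've left out for brevity:

```python
import numpy as np

def coordinate_descent(X, y, n_passes=100):
    """Minimize MSE by updating one weight at a time, holding the rest fixed."""
    w = np.zeros(X.shape[1])
    for _ in range(n_passes):
        for j in range(X.shape[1]):
            # Residual with feature j's current contribution removed
            residual = y - X @ w + X[:, j] * w[j]
            w[j] = X[:, j] @ residual / (X[:, j] @ X[:, j])
    return w
```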
Happy coding, and may your gradients always descend smoothly.