In this post, I'm moving past just typing `from sklearn.linear_model import LinearRegression` and calling it a day. Sure, it works. Sure, it's convenient. But every time I use it, I feel like I'm just skimming past something fundamental. What's actually happening when I call .fit()? How does it find those coefficients? Why does it work?
I decided to stop relying on the black box and build it myself.
Why Linear Regression Still Matters
Before we dive into the math and code, let's talk about why linear regression is still relevant in 2025. Sure, we have transformers and diffusion models now, but linear regression is still the workhorse of ML for a ton of real-world problems: predicting house prices based on square footage, estimating sales from advertising spend, forecasting demand based on historical data.
Linear regression is fundamentally about finding relationships between variables. You have some input features (like square footage, number of bedrooms) and you want to predict a continuous output (like price). The beauty of linear regression is its simplicity and interpretability. When your model says “each additional bedroom adds $50k to the price,” you can actually understand and explain that to stakeholders.
The Foundation of ML
The Core Idea
Finding the best-fitting line
Linear regression assumes there's a linear relationship between your inputs and output. Mathematically, we're trying to find the best line (or hyperplane in multiple dimensions) that fits our data.
For a single feature, this looks like:
$$\hat{y} = wx + b$$
Here, ŷ (y-hat) is our prediction, w is the weight (slope), x is our input feature, and b is the bias (intercept).
For multiple features, it can be extended to:
$$\hat{y} = Xw + b$$
Our job is to find the weights w and bias b that make our predictions as close as possible to the actual values.
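To make the notation concrete, here's a minimal NumPy sketch of the prediction step. The feature matrix, weights, and bias below are made-up numbers purely for illustration:

```python
import numpy as np

# Toy data: 3 houses, 2 features (square footage in 1000s, bedrooms)
X = np.array([[1.2, 3],
              [2.0, 4],
              [0.8, 2]])

w = np.array([150.0, 50.0])  # one weight per feature (illustrative values)
b = 25.0                     # bias / intercept

# y_hat = Xw + b, computed for every row at once
y_hat = X @ w + b
print(y_hat)  # one prediction per house
```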
The Cost Function
Measuring how wrong we are
Before we can optimize anything, we need a way to measure how good our current predictions are. This is where the cost function comes in. For linear regression, we use Mean Squared Error (MSE):
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
This takes the difference between each actual value and predicted value, squares it (so negative errors don't cancel out positive ones), and averages them all.
```python
import numpy as np

def mse(y_true, y_pred):
    """Calculates the mean squared error between actual and predicted values."""
    return np.mean((y_true - y_pred) ** 2)

def r2_score(y_true, y_pred):
    """Calculates the R-squared coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - (ss_res / ss_tot)
```

R² Score Explained
R² measures the fraction of the variance in the target that the model explains: 1.0 is a perfect fit, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.
Approach 1: Gradient Descent
The most important optimization algorithm in ML
Now let's talk about how to actually find those optimal weights. The first approach is gradient descent, which is probably the most important optimization algorithm in all of ML.
The Intuition
Imagine you're standing on a mountain in thick fog and you want to get to the bottom. You can't see far, but you can feel which direction is downhill. So you take small steps downhill repeatedly until you reach the valley. That's gradient descent.
The mountain is our cost function, and we want to find the lowest point (minimum cost). The direction downhill is given by the gradient (partial derivatives of the cost with respect to each parameter).
The gradient tells us how to adjust our weights to reduce the cost. For MSE, the gradient with respect to our weights is:
$$\nabla_w \text{MSE} = \frac{2}{n}\, X^{T}\left(Xw - y\right)$$
Don't let this intimidate you. It's saying: “look at how wrong your predictions are (Xw − y), weight those errors by the input features (Xᵀ), and average them (2/n).” The gradient points in the direction of steepest increase of the cost, so to reduce the cost we'll step in the opposite direction.
Once we find this value, the update rule is simply:
$$w \leftarrow w - \alpha\, \nabla_w \text{MSE}$$
Where α (alpha) is the learning rate, controlling how big our steps are.
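To make the update rule concrete, here's what a single batch gradient step looks like in plain NumPy. This is just a sketch on randomly generated toy data (the bias term is left out for brevity), not the final implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # toy feature matrix
y = X @ np.array([3.0, -2.0]) + 1.0    # toy targets with known weights

w = np.zeros(2)      # start with zero weights
alpha = 0.1          # learning rate

# One gradient descent step on MSE
y_pred = X @ w
gradient = (2 / len(X)) * X.T @ (y_pred - y)
w = w - alpha * gradient
```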
Batch
Use all training data for each update. Most accurate gradient, but slow for large datasets.
Stochastic (SGD)
Use one random sample for each update. Fast but noisy.
Mini-Batch
Use a small batch of samples. Best of both worlds—faster than batch, more stable than SGD.
```python
class GradientDescentLinearRegression(LinearRegression):
    def __init__(self, learning_rate=0.01, max_iterations=1000,
                 tolerance=1e-6, fit_intercept=True, batch_size=None):
        super().__init__(fit_intercept)
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.tolerance = tolerance
        self.batch_size = batch_size
        self.cost_history_ = []

    def fit(self, X, y, verbose=False):
        X = np.array(X)
        y = np.array(y).flatten()

        # Add intercept term if needed
        if self.fit_intercept:
            X = np.column_stack([np.ones(X.shape[0]), X])

        # Initialize weights to zero
        self.coefficients_ = np.zeros(X.shape[1])
        n_samples = X.shape[0]
        batch_size = self.batch_size if self.batch_size else n_samples

        for iteration in range(self.max_iterations):
            prev_coef = self.coefficients_.copy()

            # Mini-batch updates
            for i in range(0, n_samples, batch_size):
                X_batch = X[i:i+batch_size]
                y_batch = y[i:i+batch_size]

                # Compute predictions and gradient
                y_pred = X_batch @ self.coefficients_
                gradient = (2 / len(X_batch)) * X_batch.T @ (y_pred - y_batch)

                # Update weights
                self.coefficients_ -= self.learning_rate * gradient

            # Track the full-dataset cost after each pass
            self.cost_history_.append(np.mean((X @ self.coefficients_ - y) ** 2))

            # Check for convergence
            if np.linalg.norm(self.coefficients_ - prev_coef) < self.tolerance:
                break

        # Split the intercept back out of the weight vector
        if self.fit_intercept:
            self.intercept_ = self.coefficients_[0]
            self.coefficients_ = self.coefficients_[1:]
        else:
            self.intercept_ = 0
```

Why Add a Column of Ones?
Prepending a column of ones to X folds the intercept into the weight vector: the first weight multiplies a constant 1 for every sample, so the single matrix product Xw handles both the feature weights and the bias.
Approach 2: The Normal Equation
The closed-form solution
Gradient descent is iterative, meaning we take many steps towards the ideal solution. But for linear regression, there's actually a closed-form solution. We can compute the optimal weights directly in one shot.
Remember we want to minimize MSE. In calculus, to find a minimum, we take the derivative and set it to zero. Working through the math, we get:
$$w = \left(X^{T}X\right)^{-1} X^{T} y$$
This gives us the optimal weights directly by solving a system of linear equations.
```python
class LinearRegressionFromScratch(LinearRegression):
    def fit(self, X, y):
        X = np.array(X)
        y = np.array(y).flatten()

        # Add intercept column
        if self.fit_intercept:
            X = np.column_stack([np.ones(X.shape[0]), X])

        # Apply normal equation
        try:
            self.coefficients_ = np.linalg.inv(X.T @ X) @ X.T @ y
        except np.linalg.LinAlgError:
            # If matrix is singular, use pseudoinverse
            self.coefficients_ = np.linalg.pinv(X.T @ X) @ X.T @ y

        # Extract intercept
        if self.fit_intercept:
            self.intercept_ = self.coefficients_[0]
            self.coefficients_ = self.coefficients_[1:]
        else:
            self.intercept_ = 0
```

Which to Use?
Choosing the right approach
The normal equation may look like the cleaner solution, but both approaches have drawbacks worth weighing, particularly as your training data grows.
| Feature | Gradient Descent | Normal Equation |
|---|---|---|
| Time Complexity | O(kn²) | O(n³) |
| Need to choose α | ||
| Need many iterations | ||
| Works well with large n | ||
| Feature scaling needed | ||
| Can get stuck in local minima | Possible (non-convex) |
n = number of features • k = number of iterations • α = learning rate
Rule of Thumb
If the number of features is modest (up to a few thousand), the normal equation is simple and exact. When features run into the tens of thousands, or the data is too large to process in one shot, reach for gradient descent (usually mini-batch).
Feature Scaling
Helping gradient descent converge faster
One thing you'll notice in the code is feature standardization:
```python
def standardize_features(X):
    """Transform features to have mean 0 and std 1."""
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    std[std == 0] = 1  # Avoid division by zero
    X_scaled = (X - mean) / std
    return X_scaled, mean, std
```

This transforms each feature to have mean 0 and standard deviation 1. For gradient descent, if your features are on different scales (like square footage in thousands vs number of bedrooms in single digits), the cost function becomes elongated. This makes gradient descent take a zig-zag path and converge slowly.
Without scaling: zig-zag convergence. With scaling: direct convergence.
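Putting the pieces together, a typical workflow is to scale the training features, fit, and reuse the same mean and std for any new inputs. This is a sketch using the classes defined above; X_train, y_train, and X_new are placeholder arrays standing in for your own data:

```python
# Placeholders: X_train, y_train, X_new are your own data arrays
X_scaled, mean, std = standardize_features(X_train)

model = GradientDescentLinearRegression(learning_rate=0.1, max_iterations=5000)
model.fit(X_scaled, y_train)

# New inputs must be scaled with the *training* mean and std before predicting
X_new_scaled = (X_new - mean) / std
predictions = X_new_scaled @ model.coefficients_ + model.intercept_
```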
Beyond the Basics
Where to go from here
Though we've built a fully functioning linear regression implementation, there's a whole world of optimizations and variations worth exploring:
Regularization
Add L1 (Lasso) or L2 (Ridge) penalties to the cost function to prevent overfitting. These techniques add a term that penalizes large weights, forcing the model to prefer simpler solutions.
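For a taste of how little changes, here's a minimal Ridge sketch: the closed-form solution just adds λ times the identity to XᵀX before solving (lambda_ is an illustrative name for the penalty strength, not part of the code above):

```python
import numpy as np

def ridge_normal_equation(X, y, lambda_=1.0):
    """Closed-form Ridge solution: w = (X^T X + lambda * I)^(-1) X^T y."""
    n_features = X.shape[1]
    # Adding lambda to the diagonal also makes X^T X invertible
    # (in practice you'd usually exclude the intercept column from the penalty)
    return np.linalg.solve(X.T @ X + lambda_ * np.eye(n_features), X.T @ y)
```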
Polynomial Regression
Transform your features into polynomial terms (x², x³, interactions like x₁x₂) to capture non-linear relationships while still using linear regression machinery.
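A minimal sketch of that transformation for a single feature (the helper name and degree are just for illustration; the expanded matrix then goes through the same linear regression code as before):

```python
import numpy as np

def polynomial_features(x, degree=3):
    """Expand a 1-D feature into columns [x, x^2, ..., x^degree]."""
    x = np.asarray(x).flatten()
    return np.column_stack([x ** d for d in range(1, degree + 1)])
```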
Learning Rate Schedules & Momentum
Instead of a fixed learning rate, decay it over time or use adaptive methods like momentum and Adam. Momentum accumulates gradients from previous steps to smooth out the path.
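As a rough sketch of the momentum idea on toy data (beta is the momentum coefficient; the values here are illustrative, not tuned):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))               # toy features
y = X @ np.array([1.0, -2.0, 0.5])          # toy targets with known weights

w = np.zeros(X.shape[1])
velocity = np.zeros_like(w)
learning_rate, beta = 0.05, 0.9             # beta = momentum coefficient

for _ in range(500):
    gradient = (2 / len(X)) * X.T @ (X @ w - y)
    velocity = beta * velocity + gradient    # carry forward part of the previous step
    w -= learning_rate * velocity
```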
Newton's Method
While gradient descent uses first derivatives, Newton's method uses second derivatives (the Hessian matrix) to make smarter steps. It converges much faster but is computationally expensive.
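For MSE specifically, the Hessian is just (2/n)·XᵀX, so a single Newton step from any starting point lands on the same answer as the normal equation. A small sketch:

```python
import numpy as np

def newton_step(X, y, w):
    """One Newton update: w - H^(-1) * gradient, with H = (2/n) X^T X."""
    n = len(X)
    gradient = (2 / n) * X.T @ (X @ w - y)
    hessian = (2 / n) * X.T @ X
    return w - np.linalg.solve(hessian, gradient)
```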
Coordinate Descent
Instead of updating all weights simultaneously, update one weight at a time while holding others fixed. This is the preferred optimization method for Lasso regression.
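Here's a sketch of the idea on the plain (unpenalized) least-squares objective; Lasso's version adds a soft-thresholding step to each coordinate update, which I've left out for brevity:

```python
import numpy as np

def coordinate_descent(X, y, n_passes=100):
    """Minimize MSE by updating one weight at a time, holding the rest fixed."""
    w = np.zeros(X.shape[1])
    for _ in range(n_passes):
        for j in range(X.shape[1]):
            # Residual with feature j's current contribution removed
            residual = y - X @ w + X[:, j] * w[j]
            w[j] = X[:, j] @ residual / (X[:, j] @ X[:, j])
    return w
```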
Happy coding, and may your gradients always descend smoothly.