Despite the similar name, logistic regression does something fundamentally different from its cousin linear regression. It doesn't predict continuous values like house prices or temperatures. Instead, it predicts probabilities and makes binary (or even multi-class) decisions. Will this email be spam or not? Will this customer churn? Will this tumor be malignant or benign?
The jump from linear to logistic regression isn't huge, but the subtle differences matter a lot. Before diving in, I'd recommend reading my previous post on linear regression for proper context. Now, let's build it from the ground up and see what changes.
Why Logistic Regression?
Logistic regression is the go-to algorithm for binary classification. It's fast, interpretable, and probabilistic. Unlike methods that just give you a hard yes/no decision, logistic regression tells you “I'm 73% confident this is spam” or “There's a 15% chance this customer will churn.” That probability estimate is incredibly valuable in real-world applications where you need to make risk-based decisions.
The Foundation of Neural Networks
A single logistic regression unit, a linear combination of inputs passed through a sigmoid, is essentially one artificial neuron; stack many of them and you get a neural network.
From Linear to Logistic
The classification problem
Here's the issue with using linear regression for classification: linear regression outputs any real number from negative infinity to positive infinity. But for classification, we need probabilities, which must be between 0 and 1.
If we try to use linear regression for binary classification (treating 0 and 1 as our targets), we'd get predictions like -0.3 or 1.7, which don't make sense as probabilities. We need a way to squash our linear output into the [0, 1] range.
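To make that concrete, here's a quick throwaway sketch (made-up numbers, not part of the implementation we'll build): fit an ordinary least-squares line to 0/1 labels and watch the predictions leave the valid range.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # one made-up feature
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])   # binary labels treated as numbers

# Fit an ordinary least-squares line: prediction = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

print(slope * 0.0 + intercept)    # about -0.4: below 0, not a valid probability
print(slope * 10.0 + intercept)   # about 2.2: above 1, not a valid probability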
The Sigmoid Function
The magic transformation
The sigmoid function is our magic transformation. It takes any real number and maps it to a value between 0 and 1:
σ(z) = 1 / (1 + e^(−z))
Maps any real number to the range (0, 1)
Interactive: The Sigmoid Function
Drag the slider to see how the sigmoid transforms any real number to a probability
At z = 0, σ(0) = 0.5: maximum uncertainty (50/50)
def sigmoid(z):
    """Apply sigmoid function with numerical stability."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

Notice the np.clip? It prevents numerical overflow when z gets extremely large or small. Without it, a very negative z would hand np.exp a huge positive argument and overflow. These little details matter when you're building from scratch.
Beautiful Derivative
The sigmoid also has a remarkably clean derivative: σ′(z) = σ(z)(1 − σ(z)). That identity is one reason the gradient of the cross-entropy loss works out so neatly later on.
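If you want to convince yourself of that identity without doing the calculus, a quick finite-difference check (just a scratch snippet reusing the sigmoid from above, not part of the model) does the job:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

z = np.linspace(-5, 5, 11)
h = 1e-6

numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # finite-difference estimate
analytic = sigmoid(z) * (1 - sigmoid(z))                # σ(z)(1 − σ(z))

print(np.max(np.abs(numeric - analytic)))  # tiny, around 1e-10 or smaller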
The Logistic Regression Model
Putting it together
Instead of our old linear regression model ŷ = Xw + b, we now work with:
ŷ = σ(Xw + b)
Pass the linear combination through sigmoid to get probabilities
We're still computing a linear combination of features, but we're passing it through the sigmoid. The linear part (Xw + b) is called the logit or log-odds, and the sigmoid transforms it into a probability.
The output ŷ now represents P(y = 1 | X) — the probability that the sample belongs to class 1 given the features. Consequently, P(y = 0 | X) = 1 − ŷ.
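To see why the linear part earns the name log-odds, invert the sigmoid: log(ŷ / (1 − ŷ)) recovers Xw + b. A tiny numeric check, with a made-up logit value:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

z = 1.2                        # a made-up logit, i.e. some Xw + b
p = sigmoid(z)                 # P(y = 1 | X)

print(p)                       # ~0.769
print(np.log(p / (1 - p)))     # 1.2 -- inverting the sigmoid recovers the logit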
Cross-Entropy Loss
Measuring classification error
Mean squared error doesn't work well for logistic regression. Why? Because once the sigmoid is involved, MSE produces a non-convex cost surface that gradient descent can get stuck in. We need a cost function designed specifically for probability predictions.
Cross-entropy comes from information theory and measures the difference between two probability distributions. It quantifies how “surprised” we are by the true label given our predictions. If we predict high probability for the correct class, cross-entropy is low. If we predict low probability, cross-entropy is high (we're surprised).
For binary classification, this becomes binary cross-entropy or log loss:
L = −(1/n) Σ [ y × log(ŷ) + (1 − y) × log(1 − ŷ) ]
Penalizes confident wrong predictions heavily
This looks intimidating, but it's actually simple to follow:
When y = 1
Only the first term matters: log(ŷ). If we predict ŷ close to 1, loss is small. If we predict close to 0, loss explodes (we're confidently wrong).
When y = 0
Only the second term matters: log(1 − ŷ). If we predict ŷ close to 0, loss is small. If we predict close to 1, loss explodes.
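A few concrete numbers (made-up predictions, purely for illustration) show how hard the loss punishes confident mistakes:

import numpy as np

def log_loss_single(y, y_hat):
    """Binary cross-entropy for one sample."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# True label is 1
print(log_loss_single(1, 0.9))   # ~0.11  confident and right: small loss
print(log_loss_single(1, 0.5))   # ~0.69  unsure: moderate loss
print(log_loss_single(1, 0.01))  # ~4.61  confident and wrong: the loss explodes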
# In the fit method
z_full = X @ self.coefficients_
predictions_full = sigmoid(z_full)

# Add small epsilon to prevent log(0)
cost = -np.mean(
    y * np.log(predictions_full + 1e-15) +
    (1 - y) * np.log(1 - predictions_full + 1e-15)
)

Numerical Stability
The 1e-15 added inside each log keeps the cost finite if a prediction lands exactly on 0 or 1, where the log would otherwise shoot off to infinity.
The Gradient
Surprisingly elegant
Something notable about cross-entropy with sigmoid is that the gradient has an incredibly clean form:
∂L/∂w = (1/n) Xᵀ(ŷ − y)
Errors on the probability scale, weighted by features
Compare this to linear regression's gradient: ∂MSE/∂w = (2/n) Xᵀ(Xw − y). They're almost identical! The difference is that we're computing errors on the probability scale (after the sigmoid) rather than on the raw prediction scale.
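If you'd like to verify that clean form, a finite-difference check against the cross-entropy cost works nicely. This is a standalone scratch snippet that mirrors the conventions of the training loop below (weights in a single vector, no separate bias):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def cost(w, X, y, eps=1e-15):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))             # 20 samples, 3 features
y = (rng.random(20) > 0.5).astype(float)
w = rng.normal(size=3)                   # arbitrary weights to test at

# Analytic gradient: (1/n) Xᵀ(ŷ − y)
analytic = X.T @ (sigmoid(X @ w) - y) / len(y)

# Finite-difference gradient, one coordinate at a time
numeric = np.zeros_like(w)
h = 1e-6
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = h
    numeric[j] = (cost(w + step, X, y) - cost(w - step, X, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # should be tiny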
for i in range(0, n_samples, batch_size):
    batch_indices = indices[i:i+batch_size]
    X_batch = X[batch_indices]
    y_batch = y[batch_indices]

    # Forward pass: compute predictions
    z = X_batch @ self.coefficients_
    predictions = sigmoid(z)

    # Compute gradient
    gradient = (1 / len(batch_indices)) * X_batch.T @ (predictions - y_batch)

    # Update weights
    self.coefficients_ -= self.learning_rate * gradient

Making Predictions
From probabilities to classes
Once we've trained the model, making predictions is a two-step process:
Step 1: Get Probabilities
def predict_proba(self, X):
    z = X @ self.coefficients_
    return sigmoid(z)

Step 2: Apply Threshold
def predict(self, X, threshold=0.5):
    probs = self.predict_proba(X)
    return (probs >= threshold).astype(int)

The default threshold is 0.5 (predict class 1 if probability ≥ 0.5), but it's tunable. If false positives are more costly than false negatives, you might raise it to something like 0.7. If you want to catch as many positive cases as possible, lower it to something like 0.3.
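As a quick usage sketch (the probabilities here are made up, as if they came from predict_proba), the same scores support several operating points:

import numpy as np

# Pretend these probabilities came from model.predict_proba(X) on five samples
probs = np.array([0.15, 0.45, 0.55, 0.72, 0.91])

print((probs >= 0.7).astype(int))  # [0 0 0 1 1]  stricter: fewer false alarms
print((probs >= 0.5).astype(int))  # [0 0 1 1 1]  default threshold
print((probs >= 0.3).astype(int))  # [0 1 1 1 1]  looser: catches more positives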
Probabilistic Power
Because the model outputs probabilities rather than hard labels, the decision threshold becomes a knob you can tune to match the costs of your application.
Evaluating Classification Models
Beyond accuracy
For regression, we used MSE and R². For classification, we need different metrics. Accuracy is the obvious one, but it's not always enough.
Accuracy
What percentage of predictions were correct? Simple, but it can be misleading with imbalanced classes.
Precision
Of all samples we predicted as positive, what fraction actually were? High precision = few false alarms.
Recall
Of all actual positive samples, what fraction did we identify? High recall = we catch most positive cases.
F1 Score
Harmonic mean of precision and recall. A single metric that balances both: F1 = 2 × (P × R) / (P + R)
Interactive: Confusion Matrix
Adjust the threshold to see how it affects predictions and metrics
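Accuracy isn't in the snippet below, but under the same NumPy conventions it's a one-liner. Here's a minimal sketch:

import numpy as np

def accuracy_score(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return np.mean(y_true == y_pred)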
def precision_score(y_true, y_pred):
    """Of predicted positives, how many were actually positive?"""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall_score(y_true, y_pred):
    """Of actual positives, how many did we catch?"""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall."""
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    return 2 * (p * r) / (p + r) if (p + r) > 0 else 0.0

Why Not Normal Equation?
Unlike linear regression, logistic regression has no closed-form solution: the sigmoid makes the gradient equations nonlinear in the weights, so we solve them iteratively with gradient descent instead.
Beyond the Basics
Where to go from here
We've built logistic regression from scratch, but there are plenty of ways to extend and optimize it:
Multinomial Logistic Regression
Our implementation focuses on binary classification, but logistic regression extends to multiple classes using softmax regression. Instead of one sigmoid, you compute multiple outputs and normalize them so all class probabilities sum to 1.
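For intuition, a softmax might look like the sketch below. It's not part of the binary implementation above; the scores stand in for a hypothetical X @ W with one column of W per class:

import numpy as np

def softmax(scores):
    """Turn a vector of per-class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 0.5, -1.0])  # stand-in for one sample's X @ W (3 classes)
print(softmax(scores))               # roughly [0.79, 0.18, 0.04]
print(softmax(scores).sum())         # 1.0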
Regularization
Add L1 (Lasso) or L2 (Ridge) penalties to the cost function to prevent overfitting. L1 can even drive some weights to exactly zero, giving you automatic feature selection.
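In gradient-descent form, L2 regularization only adds one term to the gradient. Here's a rough sketch, where lambda_ is a hypothetical strength hyperparameter (in practice you'd usually leave the bias unpenalized):

import numpy as np

def l2_regularized_gradient(X, y, w, lambda_=0.01):
    """Cross-entropy gradient plus the gradient of the penalty lambda_ * ||w||^2."""
    p = 1 / (1 + np.exp(-np.clip(X @ w, -500, 500)))  # sigmoid
    data_gradient = X.T @ (p - y) / len(y)
    penalty_gradient = 2 * lambda_ * w                # shrinks weights toward zero
    return data_gradient + penalty_gradient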
Learning Rate Schedules & Momentum
Instead of a fixed learning rate, decay it over time or use adaptive methods like Adam. Momentum accumulates gradients from previous steps to smooth out convergence.
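Adam is more involved, but classical momentum is only a couple of extra lines. A sketch with hypothetical names (velocity, momentum) that aren't part of the implementation above:

import numpy as np

learning_rate = 0.1
momentum = 0.9
weights = np.zeros(3)                    # stand-in weight vector
velocity = np.zeros_like(weights)        # running accumulation of past gradients

gradient = np.array([0.2, -0.1, 0.05])   # stand-in gradient for one step

# Momentum update: keep a fraction of the previous step, then add the new one
velocity = momentum * velocity - learning_rate * gradient
weights += velocity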
Class Weights
Handle imbalanced datasets by penalizing errors on rare classes more heavily. When you have 95% negative examples, weighting minority class errors higher forces the model to care about getting those rare cases right.
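One common way to implement this is to weight each sample's contribution to the gradient by its class weight. A rough sketch with made-up weights (19:1 roughly balances a 95/5 split):

import numpy as np

def weighted_gradient(X, y, w, weight_pos=19.0, weight_neg=1.0):
    """Cross-entropy gradient where errors on the rare positive class count more."""
    p = 1 / (1 + np.exp(-np.clip(X @ w, -500, 500)))           # sigmoid
    sample_weights = np.where(y == 1, weight_pos, weight_neg)  # per-sample weight
    return X.T @ (sample_weights * (p - y)) / np.sum(sample_weights)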
Happy coding, and recall this whenever you need it.