
Logistic Regression

Binary decisions through probability. The sigmoid function demystified.

Written by Omansh
9 min read

Despite the similar name, logistic regression does something fundamentally different from its cousin linear regression. It doesn't predict continuous values like house prices or temperatures. Instead, it predicts probabilities and makes binary (or even multi-class) decisions. Will this email be spam or not? Will this customer churn? Will this tumor be malignant or benign?

The jump from linear to logistic regression isn't huge, but the subtle differences matter a lot. If you haven't already, I'd recommend reading the previous post on linear regression first for proper context. Now, let's build it from the ground up and see what changes.

Why Logistic Regression?

Logistic regression is the go-to algorithm for binary classification. It's fast, interpretable, and probabilistic. Unlike methods that just give you a hard yes/no decision, logistic regression tells you “I'm 73% confident this is spam” or “There's a 15% chance this customer will churn.” That probability estimate is incredibly valuable in real-world applications where you need to make risk-based decisions.

The Foundation of Neural Networks

Logistic regression is the building block for more complex models. Neural networks are essentially stacked logistic regression units with fancy activation functions. Understanding this method deeply means understanding how modern deep learning actually works at its core.

From Linear to Logistic

The classification problem

Here's the issue with using linear regression for classification: linear regression outputs any real number from negative infinity to positive infinity. But for classification, we need probabilities, which must be between 0 and 1.

If we try to use linear regression for binary classification (treating 0 and 1 as our targets), we'd get predictions like -0.3 or 1.7, which don't make sense as probabilities. We need a way to squash our linear output into the [0, 1] range.

The Sigmoid Function

The magic transformation

The sigmoid function is our magic transformation. It takes any real number and maps it to a value between 0 and 1:

σ(z) = 1 / (1 + e^(−z))

Maps any real number to the range (0, 1)

When z → +∞, σ(z) → 1
When z = 0, σ(z) = 0.5
When z → −∞, σ(z) → 0

Interactive: The Sigmoid Function

Drag the slider to see how the sigmoid transforms any real number into a probability. At z = 0, σ(z) = 0.5: maximum uncertainty (50/50).

sigmoid.py

import numpy as np

def sigmoid(z):
    """Apply sigmoid function with numerical stability."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

Notice the np.clip? That prevents numerical overflow when z gets extremely large in magnitude. Without it, a very negative z would turn np.exp(-z) into the exponential of a huge number and overflow. These little details matter when you're building from scratch.

Beautiful Derivative

The sigmoid has another elegant property: it's differentiable everywhere, and its derivative has the form σ'(z) = σ(z)(1 − σ(z)), which makes gradient descent computations extremely clean.
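
As a quick sanity check (a standalone sketch, not part of the post's source code), you can compare a numerical derivative of the sigmoid against σ(z)(1 − σ(z)):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

z = np.linspace(-6, 6, 13)
h = 1e-6

# Central-difference estimate of the derivative
numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

# Closed form: sigma'(z) = sigma(z) * (1 - sigma(z))
analytic = sigmoid(z) * (1 - sigmoid(z))

print(np.max(np.abs(numerical - analytic)))  # tiny, on the order of 1e-10

The two agree to within floating-point noise, which is exactly what makes the gradient derivation later so clean.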

The Logistic Regression Model

Putting it together

Instead of our old linear regression model ŷ = Xw + b, we now work with:

ŷ = σ(Xw + b)

Pass the linear combination through sigmoid to get probabilities

We're still computing a linear combination of features, but we're passing it through the sigmoid. The linear part (Xw + b) is called the logit or log-odds, and the sigmoid transforms it into a probability.

The output ŷ now represents P(y = 1 | X) — the probability that the sample belongs to class 1 given the features. Consequently, P(y = 0 | X) = 1 − ŷ.
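
As a rough illustration with toy numbers (the values below are made up, not from the post), here is what a single forward pass ŷ = σ(Xw + b) looks like in NumPy:

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 0.5]])      # two samples, two features
w = np.array([0.8, -0.4])       # weights
b = 0.1                         # bias

z = X @ w + b                   # logits (log-odds)
y_hat = 1 / (1 + np.exp(-z))    # probabilities P(y = 1 | X)
print(z)      # [0.1 2.3]
print(y_hat)  # ≈ [0.525 0.909]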

Cross-Entropy Loss

Measuring classification error

Mean squared error doesn't work well for logistic regression. Why? Because when we're predicting probabilities and our true labels are 0 or 1, MSE creates a non-convex cost surface with local minima. We need a cost function specifically designed for probability predictions.

Cross-entropy comes from information theory and measures the difference between two probability distributions. It quantifies how “surprised” we are by the true label given our predictions. If we predict high probability for the correct class, cross-entropy is low. If we predict low probability, cross-entropy is high (we're surprised).

For binary classification, this becomes binary cross-entropy or log loss:

J(w) = −(1/n) Σ [y log(ŷ) + (1 − y) log(1 − ŷ)]

Penalizes confident wrong predictions heavily

This looks intimidating, but it's actually simple to follow:

When y = 1

Only the first term matters: log(ŷ). If we predict ŷ close to 1, loss is small. If we predict close to 0, loss explodes (we're confidently wrong).

When y = 0

Only the second term matters: log(1 − ŷ). If we predict ŷ close to 0, loss is small. If we predict close to 1, loss explodes.

cross_entropy.py

# In the fit method
z_full = X @ self.coefficients_
predictions_full = sigmoid(z_full)

# Add small epsilon to prevent log(0)
cost = -np.mean(
    y * np.log(predictions_full + 1e-15) +
    (1 - y) * np.log(1 - predictions_full + 1e-15)
)

Numerical Stability

The 1e-15 additions prevent taking log(0), which would blow the loss up to infinity. Always handle these edge cases in your implementations.
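
To see how hard confident mistakes get punished, here's a tiny standalone sketch (toy probabilities of my own choosing) of the per-sample loss −log(ŷ) when the true label is y = 1:

import numpy as np

y_hat = np.array([0.99, 0.9, 0.5, 0.1, 0.01])
loss = -np.log(y_hat)      # per-sample cross-entropy when y = 1
print(np.round(loss, 3))   # [0.01  0.105  0.693  2.303  4.605]

Predicting 0.99 for a true positive costs almost nothing; predicting 0.01 costs over 450 times as much.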

The Gradient

Surprisingly elegant

Something notable about cross-entropy with sigmoid is that the gradient has an incredibly clean form:

∂J/∂w = (1/n) Xᵀ(σ(Xw) − y)

Errors on the probability scale, weighted by features

Compare this to linear regression's gradient: ∂MSE/∂w = (2/n) Xᵀ(Xw − y). They're almost identical! The difference is we're computing errors on the probability scale (after sigmoid) rather than the raw prediction scale.

gradient_descent.py

# Inside the fit method: one epoch of mini-batch gradient descent
# (n_samples, batch_size, and the shuffled indices come from the surrounding method)
for i in range(0, n_samples, batch_size):
    batch_indices = indices[i:i + batch_size]
    X_batch = X[batch_indices]
    y_batch = y[batch_indices]

    # Forward pass: compute predictions
    z = X_batch @ self.coefficients_
    predictions = sigmoid(z)

    # Compute gradient of the cross-entropy loss
    gradient = (1 / len(batch_indices)) * X_batch.T @ (predictions - y_batch)

    # Update weights
    self.coefficients_ -= self.learning_rate * gradient

Making Predictions

From probabilities to classes

Once we've trained the model, making predictions is a two-step process:

Step 1: Get Probabilities

def predict_proba(self, X):
    z = X @ self.coefficients_
    return sigmoid(z)

Step 2: Apply Threshold

def predict(self, X, threshold=0.5):
    probs = self.predict_proba(X)
    return (probs >= threshold).astype(int)

The default threshold is 0.5 (predict class 1 if probability ≥ 0.5), but this is tunable. If false positives are more costly than false negatives, you might use a higher threshold like 0.7. If you want to catch all possible positive cases, use a lower threshold like 0.3.

Probabilistic Power

This is where the probabilistic nature of logistic regression shines. You can adjust the threshold based on your specific use case without retraining the model.
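
As a quick illustration (toy probabilities, not from the lesson's dataset), the same scores turn into different labels as the threshold moves:

import numpy as np

probs = np.array([0.92, 0.68, 0.55, 0.42, 0.28])

for threshold in (0.3, 0.5, 0.7):
    preds = (probs >= threshold).astype(int)
    print(threshold, preds)
# 0.3 [1 1 1 1 0]
# 0.5 [1 1 1 0 0]
# 0.7 [1 0 0 0 0]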

Evaluating Classification Models

Beyond accuracy

For regression, we used MSE and R². For classification, we need different metrics. Accuracy is the obvious one, but it's not always enough.

Accuracy

What percentage of predictions were correct? Simple but can be misleading with imbalanced classes.

Precision

Of all samples we predicted as positive, what fraction actually were? High precision = few false alarms.

Recall

Of all actual positive samples, what fraction did we identify? High recall = we catch most positive cases.

F1 Score

Harmonic mean of precision and recall. A single metric that balances both: F1 = 2 × (P × R) / (P + R)

Interactive: Confusion Matrix

Adjust the threshold to see how it affects predictions and metrics. The demo uses 12 samples with predicted probabilities ranging from 0.12 to 0.92; at the default threshold of 0.50 it produces:

Confusion Matrix (threshold = 0.50)

            Pred: 0   Pred: 1
Actual: 0   TN: 5     FP: 1
Actual: 1   FN: 1     TP: 5

Accuracy: 83.3% · Precision: 83.3% · Recall: 83.3% · F1 Score: 83.3%
metrics.py

def precision_score(y_true, y_pred):
    """Of predicted positives, how many were actually positive?"""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall_score(y_true, y_pred):
    """Of actual positives, how many did we catch?"""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall."""
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    return 2 * (p * r) / (p + r) if (p + r) > 0 else 0.0
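
The accuracy and confusion-matrix counts discussed above aren't in this excerpt; here's a hedged sketch of what they could look like in the same style (treat these helpers as an illustration, not the repo's actual code):

import numpy as np

def accuracy_score(y_true, y_pred):
    """What fraction of predictions were correct?"""
    return np.mean(y_true == y_pred)

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels. Illustrative helper."""
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tp = np.sum((y_true == 1) & (y_pred == 1))
    return tn, fp, fn, tp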

Why Not Normal Equation?

You might wonder why linear regression has a closed-form solution but logistic regression doesn't. With cross-entropy loss and the sigmoid, you can still take the derivative and set it to zero, but the resulting equations are non-linear in the weights and have no closed-form solution. Gradient descent is the universal tool here.

Beyond the Basics

Where to go from here

Now that we've built logistic regression from scratch, here are a few ways to extend and optimize the algorithm:

1. Multinomial Logistic Regression

Our implementation focuses on binary classification, but logistic regression extends to multiple classes using softmax regression. Instead of one sigmoid, you compute multiple outputs and normalize them so all class probabilities sum to 1 (see the sketch after this list).

2. Regularization

Add L1 (Lasso) or L2 (Ridge) penalties to the cost function to prevent overfitting. L1 can even drive some weights to exactly zero, giving you automatic feature selection.

3. Learning Rate Schedules & Momentum

Instead of a fixed learning rate, decay it over time or use adaptive methods like Adam. Momentum accumulates gradients from previous steps to smooth out convergence.

4. Class Weights

Handle imbalanced datasets by penalizing errors on rare classes more heavily. When you have 95% negative examples, weighting minority class errors higher forces the model to care about getting those rare cases right.
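
Here's the sketch promised in item 1: a minimal softmax example (toy logits, and a softmax helper that is my own illustration rather than part of the from-scratch implementation):

import numpy as np

def softmax(Z):
    """Row-wise softmax: subtract the row max for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

# Two samples, three classes: one logit per class (think X @ W + b,
# where W now has one column per class)
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 0.3]])
probs = softmax(logits)
print(probs.sum(axis=1))     # [1. 1.] — each row is a valid distribution
print(probs.argmax(axis=1))  # predicted classes: [0 2]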

Happy coding, and come back to this whenever you need a refresher.