Supervised · Classification

Naive Bayes

Probabilistic classification using Bayes' theorem. Naive yet powerful.

Written by Omansh
7 min read

Bayes' theorem is one of the most fundamental formulas in statistics, and for good reason. It gives us a principled way to update our beliefs as we gather evidence, turning prior knowledge and new observations into posterior probabilities. In my life, I'm constantly using this theorem as a way of thinking. In machine learning, we can leverage this to build classifiers that think in terms of probabilities rather than boundaries.

Instead of asking “where should I draw the line between classes?” like logistic regression or SVMs, Naive Bayes asks “what's the probability this sample belongs to each class?” It's called “naive” because it makes a strong independence assumption that's often violated in practice, yet somehow it works remarkably well anyway. Let's build it from scratch and understand why this probabilistic approach is so effective.

Before You Continue

I'd recommend watching 3Blue1Brown's explanation of Bayes' Theorem on YouTube. His visual intuition is something I think about not just in ML/Math, but as I make assumptions throughout my life.

Bayes' Theorem: The Foundation

The foundation of probabilistic reasoning

Everything starts with Bayes' theorem. All it's saying is that the probability of H given E equals the probability of E given H, times the prior probability of H, divided by the probability of E.

P(H|E)=P(E|H)×P(H)/P(E)

Update beliefs by combining prior knowledge with new evidence

For classification, we rewrite this as:

P(y|X)=P(X|y)×P(y)/P(X)

Probability of class given features

P(y|X) — Posterior

Probability of class y given features X. This is what we want to find.

P(X|y) — Likelihood

Probability of seeing features X if the class is y.

P(y) — Prior

Probability of class y before seeing any features. Just count class frequencies.

P(X) — Evidence

Probability of seeing features X. Same for all classes, so we can ignore it for comparison.

To classify a new sample, we compute P(y|X) for each class and pick the one with highest probability. Since P(X) is the same for all classes, we can simplify to:

P(y|X) ∝ P(X|y) × P(y)

We just need to compute the likelihood and prior for each class
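
To make the decision rule concrete, here's a minimal sketch for a single sample in a two-class problem. The prior and likelihood numbers are made up purely for illustration.

# Hypothetical priors and likelihoods for one sample (illustrative numbers only)
priors = {"spam": 0.3, "ham": 0.7}          # P(y)
likelihoods = {"spam": 0.05, "ham": 0.01}   # P(X|y) for this particular sample

# Score each class with P(y) * P(X|y); P(X) is dropped because it's the same for both
scores = {c: priors[c] * likelihoods[c] for c in priors}
prediction = max(scores, key=scores.get)

print(scores)       # {'spam': 0.015, 'ham': 0.007}
print(prediction)   # spam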

Interactive: Bayes' Theorem for Spam Detection

Adjust the probabilities to see how prior beliefs and evidence combine

Suppose 30% of all emails are spam, a spam email contains "FREE" 80% of the time, and a non-spam email contains it only 10% of the time. Then:

P(contains "FREE") = 0.80 × 0.30 + 0.10 × 0.70 = 0.310
P(Spam | contains "FREE") = P("FREE"|spam) × P(spam) / P("FREE") = 0.240 / 0.310 ≈ 77.4%

Leaning spam, but not definitive.
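
That number drops straight out of the formula; a few lines of Python reproduce it.

# Reproduce the spam example above
p_spam = 0.30              # prior: 30% of all emails are spam
p_free_given_spam = 0.80   # spam contains "FREE" 80% of the time
p_free_given_ham = 0.10    # non-spam contains "FREE" 10% of the time

# Evidence P("FREE") via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(round(p_free, 3))             # 0.31
print(round(p_spam_given_free, 3))  # 0.774 -> 77.4%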

The “Naive” Independence Assumption

A simplifying assumption that works

Here's where the "naive" part comes in. Computing P(X|y) directly is hard when X has many features. If you have 10 binary features, there are 2¹⁰ = 1024 possible combinations, and you'd need to estimate the probability of each combination for each class. Worse, in a small to mid-sized dataset, most of those combinations would appear only once or twice, if at all.

The Naive Assumption

All features are conditionally independent given the class.

P(X|y) = P(x₁|y) × P(x₂|y) × ⋯ × P(xₙ|y)

This is almost always false in reality; in spam detection, the words "free" and "money" clearly tend to appear together. But the assumption makes the math tractable: instead of estimating one probability for every feature combination, we estimate one probability per feature per class.
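
A quick back-of-the-envelope count makes the savings concrete (ignoring the constraint that probabilities sum to one, this is just rough bookkeeping):

n_features = 10   # binary features
n_classes = 2

# Full joint P(X|y): one probability per feature combination, per class
full_joint = n_classes * 2 ** n_features   # 2 * 1024 = 2048 probabilities

# Naive Bayes: one probability per feature, per class
naive = n_classes * n_features             # 2 * 10 = 20 probabilities

print(full_joint, naive)                   # 2048 20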

Why does this “naive” assumption work?
1. We only care about which class is most probable, not the exact probabilities
2. Even if the independence assumption is violated, the ranking of classes is often still correct
3. It's a form of regularization that prevents overfitting

The General Algorithm

Simple yet powerful

For any Naive Bayes classifier, the algorithm is beautifully simple:

Training

1. Calculate P(y) for each class (just count frequencies)
2. Calculate P(xᵢ|y) for each feature i and class y
3. Store these probabilities

Prediction

1. For each class y, compute P(y) × ∏ᵢ P(xᵢ|y)
2. Pick the class with the highest value

How do we calculate P(xᵢ|y)? This depends on the type of features, which is why we have different variants of Naive Bayes.

Gaussian Naive Bayes

For continuous features

When features are continuous (like height, temperature, age), we assume they follow a Gaussian (normal) distribution within each class.

P(xᵢ|y) = (1/√(2πσ²)) × exp(−(xᵢ−μ)²/(2σ²))

Gaussian PDF where μ is mean and σ² is variance of feature for class y

gaussian_nb_fit.py
def fit(self, X, y):
    """Train Gaussian NB: compute mean and variance per class."""
    X, y = np.array(X), np.array(y).flatten()
    self.classes_ = np.unique(y)
    n_features = X.shape[1]
    n_classes = len(self.classes_)

    # Initialize parameters
    self.theta_ = np.zeros((n_classes, n_features))   # means
    self.sigma_ = np.zeros((n_classes, n_features))   # variances
    self.class_prior_ = np.zeros(n_classes)

    for idx, c in enumerate(self.classes_):
        X_c = X[y == c]
        self.theta_[idx, :] = X_c.mean(axis=0)               # mean μ
        self.sigma_[idx, :] = X_c.var(axis=0)                # variance σ²
        self.class_prior_[idx] = X_c.shape[0] / X.shape[0]   # P(y)

    # Add smoothing to avoid division by zero
    self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()
    self.sigma_ += self.epsilon_

For prediction, we compute the log likelihood to avoid numerical underflow (multiplying many small probabilities leads to tiny numbers):

gaussian_nb_predict.py
def predict(self, X):
    """Predict using log probabilities for numerical stability."""
    X = np.array(X)
    log_probs = np.zeros((X.shape[0], len(self.classes_)))

    for idx in range(len(self.classes_)):
        # Log prior
        log_prior = np.log(self.class_prior_[idx])

        # Log of Gaussian PDF
        log_likelihood = -0.5 * np.sum(np.log(2.0 * np.pi * self.sigma_[idx, :]))
        log_likelihood -= 0.5 * np.sum(
            ((X - self.theta_[idx, :]) ** 2) / self.sigma_[idx, :],
            axis=1
        )

        log_probs[:, idx] = log_prior + log_likelihood

    return self.classes_[np.argmax(log_probs, axis=1)]

Interactive: Gaussian Naive Bayes Classification

Move the slider to classify a flower based on petal length using Gaussian distributions

In the demo, Setosa petal lengths are modeled as a Gaussian with μ = 2, σ = 0.8 and Versicolor with μ = 5, σ = 1.2. For a flower with petal length 3.0 cm:

P(x | Setosa) ≈ 0.2283
P(x | Versicolor) ≈ 0.0829

Assuming equal priors, the posteriors come out to 73.4% for Setosa and 26.6% for Versicolor, so the prediction is Setosa.
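
You can verify those likelihoods by evaluating the Gaussian PDF yourself; here I'm using scipy.stats.norm for convenience, but plugging into the formula above gives the same numbers.

from scipy.stats import norm

x = 3.0  # petal length in cm

p_setosa = norm.pdf(x, loc=2.0, scale=0.8)       # ≈ 0.2283
p_versicolor = norm.pdf(x, loc=5.0, scale=1.2)   # ≈ 0.0829

# With equal priors, the posterior is just the normalized likelihood
posterior_setosa = p_setosa / (p_setosa + p_versicolor)

print(round(p_setosa, 4), round(p_versicolor, 4))  # 0.2283 0.0829
print(round(posterior_setosa, 3))                  # 0.734 -> predict Setosa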

Why Log Probabilities?

Multiplying many probabilities (like 0.01 × 0.02 × 0.03...) quickly leads to underflow—numbers too small to represent. Taking logs converts multiplication to addition: log(a×b) = log(a) + log(b). Adding log probabilities is numerically stable.
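
A quick demonstration of why this matters:

import numpy as np

probs = np.full(500, 0.01)      # 500 likelihoods of 0.01 each

print(np.prod(probs))           # 0.0 -- underflows, all information lost
print(np.sum(np.log(probs)))    # ≈ -2302.6 -- perfectly usable for argmax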

Multinomial Naive Bayes

For count data

When features represent counts (like word frequencies in documents), we use the multinomial distribution. This is the go-to for text classification.

P(xᵢ|y) = (N_yi + α) / (N_y + αn)

N_yi = count of feature i in class y, N_y = total count in class y, α = smoothing, n = number of features (the vocabulary size for text)

multinomial_nb_fit.py
def fit(self, X, y):
    """Train Multinomial NB with Laplace smoothing."""
    X, y = np.array(X), np.array(y).flatten()
    self.classes_ = np.unique(y)
    n_features = X.shape[1]

    self.feature_log_prob_ = np.zeros((len(self.classes_), n_features))
    self.class_prior_ = np.zeros(len(self.classes_))

    for idx, c in enumerate(self.classes_):
        X_c = X[y == c]

        # Feature counts with Laplace smoothing
        feature_count = X_c.sum(axis=0) + self.alpha
        total_count = feature_count.sum()

        # Store log probabilities for numerical stability
        self.feature_log_prob_[idx, :] = np.log(feature_count / total_count)
        self.class_prior_[idx] = X_c.shape[0] / X.shape[0]
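
Only the fit step is shown for this variant; a minimal predict sketch reusing the attributes stored above (classes_, class_prior_, feature_log_prob_) might look like this.

def predict(self, X):
    """Predict via argmax of log prior plus count-weighted log word probabilities."""
    X = np.array(X)

    # For every sample and class: log P(y) + sum_i x_i * log P(word_i | y)
    joint_log_likelihood = X @ self.feature_log_prob_.T + np.log(self.class_prior_)

    return self.classes_[np.argmax(joint_log_likelihood, axis=1)]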

Laplace Smoothing

The α parameter (usually α=1) prevents zero probabilities. Without smoothing, if a word never appears in spam emails during training, any email containing that word would have P(X|spam) = 0, making it impossible to classify as spam regardless of other strong spam indicators.

With smoothing, even unseen features get a small non-zero probability

Example: Spam Detection

Training data:
  Spam: “buy now”, “free money”, “click now”
  Ham: “meeting tomorrow”, “lunch plans”
Word counts in spam:
  “buy”: 1, “now”: 2, “free”: 1, “money”: 1, “click”: 1 → Total: 6
With α=1 and a vocabulary of 9 unique words:
  P(“now”|spam) = (2+1)/(6+9) = 3/15 = 0.20
  P(“meeting”|spam) = (0+1)/(6+9) = 1/15 ≈ 0.07 (smoothed!)
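
These smoothed estimates are easy to reproduce in code; the helper name below is just for illustration.

from collections import Counter

spam_docs = ["buy now", "free money", "click now"]
ham_docs = ["meeting tomorrow", "lunch plans"]

spam_counts = Counter(w for doc in spam_docs for w in doc.split())
vocab = {w for doc in spam_docs + ham_docs for w in doc.split()}

alpha = 1
total_spam_words = sum(spam_counts.values())  # 6

def p_word_given_spam(word):
    return (spam_counts[word] + alpha) / (total_spam_words + alpha * len(vocab))

print(len(vocab))                    # 9 unique words
print(p_word_given_spam("now"))      # 3/15 = 0.20
print(p_word_given_spam("meeting"))  # 1/15 ≈ 0.067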

Bernoulli Naive Bayes

For binary features

When features are binary (present/absent), we use the Bernoulli distribution.

P(xᵢ|y) = p^xᵢ × (1−p)^(1−xᵢ)

p = P(feature present | class y)

This might look cryptic, but it's elegant: if the feature is present (xᵢ=1), the probability is p; if it's absent (xᵢ=0), it's 1−p. The formula handles both cases in one expression.

bernoulli_nb_fit.py
def fit(self, X, y):
    """Train Bernoulli NB: estimate P(feature=1|class)."""
    X, y = np.array(X), np.array(y).flatten()
    self.classes_ = np.unique(y)
    n_features = X.shape[1]

    # Store both log P(x=1|y) and log P(x=0|y)
    self.feature_log_prob_ = np.zeros((len(self.classes_), 2, n_features))
    self.class_prior_ = np.zeros(len(self.classes_))

    for idx, c in enumerate(self.classes_):
        X_c = X[y == c]

        # Probability of feature being 1 (with smoothing)
        feature_count = X_c.sum(axis=0) + self.alpha
        total_count = X_c.shape[0] + 2 * self.alpha
        feature_prob = feature_count / total_count

        # Store log P(x=1|y) and log P(x=0|y)
        self.feature_log_prob_[idx, 0, :] = np.log(feature_prob)      # log P(1|y)
        self.feature_log_prob_[idx, 1, :] = np.log(1 - feature_prob)  # log P(0|y)
        self.class_prior_[idx] = X_c.shape[0] / X.shape[0]

bernoulli_nb_predict.py
def predict(self, X):
    """Predict using both presence AND absence of features."""
    X = np.array(X)
    log_probs = np.zeros((X.shape[0], len(self.classes_)))

    for idx in range(len(self.classes_)):
        log_prior = np.log(self.class_prior_[idx])

        # Key difference: considers BOTH presence and absence
        log_likelihood = np.sum(
            X * self.feature_log_prob_[idx, 0, :] +        # when x=1
            (1 - X) * self.feature_log_prob_[idx, 1, :],   # when x=0
            axis=1
        )

        log_probs[:, idx] = log_prior + log_likelihood

    return self.classes_[np.argmax(log_probs, axis=1)]

Bernoulli vs Multinomial

The key difference: Bernoulli considers both presence AND absence of features. If “free” is typically present in spam but absent in ham, Bernoulli uses that information both ways. Multinomial only considers presence (counts).
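
A tiny illustration with made-up numbers: say “free” appears in 80% of spam and 5% of ham, and we score an email where the word is absent. Bernoulli still extracts evidence from that absence; Multinomial contributes nothing for a zero count.

import numpy as np

# Hypothetical single-word model: P("free" present | class)
p_free_spam, p_free_ham = 0.80, 0.05

x = 0  # "free" is absent in this email

# Bernoulli log-likelihood: x*log(p) + (1-x)*log(1-p)
bern_spam = x * np.log(p_free_spam) + (1 - x) * np.log(1 - p_free_spam)  # log 0.20 ≈ -1.61
bern_ham = x * np.log(p_free_ham) + (1 - x) * np.log(1 - p_free_ham)     # log 0.95 ≈ -0.05

print(round(bern_spam, 2), round(bern_ham, 2))  # absence of "free" favors ham

# Multinomial contribution is x * log P(word|y) = 0 for both classes,
# so an absent word carries no evidence either way.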

Which Variant to Use?

Matching variants to data

The variant of Naive Bayes you should implement depends a lot on the shape of your data. For example, it wouldn't make any sense to use a Bernoulli distribution when you have continuous data like age or salary.

Interactive: Compare Naive Bayes Variants

Click each variant to see how it handles different data types

The demo walks through each variant in turn. The Gaussian tab, for instance, handles continuous features like height, weight, or temperature, and models an Iris petal-length example with Setosa at μ = 1.5 cm, σ = 0.2 and Versicolor at μ = 4.3 cm, σ = 0.5.

Variant     | Feature Type | Best For
Gaussian    | Continuous   | Iris, medical measurements, sensor data
Multinomial | Counts       | Text classification, word frequencies
Bernoulli   | Binary       | Document classification, feature presence

Naive Bayes in Practice

Real-world considerations

Here are some things I found that you should take note of before applying this algorithm.

Limitations

• Independence assumption often badly violated
• Sensitive to irrelevant features
• Can't capture feature interactions
• Probability estimates often poorly calibrated

Strengths

• Excellent baseline: always try it first
• Works surprisingly well for text
• Handles minimal data without overfitting
• Real-time classification (extremely fast)
1. It's an excellent baseline

Always try Naive Bayes first. If your fancy deep learning model can't beat it, something's wrong.

2. It works surprisingly well for text

Despite obvious word dependencies, Naive Bayes is competitive with much more complex models for text classification. It's the default choice for spam filters.

3. Works with minimal data

When you have 50 training examples, Naive Bayes might be your only option that doesn't overfit.

4. Real-time classification

The speed makes it perfect for applications needing instant predictions on streaming data.

5. Feature selection matters

Since all features are multiplied together, irrelevant features can hurt performance. Remove noise.

The Bottom Line

Naive Bayes is fast, simple, and often surprisingly effective. Use it as your first baseline for classification tasks, especially text classification. If it performs well, you might not need anything more complex.

Happy coding, and don't be naive out there.