UML-05: Gaussian Mixture Models

Summary
Master GMM, the 'Explorer in the Fog' of clustering: learn how to map overlapping 'hills' of data, handle uncertainty with soft clustering, and fit the model with the EM algorithm.

Learning Objectives

After reading this post, you will be able to:

  • Understand Gaussian Mixture Models as a probabilistic clustering approach
  • Know the difference between hard and soft clustering
  • Understand the Expectation-Maximization (EM) algorithm
  • Choose between GMM and K-Means for your clustering tasks

Theory

The Intuition: The Explorer in the Fog

  • K-Means (Clear Day): You are mapping islands in bright sunlight. The boundaries are sharp. You are either on Island A or Island B.
  • GMM (Foggy Terrain): You are mapping hills in a thick fog. You can’t see the edges clearly. You stand at a point and say: “I am 70% sure I’m on Hill A, but there’s a 30% chance I’m actually on the lower slope of Hill B.”

Gaussian Mixture Models (GMM) embrace this uncertainty using soft clustering.

graph TD
    subgraph "K-Means (Hard Borders)"
        A1["Point X"] --> B1["Cluster 1 ✓"]
        A1 -.-> C1["Cluster 2 ✗"]
    end
    subgraph "GMM (Soft Gradients)"
        A2["Point X"] --> B2["Cluster 1: 70%"]
        A2 --> C2["Cluster 2: 30%"]
    end
    style B1 fill:#c8e6c9
    style B2 fill:#c8e6c9
    style C2 fill:#fff9c4
Key insight: GMM models the data as coming from a mixture of several Gaussian (normal) distributions, each representing a cluster.

The Gaussian Mixture Model

A GMM assumes data is generated from $K$ Gaussian components:

$$p(x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x | \mu_k, \Sigma_k)$$

where:

  • $\pi_k$ = mixing coefficient (weight of cluster $k$, $\sum_k \pi_k = 1$)
  • $\mu_k$ = mean of cluster $k$
  • $\Sigma_k$ = covariance matrix of cluster $k$
  • $\mathcal{N}(x | \mu, \Sigma)$ = Gaussian probability density

Think of it as a recipe: A GMM is like a smoothie made of $K$ different fruits.

  • The Fruit ($\mathcal{N}$): The base flavor (the shape of the distribution).
  • The Amount ($\pi_k$): How much of each fruit you put in (e.g., 70% Strawberry, 30% Banana).
  • The Result ($p(x)$): The final mixture that describes your entire dataset.
GMM as mixture of Gaussians
GMM: Each cluster is a Gaussian distribution with its own mean, covariance, and weight
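To make the density formula above concrete, here is a minimal sketch (not from the original post) that evaluates $p(x)$ for a hypothetical two-component, two-dimensional mixture with SciPy; the weights, means, and covariances are made-up illustration values.

🐍 Python
import numpy as np
from scipy.stats import multivariate_normal

# A made-up 2-component, 2-D mixture, for illustration only
weights = np.array([0.7, 0.3])                        # pi_k (must sum to 1)
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]  # mu_k
covs = [np.eye(2), 2.0 * np.eye(2)]                   # Sigma_k

def mixture_density(x):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)"""
    return sum(w * multivariate_normal(mean=m, cov=c).pdf(x)
               for w, m, c in zip(weights, means, covs))

print(mixture_density(np.array([0.0, 0.0])))  # high: right on the first peak
print(mixture_density(np.array([2.0, 2.0])))  # low: in the "valley" between peaks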

GMM vs K-Means

| Aspect | K-Means | GMM |
| --- | --- | --- |
| Assignment | Hard (0 or 1) | Soft (probabilities) |
| Cluster shape | Spherical only | Elliptical (any covariance) |
| Model | Distance-based | Probabilistic |
| Parameters | Centroids only | Mean, covariance, weights |
| Output | Cluster labels | Probabilities + labels |

The EM Algorithm

GMM parameters are learned using the Expectation-Maximization (EM) algorithm:

graph LR
    subgraph Loop ["The Surveyor's Loop (EM Algorithm)"]
        direction LR
        A["Guess Initial Locations"] --> B["E-Step: Estimate Membership\n(Where are the points?)"]
        B --> C["M-Step: Update Map\n(Move peaks to center of mass)"]
        C --> D{"Map Stabilized?"}
        D -->|No| B
        D -->|Yes| E["Final Map Created"]
    end
    style B fill:#e1f5fe
    style C fill:#fff9c4
    style E fill:#c8e6c9

E-Step: The “Where am I?” Phase

We look at every point and ask: “Given our current map of the hills, how likely is it that you belong to Hill A vs Hill B?”

  • We calculate the Responsibility ($\gamma_{ik}$): The probability that point $i$ belongs to cluster $k$.
  • If a point is right in the middle of Hill A’s peak, it gets a high probability for A (e.g., 0.99).
  • If it’s in the foggy valley between A and B, it might get split (0.5 for A, 0.5 for B).

$$\gamma_{ik} = \frac{\pi_k \cdot \mathcal{N}(x_i | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \cdot \mathcal{N}(x_i | \mu_j, \Sigma_j)}$$
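As a hand-rolled sketch of this formula (not sklearn's internal implementation), assuming current guesses for the weights, means, and covariances:

🐍 Python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Responsibilities gamma[i, k] for the current guesses of pi_k, mu_k, Sigma_k."""
    K = len(weights)
    # Numerator: pi_k * N(x_i | mu_k, Sigma_k) for every point and component
    gamma = np.column_stack([
        weights[k] * multivariate_normal(mean=means[k], cov=covs[k]).pdf(X)
        for k in range(K)
    ])
    # Denominator: normalize so each point's responsibilities sum to 1
    return gamma / gamma.sum(axis=1, keepdims=True)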

M-Step: The “Move the Map” Phase

Now that we have a better idea of who belongs where, we update the map boundaries to fit the data better.

  1. Update Weights ($\pi_k$): Did Hill A get more points assigned to it? Make Hill A “bigger” (increase weight).
  2. Update Means ($\mu_k$): Calculate the weighted average of all points assigned to Hill A. Move the peak to this new center.
  3. Update Covariances ($\Sigma_k$): Did the points spread out? Widen the hill. Did they bunch up? Narrow the hill.
The Cycle: We repeat this “Guess -> Check -> Update” loop until the map stops changing (convergence). Each iteration never decreases the likelihood, so EM is guaranteed to reach a local optimum (though not necessarily the global one). A minimal sketch of the M-step updates follows.
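Continuing the hand-rolled sketch (again an illustration, not sklearn's implementation), the three updates could look like this:

🐍 Python
import numpy as np

def m_step(X, gamma):
    """Re-estimate pi_k, mu_k, Sigma_k from the responsibilities."""
    n, d = X.shape
    Nk = gamma.sum(axis=0)                    # effective number of points per cluster
    weights = Nk / n                          # 1. update mixing weights pi_k
    means = (gamma.T @ X) / Nk[:, None]       # 2. responsibility-weighted means mu_k
    covs = []
    for k in range(gamma.shape[1]):           # 3. responsibility-weighted covariances Sigma_k
        diff = X - means[k]
        covs.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return weights, means, covs

Alternating e_step and m_step until the log-likelihood stops improving is the entire EM loop.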

Covariance: The Shape of the Hill

Each cluster (hill) can have a different shape. The covariance_type parameter defines what these hills can look like:

| Type | Analogy | Shape Flexibility | Parameters |
| --- | --- | --- | --- |
| spherical | Perfect Dome | Round only. Width is the same in all directions. | Lowest (Simple) |
| diag | Oval Dome | Stretched along the axes (North-South or East-West), but not tilted. | Medium |
| full | Any Hill | Any shape, any tilt. Can be long, thin, and diagonal. | Highest (Complex) |
| tied | Clone Hills | All hills must have the exact same shape and size. | High |
GMM covariance types
Different covariance types: spherical (circles), diagonal (axis-aligned ellipses), full (any ellipse)
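One quick way to see the difference in complexity is to inspect the shape of the fitted covariances_ array for each covariance_type. The snippet below (using a fresh make_blobs dataset as an assumption, not data from the original post) prints those shapes:

🐍 Python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Fresh toy data, used only for this comparison
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

for cov_type in ['spherical', 'diag', 'tied', 'full']:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=42).fit(X_demo)
    # spherical: (K,)   diag: (K, d)   tied: (d, d)   full: (K, d, d)
    print(f"{cov_type:>9}: covariances_ shape = {gmm.covariances_.shape}")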

Model Selection: Choosing K

Unlike K-Means, GMM provides principled ways to choose $K$:

1. AIC (Akaike Information Criterion) $$AIC = 2k - 2\ln(\hat{L})$$

2. BIC (Bayesian Information Criterion) $$BIC = k\ln(N) - 2\ln(\hat{L})$$

where $k$ = number of parameters, $\hat{L}$ = maximum likelihood.

Occam’s Razor: BIC penalizes complexity ($k \ln N$) more strongly than AIC ($2k$). Use BIC if you want a simpler model with fewer clusters. Use AIC if you care more about fitting the data perfectly, even if it’s complex.
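As a sanity check, AIC and BIC can be recomputed by hand from the log-likelihood and a manual parameter count. The sketch below assumes the gmm and X fitted with covariance_type='full' in the Code Practice section further down:

🐍 Python
import numpy as np

n, d = X.shape                 # X and gmm come from the Code Practice section below
K = gmm.n_components           # fitted with covariance_type='full'

# Free parameters: K*d means + K*d*(d+1)/2 covariance entries + (K-1) mixing weights
k_params = K * d + K * d * (d + 1) // 2 + (K - 1)

log_L = gmm.score(X) * n       # score() returns the mean log-likelihood per sample
aic = 2 * k_params - 2 * log_L
bic = k_params * np.log(n) - 2 * log_L

print(f"manual AIC = {aic:.1f}  vs  gmm.aic(X) = {gmm.aic(X):.1f}")
print(f"manual BIC = {bic:.1f}  vs  gmm.bic(X) = {gmm.bic(X):.1f}")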

Code Practice

GMM Clustering with sklearn

🐍 Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Generate sample data
np.random.seed(42)
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 1.5, 0.5], random_state=42)

# Fit GMM
# n_components=3: We expect 3 distinct "hills"
# covariance_type='full': The hills can be any shape (tilted, narrow, wide)
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)
labels = gmm.predict(X)
probs = gmm.predict_proba(X)

print("=" * 50)
print("GMM CLUSTERING RESULTS")
print("=" * 50)
print(f"\n📊 Cluster sizes: {np.bincount(labels)}")
print(f"📈 Mixing weights: {gmm.weights_.round(3)}")
print(f"📐 Converged: {gmm.converged_}")
print(f"🔄 Iterations: {gmm.n_iter_}")

Output:

==================================================
GMM CLUSTERING RESULTS
==================================================

📊 Cluster sizes: [100 100 100]
📈 Mixing weights: [0.333 0.333 0.333]
📐 Converged: True
🔄 Iterations: 2

Interpreting the Results

  • Mixing Weights: The “volume” of each hill on the map.
  • Means: The peak location of each hill.
  • Converged: The surveyor stopped updating the map because it wasn’t changing anymore.

Soft Clustering Visualization

🐍 Python
# Visualize soft assignments
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
colors = ['#e74c3c', '#3498db', '#2ecc71']

# Hard clustering
for k in range(3):
    mask = labels == k
    axes[0].scatter(X[mask, 0], X[mask, 1], c=colors[k], alpha=0.6, s=40, label=f'Cluster {k+1}')
axes[0].scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='black', marker='X', s=200, edgecolors='white')
axes[0].set_title('Hard Assignment (predict)', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Soft clustering - color by max probability
max_prob = probs.max(axis=1)
scatter = axes[1].scatter(X[:, 0], X[:, 1], c=max_prob, cmap='RdYlGn', alpha=0.6, s=40)
axes[1].scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='black', marker='X', s=200, edgecolors='white')
plt.colorbar(scatter, ax=axes[1], label='Max Probability')
axes[1].set_title('Soft Assignment (probability)', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('assets/gmm_soft_clustering.png', dpi=150)
plt.show()
GMM soft vs hard clustering
Left: Hard assignments. Right: Colored by confidence — uncertain points (yellow) are near cluster boundaries

Comparing Covariance Types

🐍 Python
covariance_types = ['spherical', 'diag', 'tied', 'full']

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for ax, cov_type in zip(axes, covariance_types):
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=42)
    labels = gmm.fit_predict(X)
    
    for k in range(3):
        mask = labels == k
        ax.scatter(X[mask, 0], X[mask, 1], c=colors[k], alpha=0.6, s=30)
    
    ax.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='black', marker='X', s=150, edgecolors='white')
    ax.set_title(f'{cov_type.capitalize()} Covariance\nBIC: {gmm.bic(X):.1f}', 
                 fontsize=11, fontweight='bold')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('assets/covariance_types.png', dpi=150)
plt.show()
Different covariance types
Different covariance types produce different cluster shapes; BIC helps select the best

Model Selection with BIC

🐍 Python
# Find optimal number of components
n_components_range = range(1, 10)
bics = []
aics = []

for n in n_components_range:
    gmm = GaussianMixture(n_components=n, random_state=42)
    gmm.fit(X)
    bics.append(gmm.bic(X))
    aics.append(gmm.aic(X))

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(n_components_range, bics, 'bo-', label='BIC', linewidth=2, markersize=8)
ax.plot(n_components_range, aics, 'ro-', label='AIC', linewidth=2, markersize=8)
ax.axvline(x=3, color='green', linestyle='--', label='Optimal K=3')
ax.set_xlabel('Number of Components', fontsize=11)
ax.set_ylabel('Information Criterion', fontsize=11)
ax.set_title('Model Selection: AIC vs BIC', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('assets/gmm_model_selection.png', dpi=150)
plt.show()

print(f"\n📊 Optimal K (BIC): {np.argmin(bics) + 1}")
print(f"📊 Optimal K (AIC): {np.argmin(aics) + 1}")
AIC and BIC for model selection
Both AIC and BIC reach minimum at K=3, confirming our true number of clusters

Deep Dive

When to Use GMM vs K-Means

| Scenario | Recommendation |
| --- | --- |
| Need probability estimates | GMM |
| Elliptical clusters | GMM (full covariance) |
| Only spherical clusters | K-Means (faster) |
| Large datasets | K-Means (more scalable) |
| Need principled K selection | GMM (use BIC) |
| Overlapping clusters | GMM |

GMM Limitations

  1. Sensitive to initialization — use multiple restarts (n_init; see the sketch after this list)
  2. Can fail with few samples — needs enough data per cluster
  3. Covariance estimation issues — may hit singular matrices
  4. Assumes Gaussian clusters — fails on non-Gaussian shapes
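For the first limitation, a common mitigation is to let sklearn restart EM several times and keep the best run (by log-likelihood); a minimal sketch:

🐍 Python
from sklearn.mixture import GaussianMixture

# Run EM from 10 random initializations and keep the best solution
gmm = GaussianMixture(n_components=3, n_init=10, random_state=42)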

Frequently Asked Questions

Q1: How do I handle singular covariance matrices?

Options:

  • Add regularization: reg_covar=1e-6
  • Use simpler covariance: covariance_type='diag'
  • Get more data per cluster
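A minimal sketch of the first two options (the reg_covar value here is only an illustrative choice, larger than sklearn's 1e-6 default):

🐍 Python
from sklearn.mixture import GaussianMixture

stable_gmm = GaussianMixture(
    n_components=3,
    covariance_type='diag',   # fewer covariance parameters per cluster
    reg_covar=1e-4,           # larger than the 1e-6 default, for extra numerical stability
    random_state=42,
)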

Q2: Can GMM detect outliers?

Yes! Points with low probability under all components are potential outliers:

log_probs = gmm.score_samples(X)            # log-density of each point under the fitted mixture
threshold = np.percentile(log_probs, 2)     # e.g., flag the lowest 2% (the cutoff is your choice)
outliers = log_probs < threshold

Q3: What if clusters aren’t Gaussian?

Consider:

  • DBSCAN for arbitrary shapes
  • Kernel density estimation
  • Mixture of other distributions
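For example, a minimal DBSCAN sketch (the eps and min_samples values are illustrative and need tuning for real data):

🐍 Python
from sklearn.cluster import DBSCAN

db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # label -1 = noise/outlier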

Summary

| Concept | Key Points |
| --- | --- |
| GMM | Probabilistic soft clustering with Gaussians |
| EM Algorithm | E-step (responsibilities) + M-step (parameters) |
| Covariance types | spherical, diag, tied, full |
| Model selection | Use BIC/AIC (lower is better) |
| vs K-Means | Softer, more flexible, but slower |

References

  • sklearn GMM Documentation
  • Bishop, C. (2006). “Pattern Recognition and Machine Learning” - Chapter 9
  • Dempster, A.P. et al. (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm”