UML-05: Gaussian Mixture Models

Summary
Master GMM, the 'Explorer in the Fog' of clustering: learn how to map overlapping 'hills' of data, handle uncertainty with soft clustering, and fit the model with the EM algorithm.

Learning Objectives

After reading this post, you will be able to:

  • Understand Gaussian Mixture Models as a probabilistic clustering approach
  • Know the difference between hard and soft clustering
  • Understand the Expectation-Maximization (EM) algorithm
  • Choose between GMM and K-Means for your clustering tasks

Theory

The Intuition: The Explorer in the Fog

  • K-Means (Clear Day): You are mapping islands in bright sunlight. The boundaries are sharp. You are either on Island A or Island B.
  • GMM (Foggy Terrain): You are mapping hills in a thick fog. You can’t see the edges clearly. You stand at a point and say: “I am 70% sure I’m on Hill A, but there’s a 30% chance I’m actually on the lower slope of Hill B.”

Gaussian Mixture Models (GMM) embrace this uncertainty using soft clustering.

graph TD
    subgraph "K-Means (Hard Borders)"
        A1["Point X"] --> B1["Cluster 1 ✓"]
        A1 -.-> C1["Cluster 2 ✗"]
    end
    subgraph "GMM (Soft Gradients)"
        A2["Point X"] --> B2["Cluster 1: 70%"]
        A2 --> C2["Cluster 2: 30%"]
    end
    style B1 fill:#c8e6c9
    style B2 fill:#c8e6c9
    style C2 fill:#fff9c4
Key insight: GMM models the data as coming from a mixture of several Gaussian (normal) distributions, each representing a cluster.

The Gaussian Mixture Model

A GMM assumes data is generated from $K$ Gaussian components:

$$p(x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x | \mu_k, \Sigma_k)$$

where:

  • $\pi_k$ = mixing coefficient (weight of cluster $k$, $\sum_k \pi_k = 1$)
  • $\mu_k$ = mean of cluster $k$
  • $\Sigma_k$ = covariance matrix of cluster $k$
  • $\mathcal{N}(x | \mu, \Sigma)$ = Gaussian probability density

Think of it as a recipe: A GMM is like a smoothie made of $K$ different fruits.

  • The Fruit ($\mathcal{N}$): The base flavor (the shape of the distribution).
  • The Amount ($\pi_k$): How much of each fruit you put in (e.g., 70% Strawberry, 30% Banana).
  • The Result ($p(x)$): The final mixture that describes your entire dataset.
GMM as mixture of Gaussians
GMM: Each cluster is a Gaussian distribution with its own mean, covariance, and weight
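To make the density formula above concrete, here is a minimal sketch (not from the original post) that evaluates $p(x)$ for a hypothetical two-component, two-dimensional mixture with SciPy; the weights, means, and covariances are made-up illustration values.

🐍 Python
import numpy as np
from scipy.stats import multivariate_normal

# A made-up 2-component, 2-D mixture, for illustration only
weights = np.array([0.7, 0.3])                        # pi_k (must sum to 1)
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]  # mu_k
covs = [np.eye(2), 2.0 * np.eye(2)]                   # Sigma_k

def mixture_density(x):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)"""
    return sum(w * multivariate_normal(mean=m, cov=c).pdf(x)
               for w, m, c in zip(weights, means, covs))

print(mixture_density(np.array([0.0, 0.0])))  # high: right on the first peak
print(mixture_density(np.array([2.0, 2.0])))  # low: in the "valley" between peaks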

GMM vs K-Means

| Aspect | K-Means | GMM |
| --- | --- | --- |
| Assignment | Hard (0 or 1) | Soft (probabilities) |
| Cluster shape | Spherical only | Elliptical (any covariance) |
| Model | Distance-based | Probabilistic |
| Parameters | Centroids only | Mean, covariance, weights |
| Output | Cluster labels | Probabilities + labels |

The EM Algorithm

GMM parameters are learned using the Expectation-Maximization (EM) algorithm:

graph LR
    subgraph Loop ["The Surveyor's Loop (EM Algorithm)"]
        direction LR
        A["Guess Initial Locations"] --> B["E-Step: Estimate Membership\n(Where are the points?)"]
        B --> C["M-Step: Update Map\n(Move peaks to center of mass)"]
        C --> D{"Map Stabilized?"}
        D -->|No| B
        D -->|Yes| E["Final Map Created"]
    end
    style B fill:#e1f5fe
    style C fill:#fff9c4
    style E fill:#c8e6c9

E-Step: The “Where am I?” Phase

We look at every point and ask: “Given our current map of the hills, how likely is it that you belong to Hill A vs Hill B?”

  • We calculate the Responsibility ($\gamma_{ik}$): The probability that point $i$ belongs to cluster $k$.
  • If a point is right in the middle of Hill A’s peak, it gets a high probability for A (e.g., 0.99).
  • If it’s in the foggy valley between A and B, it might get split (0.5 for A, 0.5 for B).

$$\gamma_{ik} = \frac{\pi_k \cdot \mathcal{N}(x_i | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \cdot \mathcal{N}(x_i | \mu_j, \Sigma_j)}$$
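As a hand-rolled sketch of this formula (not sklearn's internal implementation), assuming current guesses for the weights, means, and covariances:

🐍 Python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Responsibilities gamma[i, k] for the current guesses of pi_k, mu_k, Sigma_k."""
    K = len(weights)
    # Numerator: pi_k * N(x_i | mu_k, Sigma_k) for every point and component
    gamma = np.column_stack([
        weights[k] * multivariate_normal(mean=means[k], cov=covs[k]).pdf(X)
        for k in range(K)
    ])
    # Denominator: normalize so each point's responsibilities sum to 1
    return gamma / gamma.sum(axis=1, keepdims=True)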

M-Step: The “Move the Map” Phase

Now that we have a better idea of who belongs where, we update the map boundaries to fit the data better.

  1. Update Weights ($\pi_k$): Did Hill A get more points assigned to it? Make Hill A “bigger” (increase weight).
  2. Update Means ($\mu_k$): Calculate the weighted average of all points assigned to Hill A. Move the peak to this new center.
  3. Update Covariances ($\Sigma_k$): Did the points spread out? Widen the hill. Did they bunch up? Narrow the hill.
The Cycle: We repeat this “Guess -> Check -> Update” loop until the map stops changing (convergence). Each iteration never decreases the likelihood, so EM is guaranteed to reach a local optimum (though not necessarily the global one). A minimal sketch of the M-step updates follows.
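Continuing the hand-rolled sketch (again an illustration, not sklearn's implementation), the three updates could look like this:

🐍 Python
import numpy as np

def m_step(X, gamma):
    """Re-estimate pi_k, mu_k, Sigma_k from the responsibilities."""
    n, d = X.shape
    Nk = gamma.sum(axis=0)                    # effective number of points per cluster
    weights = Nk / n                          # 1. update mixing weights pi_k
    means = (gamma.T @ X) / Nk[:, None]       # 2. responsibility-weighted means mu_k
    covs = []
    for k in range(gamma.shape[1]):           # 3. responsibility-weighted covariances Sigma_k
        diff = X - means[k]
        covs.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return weights, means, covs

Alternating e_step and m_step until the log-likelihood stops improving is the entire EM loop.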

Covariance: The Shape of the Hill

Each cluster (hill) can have a different shape. The covariance_type parameter defines what these hills can look like:

| Type | Analogy | Shape Flexibility | Parameters |
| --- | --- | --- | --- |
| spherical | Perfect Dome | Round only. Width is the same in all directions. | Lowest (Simple) |
| diag | Oval Dome | Stretched along the axes (North-South or East-West), but not tilted. | Medium |
| full | Any Hill | Any shape, any tilt. Can be long, thin, and diagonal. | Highest (Complex) |
| tied | Clone Hills | All hills must have the exact same shape and size. | High |
GMM covariance types
Different covariance types: spherical (circles), diagonal (axis-aligned ellipses), full (any ellipse)
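One quick way to see the difference in complexity is to inspect the shape of the fitted covariances_ array for each covariance_type. The snippet below (using a fresh make_blobs dataset as an assumption, not data from the original post) prints those shapes:

🐍 Python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Fresh toy data, used only for this comparison
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

for cov_type in ['spherical', 'diag', 'tied', 'full']:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=42).fit(X_demo)
    # spherical: (K,)   diag: (K, d)   tied: (d, d)   full: (K, d, d)
    print(f"{cov_type:>9}: covariances_ shape = {gmm.covariances_.shape}")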

Model Selection: Choosing K

Unlike K-Means, GMM provides principled ways to choose $K$:

1. AIC (Akaike Information Criterion) $$AIC = 2k - 2\ln(\hat{L})$$

2. BIC (Bayesian Information Criterion) $$BIC = k\ln(N) - 2\ln(\hat{L})$$

where $k$ = number of parameters, $\hat{L}$ = maximum likelihood.

Occam’s Razor: BIC penalizes complexity ($k \ln N$) more strongly than AIC ($2k$). Use BIC if you want a simpler model with fewer clusters. Use AIC if you care more about fitting the data perfectly, even if it’s complex.
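As a sanity check, AIC and BIC can be recomputed by hand from the log-likelihood and a manual parameter count. The sketch below assumes the gmm and X fitted with covariance_type='full' in the Code Practice section further down:

🐍 Python
import numpy as np

n, d = X.shape                 # X and gmm come from the Code Practice section below
K = gmm.n_components           # fitted with covariance_type='full'

# Free parameters: K*d means + K*d*(d+1)/2 covariance entries + (K-1) mixing weights
k_params = K * d + K * d * (d + 1) // 2 + (K - 1)

log_L = gmm.score(X) * n       # score() returns the mean log-likelihood per sample
aic = 2 * k_params - 2 * log_L
bic = k_params * np.log(n) - 2 * log_L

print(f"manual AIC = {aic:.1f}  vs  gmm.aic(X) = {gmm.aic(X):.1f}")
print(f"manual BIC = {bic:.1f}  vs  gmm.bic(X) = {gmm.bic(X):.1f}")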

Code Practice

GMM Clustering with sklearn

🐍 Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Generate sample data
np.random.seed(42)
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 1.5, 0.5], random_state=42)

# Fit GMM
# n_components=3: We expect 3 distinct "hills"
# covariance_type='full': The hills can be any shape (tilted, narrow, wide)
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)
labels = gmm.predict(X)
probs = gmm.predict_proba(X)

print("=" * 50)
print("GMM CLUSTERING RESULTS")
print("=" * 50)
print(f"\n📊 Cluster sizes: {np.bincount(labels)}")
print(f"📈 Mixing weights: {gmm.weights_.round(3)}")
print(f"📐 Converged: {gmm.converged_}")
print(f"🔄 Iterations: {gmm.n_iter_}")

Output:

==================================================
GMM CLUSTERING RESULTS
==================================================

📊 Cluster sizes: [100 100 100]
📈 Mixing weights: [0.333 0.333 0.333]
📐 Converged: True
🔄 Iterations: 2

Interpreting the Results

  • Mixing Weights: The “volume” of each hill on the map.
  • Means: The peak location of each hill.
  • Converged: The surveyor stopped updating the map because it wasn’t changing anymore.

Soft Clustering Visualization

🐍 Python
# Visualize soft assignments
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
colors = ['#e74c3c', '#3498db', '#2ecc71']

# Hard clustering
for k in range(3):
    mask = labels == k
    axes[0].scatter(X[mask, 0], X[mask, 1], c=colors[k], alpha=0.6, s=40, label=f'Cluster {k+1}')
axes[0].scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='black', marker='X', s=200, edgecolors='white')
axes[0].set_title('Hard Assignment (predict)', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Soft clustering - color by max probability
max_prob = probs.max(axis=1)
scatter = axes[1].scatter(X[:, 0], X[:, 1], c=max_prob, cmap='RdYlGn', alpha=0.6, s=40)
axes[1].scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='black', marker='X', s=200, edgecolors='white')
plt.colorbar(scatter, ax=axes[1], label='Max Probability')
axes[1].set_title('Soft Assignment (probability)', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('assets/gmm_soft_clustering.png', dpi=150)
plt.show()
GMM soft vs hard clustering
Left: Hard assignments. Right: Colored by confidence — uncertain points (yellow) are near cluster boundaries

Comparing Covariance Types

🐍 Python
covariance_types = ['spherical', 'diag', 'tied', 'full']

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for ax, cov_type in zip(axes, covariance_types):
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=42)
    labels = gmm.fit_predict(X)
    
    for k in range(3):
        mask = labels == k
        ax.scatter(X[mask, 0], X[mask, 1], c=colors[k], alpha=0.6, s=30)
    
    ax.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c='black', marker='X', s=150, edgecolors='white')
    ax.set_title(f'{cov_type.capitalize()} Covariance\nBIC: {gmm.bic(X):.1f}', 
                 fontsize=11, fontweight='bold')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('assets/covariance_types.png', dpi=150)
plt.show()
Different covariance types
Different covariance types produce different cluster shapes; BIC helps select the best

Model Selection with BIC

🐍 Python
# Find optimal number of components
n_components_range = range(1, 10)
bics = []
aics = []

for n in n_components_range:
    gmm = GaussianMixture(n_components=n, random_state=42)
    gmm.fit(X)
    bics.append(gmm.bic(X))
    aics.append(gmm.aic(X))

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(n_components_range, bics, 'bo-', label='BIC', linewidth=2, markersize=8)
ax.plot(n_components_range, aics, 'ro-', label='AIC', linewidth=2, markersize=8)
ax.axvline(x=3, color='green', linestyle='--', label='Optimal K=3')
ax.set_xlabel('Number of Components', fontsize=11)
ax.set_ylabel('Information Criterion', fontsize=11)
ax.set_title('Model Selection: AIC vs BIC', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('assets/gmm_model_selection.png', dpi=150)
plt.show()

print(f"\n📊 Optimal K (BIC): {np.argmin(bics) + 1}")
print(f"📊 Optimal K (AIC): {np.argmin(aics) + 1}")
AIC and BIC for model selection
Both AIC and BIC reach minimum at K=3, confirming our true number of clusters

Deep Dive

When to Use GMM vs K-Means

| Scenario | Recommendation |
| --- | --- |
| Need probability estimates | GMM |
| Elliptical clusters | GMM (full covariance) |
| Only spherical clusters | K-Means (faster) |
| Large datasets | K-Means (more scalable) |
| Need principled K selection | GMM (use BIC) |
| Overlapping clusters | GMM |

GMM Limitations

  1. Sensitive to initialization — use multiple restarts (n_init; see the sketch after this list)
  2. Can fail with few samples — needs enough data per cluster
  3. Covariance estimation issues — may hit singular matrices
  4. Assumes Gaussian clusters — fails on non-Gaussian shapes
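For the first limitation, a common mitigation is to let sklearn restart EM several times and keep the best run (by log-likelihood); a minimal sketch:

🐍 Python
from sklearn.mixture import GaussianMixture

# Run EM from 10 random initializations and keep the best solution
gmm = GaussianMixture(n_components=3, n_init=10, random_state=42)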

Frequently Asked Questions

Q1: How do I handle singular covariance matrices?

Options:

  • Add regularization: reg_covar=1e-6
  • Use simpler covariance: covariance_type='diag'
  • Get more data per cluster
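A minimal sketch of the first two options (the reg_covar value here is only an illustrative choice, larger than sklearn's 1e-6 default):

🐍 Python
from sklearn.mixture import GaussianMixture

stable_gmm = GaussianMixture(
    n_components=3,
    covariance_type='diag',   # fewer covariance parameters per cluster
    reg_covar=1e-4,           # larger than the 1e-6 default, for extra numerical stability
    random_state=42,
)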

Q2: Can GMM detect outliers?

Yes! Points with low probability under all components are potential outliers:

log_probs = gmm.score_samples(X)            # log-density of each point under the fitted mixture
threshold = np.percentile(log_probs, 2)     # e.g., flag the lowest 2% (the cutoff is your choice)
outliers = log_probs < threshold

Q3: What if clusters aren’t Gaussian?

Consider:

  • DBSCAN for arbitrary shapes
  • Kernel density estimation
  • Mixture of other distributions
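For example, a minimal DBSCAN sketch (the eps and min_samples values are illustrative and need tuning for real data):

🐍 Python
from sklearn.cluster import DBSCAN

db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # label -1 = noise/outlier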

Summary

| Concept | Key Points |
| --- | --- |
| GMM | Probabilistic soft clustering with Gaussians |
| EM Algorithm | E-step (responsibilities) + M-step (parameters) |
| Covariance types | spherical, diag, tied, full |
| Model selection | Use BIC/AIC (lower is better) |
| vs K-Means | Softer, more flexible, but slower |

References

  • sklearn GMM Documentation
  • Bishop, C. (2006). “Pattern Recognition and Machine Learning” - Chapter 9
  • Dempster, A.P. et al. (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm”