ML-09: K-Nearest Neighbors (KNN)
Learning Objectives
- Understand the KNN algorithm and instance-based learning
- Master common distance metrics
- Learn how to choose the optimal K value
- Recognize the curse of dimensionality
Theory
What is K-Nearest Neighbors?
K-Nearest Neighbors (KNN) is one of the simplest and most intuitive machine learning algorithms. Unlike models that learn complex decision boundaries, KNN makes predictions based on a simple principle:
To classify a new point, find its K closest neighbors and let them vote.
```mermaid
graph LR
    A["❓ New Point"] --> B["Find K=3 Nearest Neighbors"]
    B --> C["🔵 Neighbor 1: Class A"]
    B --> D["🔵 Neighbor 2: Class A"]
    B --> E["🔴 Neighbor 3: Class B"]
    C --> F["🗳️ Vote: 2A vs 1B"]
    D --> F
    E --> F
    F --> G["✅ Predict: Class A"]
    style G fill:#c8e6c9
```
Lazy Learning vs Eager Learning
KNN is a lazy learner (also called instance-based learning):
| Aspect | Lazy Learning (KNN) | Eager Learning (Logistic Regression, SVM) |
|---|---|---|
| Training | Just store the data | Build a model from data |
| Prediction | Compute distances at query time | Use pre-built model |
| Memory | Stores entire dataset | Stores only model parameters |
| Adaptability | Easy to add new data | Need to retrain |
Distance Metrics
Since KNN relies on finding the “nearest” neighbors, how we measure distance between points is crucial. The choice of distance metric can significantly impact model performance.
Euclidean Distance (L2)
The most common metric — straight-line distance in n-dimensional space:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
| Example | $x$ | $y$ | Distance |
|---|---|---|---|
| 2D | (0, 0) | (3, 4) | $\sqrt{9 + 16} = 5$ |
| 3D | (1, 2, 3) | (4, 6, 3) | $\sqrt{9 + 16 + 0} = 5$ |
Manhattan Distance (L1)
Sum of absolute differences — distance along grid lines:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
| Example | $x$ | $y$ | Distance |
|---|---|---|---|
| 2D | (0, 0) | (3, 4) | $|3| + |4| = 7$ |
Chebyshev Distance (L∞)
The maximum absolute difference across any dimension:
$$d(x, y) = \max_{i} |x_i - y_i|$$
| Example | $x$ | $y$ | Distance |
|---|---|---|---|
| 2D | (0, 0) | (3, 4) | $\max(3, 4) = 4$ |
Minkowski Distance (Generalized)
Euclidean, Manhattan, and Chebyshev are all special cases of the Minkowski distance:
$$d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$$
| $p$ | Name | Unit Ball Shape | Description |
|---|---|---|---|
| 1 | Manhattan (L1) | ◇ Diamond | Sum of absolute differences |
| 2 | Euclidean (L2) | ○ Circle | Straight-line distance |
| ∞ | Chebyshev (L∞) | □ Square | Maximum single difference |
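To make the table concrete, here is a small sketch (using NumPy, an assumed tool choice) that reproduces the worked 2D example with points (0, 0) and (3, 4):

```python
import numpy as np

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

def minkowski(x, y, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print("Manhattan  (p=1):", minkowski(x, y, 1))        # 7.0
print("Euclidean  (p=2):", minkowski(x, y, 2))        # 5.0
print("Chebyshev (p=inf):", np.max(np.abs(x - y)))    # 4.0, the p -> infinity limit
```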

Comparison
| Metric | Best For | Sensitivity to Outliers |
|---|---|---|
| Euclidean | Default choice, continuous features | High |
| Manhattan | High-dimensional, sparse data | Lower |
| Chebyshev | When max difference matters | Very high |
The KNN Algorithm
With distance metrics covered, the next step is to see how KNN uses them to make predictions:
Input: Training data $\{(x_1, y_1), \dots, (x_N, y_N)\}$, query point $x_q$, number of neighbors $K$
Algorithm:
- Compute distance from $x_q$ to all training points
- Sort by distance (ascending)
- Select the $K$ nearest neighbors
- Classification: Return majority class among neighbors
- Regression: Return average (or weighted average) of neighbor values
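A minimal sketch of these steps for classification (the function name `predict_one` and the Euclidean metric are illustrative assumptions; a fuller, class-based version appears in the Code Practice section below):

```python
import numpy as np
from collections import Counter

def predict_one(X_train, y_train, x_q, k=3):
    """Classify a single query point by majority vote of its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_q, axis=1)   # 1. distances to all training points
    nearest = np.argsort(dists)[:k]                 # 2-3. indices of the k closest points
    votes = Counter(np.asarray(y_train)[nearest])   # 4. count class labels among neighbors
    return votes.most_common(1)[0][0]               # majority class wins
```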
Choosing K: The Bias-Variance Tradeoff
The choice of $K$ dramatically affects model behavior:
| K Value | Behavior | Risk |
|---|---|---|
| K = 1 | Very flexible, follows training data closely | Overfitting — sensitive to noise |
| K = N | Predicts majority class for everything | Underfitting — ignores local structure |
| K = √N | Rule of thumb starting point | Balance |
Practical guidelines:
- Use odd K for binary classification (avoids ties)
- Start with $K = \sqrt{N}$ or $K = 5$
- Cross-validate to find optimal K
- Typical range: 1-20
Weighted KNN
In standard KNN, all K neighbors have equal influence on the prediction. But intuitively, closer neighbors should matter more — this is the idea behind Weighted KNN:
$$\hat{y} = \frac{\sum_{i \in N_K} w_i \cdot y_i}{\sum_{i \in N_K} w_i}$$
where $w_i = \frac{1}{d(x_q, x_i)^2}$ (inverse distance weighting)
| Method | Closer Neighbors | Farther Neighbors |
|---|---|---|
| Uniform | Same weight | Same weight |
| Distance-weighted | High weight | Low weight |
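A minimal sketch of distance-weighted KNN regression using the $w_i = 1/d^2$ weighting above (note that scikit-learn's `weights="distance"` option uses $1/d$ rather than $1/d^2$, but the effect is similar):

```python
import numpy as np

def weighted_knn_regress(X_train, y_train, x_q, k=5, eps=1e-12):
    """Predict a value as the inverse-squared-distance weighted average of k neighbors."""
    dists = np.linalg.norm(np.asarray(X_train) - x_q, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] ** 2 + eps)   # eps avoids division by zero on exact matches
    return np.sum(w * np.asarray(y_train)[nearest]) / np.sum(w)
```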
The Curse of Dimensionality
KNN works beautifully in low dimensions, but faces serious challenges as the number of features grows. This is known as the curse of dimensionality:
- Distances become meaningless: In high dimensions, all points become roughly equidistant
- Data becomes sparse: Need exponentially more data to cover the space
- Computation explodes: $O(N \cdot d)$ for each query
| Dimensions | Points needed to cover the space (≈10 per dimension) | Nearest neighbor reliability |
|---|---|---|
| 2D | 100 | ✅ Good |
| 10D | $10^{10}$ | ⚠️ Degraded |
| 100D | $10^{100}$ | ❌ Meaningless |
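The "distances become meaningless" point can be checked empirically: for random points in the unit cube, the nearest and farthest neighbors end up at almost the same distance as $d$ grows. A small illustrative sketch (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))      # 1000 random points in the d-dimensional unit cube
    q = rng.random(d)              # one random query point
    dists = np.linalg.norm(X - q, axis=1)
    # A ratio near 1 means the nearest and farthest points are almost equally far away
    print(f"d={d:4d}  nearest/farthest distance ratio = {dists.min() / dists.max():.3f}")
```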
Solutions:
- Reduce dimensions (PCA, t-SNE)
- Use feature selection
- Consider other algorithms for $d > 20$
Code Practice
This section implements KNN from scratch, then demonstrates sklearn for real-world applications.
KNN from Scratch
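A self-contained sketch of a from-scratch implementation (the class name `KNNClassifier` and the two-blob test data are illustrative assumptions):

```python
import numpy as np
from collections import Counter

class KNNClassifier:
    """K-Nearest Neighbors classifier using plain NumPy and Euclidean distance."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        # Lazy learning: "training" just memorizes the data
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

    def _predict_one(self, x):
        dists = np.linalg.norm(self.X_train - x, axis=1)       # distances to all points
        nearest = np.argsort(dists)[: self.k]                  # indices of the k closest
        return Counter(self.y_train[nearest]).most_common(1)[0][0]  # majority vote

    def predict(self, X):
        return np.array([self._predict_one(x) for x in np.asarray(X, dtype=float)])

# Quick check on two Gaussian blobs centered at (0, 0) and (4, 4)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(KNNClassifier(k=5).fit(X, y).predict([[0, 0], [4, 4]]))   # expected: [0 1]
```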
sklearn KNN Example
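A sketch with scikit-learn's `KNeighborsClassifier`; the Iris dataset and 70/30 split are assumptions chosen for brevity:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale first: KNN is distance-based, so features must share a common scale
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```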
Finding Optimal K with Cross-Validation
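A sketch that scans K from 1 to 20 (the guideline range above) with 5-fold cross-validation; the dataset choice is again an assumption:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

k_values = range(1, 21)
scores = []
for k in k_values:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean 5-fold cross-validated accuracy for this K
    scores.append(cross_val_score(model, X, y, cv=5).mean())

best_k = k_values[int(np.argmax(scores))]
print(f"Best K = {best_k} (CV accuracy = {max(scores):.3f})")
```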
Decision Boundary Visualization
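A sketch that contrasts a small and a large K on 2D synthetic data; it assumes matplotlib and scikit-learn >= 1.1 for `DecisionBoundaryDisplay`:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, k in zip(axes, (1, 25)):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # Shade the predicted class over a grid covering the data
    DecisionBoundaryDisplay.from_estimator(
        knn, X, ax=ax, alpha=0.4, response_method="predict"
    )
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k", s=20)
    ax.set_title(f"K = {k}")
plt.tight_layout()
plt.show()
```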
Distance Metrics Comparison
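A sketch comparing the three metrics through the `metric` parameter of `KNeighborsClassifier`; the breast cancer dataset is an assumption:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

for metric in ("euclidean", "manhattan", "chebyshev"):
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5, metric=metric),
    )
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{metric:>10}: CV accuracy = {score:.3f}")
```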
Deep Dive
Here are answers to questions about KNN that come up often in practice.
Frequently Asked Questions
Q1: Should I scale features before KNN?
Yes, absolutely! KNN is distance-based, so features on larger scales dominate.
| Feature Scaling | Impact on KNN |
|---|---|
| Not scaled | Features with larger range dominate distance |
| StandardScaler | Zero mean, unit variance — equal contribution |
| MinMaxScaler | [0, 1] range — also works well |
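A sketch of the scaled vs. unscaled comparison; the wine dataset is an assumption, picked because its feature ranges differ by orders of magnitude:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

raw = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Cross-validated accuracy with and without feature scaling
print("Unscaled:", cross_val_score(raw, X, y, cv=5).mean().round(3))
print("Scaled  :", cross_val_score(scaled, X, y, cv=5).mean().round(3))
```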
Q2: KNN vs other classifiers — when to use which?
| Scenario | Recommendation | Why |
|---|---|---|
| Small dataset, few features | KNN | Simple, effective |
| Large dataset (N > 100k) | Logistic Regression, Random Forest | KNN is too slow |
| High-dimensional (d > 20) | SVM, Random Forest | Curse of dimensionality |
| Need probability outputs | Logistic Regression | KNN probabilities are rough |
| Non-linear boundaries | KNN, SVM (RBF) | KNN handles complex shapes |
| Interpretability needed | Decision Tree | KNN is hard to explain |
Q3: How to speed up KNN?
| Method | Description | Speedup |
|---|---|---|
| KD-Tree | Tree structure for nearest neighbor search | 10-100x |
| Ball Tree | Works better in higher dimensions | 10-100x |
| Approximate NN | Trade accuracy for speed (Annoy, FAISS) | 100-1000x |
| Reduce dimensions | PCA before KNN | Depends on compression |
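In scikit-learn, the search structure is selected with the `algorithm` parameter; a rough timing sketch on synthetic data (sizes are arbitrary, and actual speedups depend heavily on N and d):

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((20000, 10))            # 20k training points, 10 features
y = rng.integers(0, 2, 20000)
X_query = rng.random((1000, 10))       # 1000 query points

for algo in ("brute", "kd_tree", "ball_tree"):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo).fit(X, y)
    t0 = time.perf_counter()
    knn.predict(X_query)
    print(f"{algo:>9}: {time.perf_counter() - t0:.3f} s for 1000 queries")
```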
Q4: KNN for regression?
KNN works for regression too — just average the neighbor values:
$$\hat{y} = \frac{1}{K} \sum_{i \in N_K} y_i$$
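scikit-learn exposes this as `KNeighborsRegressor`; a minimal sketch on a noisy sine curve:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Noisy sine wave as a toy regression problem
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

reg = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X, y)
print(reg.predict([[np.pi / 2]]))   # should be close to sin(pi/2) = 1
```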
Q5: How to handle ties in voting?
| Strategy | Description |
|---|---|
| Use odd K | For binary classification, avoids ties |
| Random selection | Pick randomly among tied classes |
| Weight by distance | Closer points break the tie |
| Increase K | Until tie is broken |
Note that scikit-learn's KNeighborsClassifier does not break ties randomly: with uniform weights it effectively favors the class that appears first in its sorted `classes_` array, so using an odd K or distance weighting is the more reliable fix.
Practical Tips
KNN Checklist:
- Scale features — always!
- Start with K=5 and cross-validate
- Use odd K for binary problems
- Consider weighted voting for imbalanced data
- Check dimensions — avoid d > 20
Common Pitfalls
| Pitfall | Symptom | Solution |
|---|---|---|
| Unscaled features | Poor accuracy, one feature dominates | Use StandardScaler |
| Too small K | Overfitting, noisy predictions | Increase K |
| Too large K | Underfitting, ignores local patterns | Decrease K |
| High dimensions | Slow, poor accuracy | Reduce dimensions |
| Imbalanced classes | Predicts majority class | Use weighted KNN |
| Slow prediction | Large N or d | Use KD-tree, reduce data |
Summary
Key Formulas
| Concept | Formula |
|---|---|
| Euclidean Distance | $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$ |
| Manhattan Distance | $d(x, y) = \sum_i |x_i - y_i|$ |
| Classification | $\hat{y} = \text{mode}(y_{N_K})$ |
| Regression | $\hat{y} = \frac{1}{K} \sum_{i \in N_K} y_i$ |
Key Takeaways
- KNN is instance-based — no training, all work at prediction time
- Distance metrics matter — Euclidean is default, Manhattan for sparse data
- K controls complexity — small K overfits, large K underfits
- Feature scaling is essential — unscaled features break KNN
- Curse of dimensionality — KNN struggles in high dimensions
- Simple but powerful — excellent baseline, competitive on small datasets
When to Use KNN
| ✅ Use When | ❌ Avoid When |
|---|---|
| Small dataset (N < 10k) | Large dataset (N > 100k) |
| Low dimensions (d < 20) | High dimensions (d > 50) |
| Non-linear patterns | Linear relationships |
| Quick baseline needed | Fast prediction required |
| Memory is available | Memory constrained |
References
- Cover, T. & Hart, P. (1967). "Nearest Neighbor Pattern Classification"
- scikit-learn documentation: KNeighborsClassifier
- Hastie, T. et al. "The Elements of Statistical Learning", Chapter 13
- Beyer, K. et al. (1999). "When Is 'Nearest Neighbor' Meaningful?"