ML-06: Linear Regression
Learning Objectives
- Understand regression vs classification
- Derive the least squares solution
- Implement linear regression from scratch
- Extend to multiple features
Theory
Regression vs Classification
| Task | Output Type | Example |
|---|---|---|
| Classification | Discrete labels | Spam or not spam |
| Regression | Continuous values | House price prediction |
Classification answers “which category?” while regression answers “how much?”
Linear Model
$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots = \boldsymbol{w}^T \boldsymbol{x}$$
Where $\boldsymbol{x} = (1, x_1, x_2, \ldots)^T$ is augmented with a leading 1 so that $w_0$ acts as the intercept (bias term).
Breaking it down:
- $w_0$ is the intercept (baseline prediction when all features are 0)
- $w_1, w_2, \ldots$ are weights that determine how much each feature contributes
- Each weight tells the model: “for every 1-unit increase in this feature, change the prediction by this amount”
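For example, with hypothetical weights $w_0 = 50$ and $w_1 = 2$ and a single feature value $x_1 = 30$:

$$\hat{y} = 50 + 2 \cdot 30 = 110$$

Increasing $x_1$ by one unit would raise the prediction by $w_1 = 2$, which is exactly the "change by this amount" reading above.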
Why Squared Error?
The goal is to find the “best” line through data points. What makes a line optimal? Consider these error metrics:
| Error Metric | Formula | Properties |
|---|---|---|
| Sum of errors | $\sum (y_i - \hat{y}_i)$ | ❌ Positive and negative errors cancel out |
| Sum of absolute errors | $\sum \vert y_i - \hat{y}_i \vert$ | ✓ Works, but not differentiable at 0 |
| Sum of squared errors | $\sum (y_i - \hat{y}_i)^2$ | ✓ Differentiable, penalizes large errors more |
Key insight: Squared error is preferred because:
- It’s always positive (no cancellation)
- It’s smooth and differentiable (enables calculus-based optimization)
- It penalizes large errors heavily (one big error is worse than several small ones)
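A quick numeric illustration of the cancellation problem, using made-up residuals:

```python
import numpy as np

# Two hypothetical prediction errors of equal size but opposite sign
errors = np.array([3.0, -3.0])

print(errors.sum())          # 0.0  -> signed errors cancel; the fit looks "perfect"
print(np.abs(errors).sum())  # 6.0  -> absolute error reports the total miss
print((errors ** 2).sum())   # 18.0 -> squared error; large misses are weighted more
```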

Least Squares Objective
Minimize the sum of squared errors:
$$J(\boldsymbol{w}) = \sum_{i=1}^{N} (y_i - \boldsymbol{w}^T \boldsymbol{x}_i)^2 = \|\boldsymbol{y} - X\boldsymbol{w}\|^2$$
In matrix form, where $X$ is the design matrix (each row is a sample, each column is a feature):
- $\boldsymbol{y}$ = vector of true values $(y_1, y_2, \ldots, y_N)^T$
- $X\boldsymbol{w}$ = vector of predictions
- $\|\cdot\|^2$ = squared Euclidean norm
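As a sanity check, here is a small sketch (random data, purely to illustrate the notation) showing that the summation and matrix forms of $J(\boldsymbol{w})$ give the same number:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 samples, 3 features (bias column omitted here for brevity)
y = rng.normal(size=5)
w = rng.normal(size=3)

J_sum = sum((y[i] - w @ X[i]) ** 2 for i in range(len(y)))  # summation form
J_mat = np.linalg.norm(y - X @ w) ** 2                      # matrix form
print(np.isclose(J_sum, J_mat))                             # True
```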
Deriving the Closed-Form Solution
The optimal weights can be derived step by step.
Step 1: Expand the objective function
$$J(\boldsymbol{w}) = (\boldsymbol{y} - X\boldsymbol{w})^T(\boldsymbol{y} - X\boldsymbol{w})$$
Expanding the product:
$$J(\boldsymbol{w}) = \boldsymbol{y}^T\boldsymbol{y} - 2\boldsymbol{w}^T X^T \boldsymbol{y} + \boldsymbol{w}^T X^T X \boldsymbol{w}$$
Step 2: Take the gradient with respect to $\boldsymbol{w}$
Using matrix calculus rules:
- $\frac{\partial}{\partial \boldsymbol{w}}(\boldsymbol{w}^T \boldsymbol{a}) = \boldsymbol{a}$
- $\frac{\partial}{\partial \boldsymbol{w}}(\boldsymbol{w}^T A \boldsymbol{w}) = 2A\boldsymbol{w}$ (when $A$ is symmetric)
$$\nabla_w J = -2X^T\boldsymbol{y} + 2X^T X\boldsymbol{w}$$
Step 3: Set gradient to zero and solve
$$-2X^T\boldsymbol{y} + 2X^T X\boldsymbol{w}^* = 0$$
$$X^T X \boldsymbol{w}^* = X^T \boldsymbol{y}$$
$$\boldsymbol{w}^* = (X^T X)^{-1} X^T \boldsymbol{y}$$
This is the normal equation — a direct, closed-form solution for linear regression.
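A minimal NumPy sketch of the normal equation on synthetic data; `np.linalg.solve` is applied to the system $X^T X \boldsymbol{w} = X^T \boldsymbol{y}$ rather than forming the inverse explicitly, which is the numerically safer choice:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
true_w = np.array([3.0, 1.5, -2.0])                          # intercept, then two feature weights
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # design matrix with bias column
y = X @ true_w + rng.normal(scale=0.1, size=n)               # noisy linear data

w_star = np.linalg.solve(X.T @ X, X.T @ y)                   # normal equation
print(w_star)                                                # close to [3.0, 1.5, -2.0]
```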
Geometric Interpretation
The prediction $\hat{\boldsymbol{y}} = X\boldsymbol{w}$ is the projection of $\boldsymbol{y}$ onto the column space of $X$.

Why projection? The intuition:
- The column space of $X$ contains all possible predictions the model can make
- The true $\boldsymbol{y}$ might not be in this space (data is rarely perfectly linear)
- The closest point in this space to $\boldsymbol{y}$ is its orthogonal projection
- “Closest” in Euclidean distance = minimizing squared error!
The orthogonality condition: The residual vector $(\boldsymbol{y} - \hat{\boldsymbol{y}})$ is perpendicular to every column of $X$:
$$X^T(\boldsymbol{y} - X\boldsymbol{w}) = 0$$
This is exactly the normal equation rearranged—geometry and calculus yield the same answer.
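The orthogonality condition is easy to verify numerically; a small sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # design matrix with bias column
y = rng.normal(size=50)

w = np.linalg.solve(X.T @ X, X.T @ y)   # least squares fit
residual = y - X @ w
print(X.T @ residual)                   # all entries are numerically ~0
```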
Understanding R² Score
The coefficient of determination ($R^2$) measures how well the model explains the variance in the data:
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$
Where:
- $SS_{res}$ = Residual sum of squares (unexplained variance)
- $SS_{tot}$ = Total sum of squares (total variance)
- $\bar{y}$ = mean of $y$
Interpreting R²:
| R² Value | Interpretation |
|---|---|
| 1.0 | Perfect fit — model explains all variance |
| 0.8 | Good — 80% of variance explained |
| 0.5 | Moderate — half the variance explained |
| 0.0 | Model is no better than predicting the mean |
| < 0 | Model is worse than predicting the mean |
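A short sketch of $R^2$ computed straight from its definition, on toy numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.7])   # hypothetical model predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)   # close to 1.0: the predictions explain almost all of the variance
```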
Code Practice
From Scratch Implementation
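Below is a minimal from-scratch sketch based on the normal equation derived above; the class name and method signatures are illustrative (loosely mirroring scikit-learn's `fit`/`predict`/`score` convention), not a fixed API:

```python
import numpy as np

class LinearRegressionScratch:
    """Ordinary least squares fitted with the normal equation (illustrative sketch)."""

    @staticmethod
    def _as_2d(X):
        X = np.asarray(X, dtype=float)
        return X[:, None] if X.ndim == 1 else X   # treat 1-D input as a single feature

    def fit(self, X, y):
        X = self._as_2d(X)
        y = np.asarray(y, dtype=float)
        Xb = np.column_stack([np.ones(len(X)), X])   # prepend a bias column of ones
        # Solve X^T X w = X^T y rather than inverting X^T X explicitly
        w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
        self.intercept_, self.coef_ = w[0], w[1:]
        return self

    def predict(self, X):
        return self.intercept_ + self._as_2d(X) @ self.coef_

    def score(self, X, y):
        """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
        y = np.asarray(y, dtype=float)
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1 - ss_res / ss_tot
```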
Example: House Price
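A worked example on made-up house data (areas in m², prices in thousands of dollars, purely illustrative), fitting a single-feature model with the normal equation directly so the snippet stands on its own:

```python
import numpy as np

# Hypothetical data: living area (m^2) vs. sale price (in $1000s)
area = np.array([50.0, 60.0, 80.0, 100.0, 120.0])
price = np.array([150.0, 180.0, 240.0, 300.0, 355.0])

X = np.column_stack([np.ones_like(area), area])    # design matrix with bias column
w = np.linalg.solve(X.T @ X, X.T @ price)          # normal equation
print(f"intercept = {w[0]:.2f}, slope = {w[1]:.2f}")

# Prediction for a hypothetical 90 m^2 house
print("predicted price:", w[0] + w[1] * 90)
```

The `LinearRegressionScratch` sketch above would produce the same coefficients via `LinearRegressionScratch().fit(area, price)`.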
Multiple Linear Regression
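A sketch with two features (again invented numbers); `np.linalg.lstsq` solves the same least-squares problem without forming $X^T X$ explicitly:

```python
import numpy as np

# Hypothetical data: [area (m^2), number of rooms] -> price (in $1000s)
features = np.array([[50, 2], [60, 2], [80, 3], [100, 4], [120, 4]], dtype=float)
price = np.array([150.0, 185.0, 245.0, 305.0, 350.0])

X = np.column_stack([np.ones(len(features)), features])   # add bias column

# Least-squares solution; more robust than inverting X^T X
w, *_ = np.linalg.lstsq(X, price, rcond=None)
print("intercept:", w[0])
print("weights (area, rooms):", w[1:])

# Prediction for a hypothetical 90 m^2, 3-room house
x_new = np.array([1.0, 90.0, 3.0])
print("predicted price:", x_new @ w)
```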
Deep Dive
Q1: When does the normal equation fail?
The normal equation $\boldsymbol{w}^* = (X^T X)^{-1} X^T \boldsymbol{y}$ requires inverting $X^T X$, which can fail or be problematic in several cases:
| Problem | Cause | Solution |
|---|---|---|
| Singular matrix | Features are linearly dependent (e.g., feature_3 = 2 × feature_1) | Remove redundant features or use regularization |
| Near-singular matrix | Features are highly correlated (multicollinearity) | Use Ridge regression (L2 regularization) |
| Large dataset | Forming $X^T X$ costs $O(np^2)$ for $n$ samples and $p$ features, and inverting it costs $O(p^3)$ | Use gradient descent (iterative, $O(np)$ per step) |
| More features than samples | $X^T X$ is not full rank when $p > n$ | Use regularization or dimensionality reduction |
Practical rule of thumb: use the normal equation for small problems (fewer than ~10,000 samples and a modest number of features); use gradient descent for large ones.
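A small illustration of the singular case with deliberately redundant features, and one common workaround (the Moore-Penrose pseudoinverse); Ridge regression would be the other standard remedy:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=20)
# Third column is exactly twice the second, so X^T X is rank-deficient (singular)
X = np.column_stack([np.ones(20), x1, 2.0 * x1])
y = 1.0 + 3.0 * x1 + rng.normal(scale=0.1, size=20)

A = X.T @ X
print("rank of X^T X:", np.linalg.matrix_rank(A), "out of", A.shape[0])  # 2 out of 3

# The pseudoinverse returns the minimum-norm least-squares solution
w = np.linalg.pinv(X) @ y
print("weights:", w)   # the effect of x1 is split across the two dependent columns
```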
Q2: What does a negative coefficient mean?
A negative weight indicates an inverse relationship: as that feature increases, the prediction decreases (holding other features constant).
Examples:
- House prices: “years since renovation” → negative (older renovation = lower price)
- Test scores: “hours of distraction” → negative (more distraction = lower score)
- Fuel efficiency: “vehicle weight” → negative (heavier = less efficient)
Q3: Linear regression vs. correlation?
These concepts are related but serve different purposes:
| Aspect | Correlation (r) | Linear Regression |
|---|---|---|
| Purpose | Measure relationship strength | Predict values |
| Output | Single number (-1 to 1) | Prediction function $\hat{y} = w^T x$ |
| Directionality | Symmetric (r(X,Y) = r(Y,X)) | Asymmetric (X predicts Y) |
| Units | Unitless | Same units as Y |
| Relationship | $R^2 = r^2$ for simple linear regression | — |
Key insight: The correlation coefficient $r$ measures how tightly the points cluster around the best-fit line. For simple linear regression, $R^2 = r^2$, so both convey the same information (the proportion of variance explained) in different forms.
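A quick numerical check of the $R^2 = r^2$ relationship on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)

# Simple linear regression via the normal equation
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ w

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r2, r ** 2))   # True
```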
Q4: How to handle categorical features?
Linear regression works with numbers, so categorical features must be encoded:
| Method | Example | When to use |
|---|---|---|
| One-hot encoding | Color: [Red, Blue, Green] → [1,0,0], [0,1,0], [0,0,1] | Nominal categories (no order) |
| Ordinal encoding | Size: [S, M, L] → [1, 2, 3] | Ordinal categories (with order) |
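A minimal encoding sketch with pandas; the column names and category values are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],   # nominal: no natural order
    "size":  ["S", "L", "M", "S"],               # ordinal: S < M < L
})

# One-hot encode the nominal feature; drop_first avoids a redundant column
# that would be linearly dependent with the intercept (see Q1)
color_onehot = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Map the ordinal feature to integers that respect its order
size_level = df["size"].map({"S": 1, "M": 2, "L": 3}).rename("size_level")

X = pd.concat([color_onehot, size_level], axis=1)
print(X)
```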
Summary
| Concept | Formula/Meaning |
|---|---|
| Linear Model | $\hat{y} = \boldsymbol{w}^T \boldsymbol{x}$ |
| Loss Function | $\|\boldsymbol{y} - X\boldsymbol{w}\|^2$ (sum of squared errors) |
| Solution | $\boldsymbol{w}^* = (X^T X)^{-1} X^T \boldsymbol{y}$ (normal equation) |
| R² Score | Proportion of variance explained |
References
- Bishop, C. “Pattern Recognition and Machine Learning” - Chapter 3
- sklearn Linear Regression Documentation
- Montgomery, D. “Introduction to Linear Regression Analysis”