ML-06: Linear Regression

Summary
A deep dive into linear regression: understand why squared error works, derive the normal equation step-by-step, explore the geometric intuition behind projections, and implement from scratch.

Learning Objectives

  • Understand regression vs classification
  • Derive the least squares solution
  • Implement linear regression from scratch
  • Extend to multiple features

Theory

Regression vs Classification

| Task | Output Type | Example |
|---|---|---|
| Classification | Discrete labels | Spam or not spam |
| Regression | Continuous values | House price prediction |

Classification answers “which category?” while regression answers “how much?”

Linear Model

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots = \boldsymbol{w}^T \boldsymbol{x}$$

Where $\boldsymbol{x} = (1, x_1, x_2, \ldots)^T$ includes the bias term (intercept).

Breaking it down:

  • $w_0$ is the intercept (baseline prediction when all features are 0)
  • $w_1, w_2, \ldots$ are weights that determine how much each feature contributes
  • Each weight tells the model: “for every 1-unit increase in this feature, change the prediction by this amount”
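As a quick numeric sketch (the weights and feature values below are made up for illustration), evaluating the model is just a dot product once the bias is folded into $\boldsymbol{x}$:

🐍 Python
import numpy as np

# Hypothetical weights: intercept w0 = 4.0, then one weight per feature
w = np.array([4.0, 3.0, -0.5])
# Feature vector with the leading 1 for the bias term: x = (1, x1, x2)
x = np.array([1.0, 1.2, 2.0])

y_hat = w @ x  # w^T x = 4.0 + 3.0*1.2 + (-0.5)*2.0 = 6.6
print(y_hat)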

Why Squared Error?

The goal is to find the “best” line through data points. What makes a line optimal? Consider these error metrics:

| Error Metric | Formula | Properties |
|---|---|---|
| Sum of errors | $\sum (y_i - \hat{y}_i)$ | ❌ Positive and negative errors cancel out |
| Sum of absolute errors | $\sum \vert y_i - \hat{y}_i \vert$ | ✓ Works, but not differentiable at 0 |
| Sum of squared errors | $\sum (y_i - \hat{y}_i)^2$ | ✓ Differentiable, penalizes large errors more |

Key insight: Squared error is preferred because:

  1. It’s always positive (no cancellation)
  2. It’s smooth and differentiable (enables calculus-based optimization)
  3. It penalizes large errors heavily (one big error is worse than several small ones)

Figure: House Price Fit. Linear regression finds the best-fit line by minimizing the sum of squared residuals (vertical dashed lines).
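To make the cancellation problem concrete, here is a small sketch with made-up residuals; the plain sum reports zero error even though the fit is clearly off:

🐍 Python
import numpy as np

# Made-up residuals (y_i - y_hat_i) for illustration
residuals = np.array([2.0, -2.0, 0.5, -0.5])

print(residuals.sum())           # 0.0 -> positive and negative errors cancel
print(np.abs(residuals).sum())   # 5.0 -> absolute error detects the misfit
print((residuals ** 2).sum())    # 8.5 -> squared error, large errors dominate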

Least Squares Objective

Minimize the sum of squared errors:

$$J(\boldsymbol{w}) = \sum_{i=1}^{N} (y_i - \boldsymbol{w}^T \boldsymbol{x}_i)^2 = \|\boldsymbol{y} - X\boldsymbol{w}\|^2$$

In matrix form, where $X$ is the design matrix (each row is a sample, each column is a feature):

  • $\boldsymbol{y}$ = vector of true values $(y_1, y_2, \ldots, y_N)^T$
  • $X\boldsymbol{w}$ = vector of predictions
  • $\|\cdot\|^2$ = squared Euclidean norm
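The summation form and the matrix form compute the same quantity; a minimal sketch on random data (the shapes are arbitrary choices for illustration):

🐍 Python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))   # toy design matrix: 5 samples, 2 columns
y = rng.normal(size=5)
w = rng.normal(size=2)

J_sum = sum((y[i] - w @ X[i]) ** 2 for i in range(len(y)))  # per-sample sum
J_mat = np.linalg.norm(y - X @ w) ** 2                      # squared norm ||y - Xw||^2
print(np.isclose(J_sum, J_mat))  # True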

Deriving the Closed-Form Solution

The optimal weights can be derived step by step.

Step 1: Expand the objective function

$$J(\boldsymbol{w}) = (\boldsymbol{y} - X\boldsymbol{w})^T(\boldsymbol{y} - X\boldsymbol{w})$$

Expanding the product:

$$J(\boldsymbol{w}) = \boldsymbol{y}^T\boldsymbol{y} - 2\boldsymbol{w}^T X^T \boldsymbol{y} + \boldsymbol{w}^T X^T X \boldsymbol{w}$$

Step 2: Take the gradient with respect to $\boldsymbol{w}$

Using matrix calculus rules:

  • $\frac{\partial}{\partial \boldsymbol{w}}(\boldsymbol{w}^T \boldsymbol{a}) = \boldsymbol{a}$
  • $\frac{\partial}{\partial \boldsymbol{w}}(\boldsymbol{w}^T A \boldsymbol{w}) = 2A\boldsymbol{w}$ (when $A$ is symmetric)

$$\nabla_w J = -2X^T\boldsymbol{y} + 2X^T X\boldsymbol{w}$$

Step 3: Set gradient to zero and solve

$$-2X^T\boldsymbol{y} + 2X^T X\boldsymbol{w}^* = 0$$

$$X^T X \boldsymbol{w}^* = X^T \boldsymbol{y}$$

$$\boldsymbol{w}^* = (X^T X)^{-1} X^T \boldsymbol{y}$$

This is the normal equation — a direct, closed-form solution for linear regression.
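One way to sanity-check the derivation is to compare the normal-equation solution against numpy's built-in least-squares solver on synthetic data; a minimal sketch (the true weights and noise level are arbitrary):

🐍 Python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]        # design matrix with bias column
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=50)

w_normal = np.linalg.inv(X.T @ X) @ X.T @ y             # normal equation
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)         # library least-squares solver

print(np.allclose(w_normal, w_lstsq))  # True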

Geometric Interpretation

The prediction $\hat{\boldsymbol{y}} = X\boldsymbol{w}$ is the projection of $\boldsymbol{y}$ onto the column space of $X$.

Figure: Geometric interpretation of least squares. The prediction ŷ is the orthogonal projection of y onto the column space of X, with the residual (y - ŷ) perpendicular to that plane.

Why projection? The intuition:

  • The column space of $X$ contains all possible predictions the model can make
  • The true $\boldsymbol{y}$ might not be in this space (data is rarely perfectly linear)
  • The closest point in this space to $\boldsymbol{y}$ is its orthogonal projection
  • “Closest” in Euclidean distance = minimizing squared error!

The orthogonality condition: The residual vector $(\boldsymbol{y} - \hat{\boldsymbol{y}})$ is perpendicular to every column of $X$:

$$X^T(\boldsymbol{y} - X\boldsymbol{w}) = 0$$

This is exactly the normal equation rearranged—geometry and calculus yield the same answer.
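The orthogonality condition is easy to verify numerically: fit with the normal equation, then check that $X^T$ times the residual is (numerically) zero. A small sketch on random data:

🐍 Python
import numpy as np

rng = np.random.default_rng(1)
X = np.c_[np.ones(30), rng.normal(size=(30, 2))]
y = rng.normal(size=30)

w_star = np.linalg.inv(X.T @ X) @ X.T @ y
residual = y - X @ w_star

print(X.T @ residual)                  # ~ [0, 0, 0] up to floating-point error
print(np.allclose(X.T @ residual, 0))  # True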

Understanding R² Score

The coefficient of determination ($R^2$) measures how well the model explains the variance in the data:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

Where:

  • $SS_{res}$ = Residual sum of squares (unexplained variance)
  • $SS_{tot}$ = Total sum of squares (total variance)
  • $\bar{y}$ = mean of $y$

Interpreting R²:

| R² Value | Interpretation |
|---|---|
| 1.0 | Perfect fit — model explains all variance |
| 0.8 | Good — 80% of variance explained |
| 0.5 | Moderate — half the variance explained |
| 0.0 | Model is no better than predicting the mean |
| < 0 | Model is worse than predicting the mean |
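The definition translates directly into a few lines of numpy; a sketch comparing a hand-rolled $R^2$ with scikit-learn's `r2_score` on made-up values:

🐍 Python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, r2_score(y_true, y_pred)))  # True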

Code Practice

From Scratch Implementation

🐍 Python
import numpy as np

class LinearRegressionScratch:
    def __init__(self):
        self.w = None
    
    def fit(self, X, y):
        # Add bias column
        X_b = np.c_[np.ones(len(X)), X]
        # Normal equation
        self.w = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
        return self
    
    def predict(self, X):
        X_b = np.c_[np.ones(len(X)), X]
        return X_b @ self.w

Example: House Price

🐍 Python
import numpy as np
import matplotlib.pyplot as plt

# Generate data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)  # Size in 1000 sqft
y = 4 + 3 * X.ravel() + np.random.randn(100)  # Price = 4 + 3*size + noise

# Train
model = LinearRegressionScratch()
model.fit(X, y)

print(f"Intercept: {model.w[0]:.2f}")
print(f"Slope: {model.w[1]:.2f}")

# Plot
plt.scatter(X, y, alpha=0.6)
X_line = np.array([[0], [2]])
plt.plot(X_line, model.predict(X_line), 'r-', linewidth=2)
plt.xlabel('Size (1000 sqft)')
plt.ylabel('Price ($10k)')
plt.title('Linear Regression: House Price')
plt.savefig('assets/linear_regression.png')

Output:

Intercept: 4.22
Slope: 2.77

Figure: Linear Regression: House Price (plot saved to assets/linear_regression.png).

Multiple Linear Regression

🐍 Python
23
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")

# Feature importance
for name, coef in zip(diabetes.feature_names, model.coef_):
    print(f"{name:>6}: {coef:>7.2f}")

Output:

R² Score: 0.480
RMSE: 52.27
   age:  -49.11
   sex: -187.26
   bmi:  552.55
    bp:  294.34
    s1: -932.06
    s2:  556.03
    s3:  161.11
    s4:  191.75
    s5:  763.95
    s6:  148.11

Deep Dive

Q1: When does the normal equation fail?

The normal equation $\boldsymbol{w}^* = (X^T X)^{-1} X^T \boldsymbol{y}$ requires inverting $X^T X$, which can fail or be problematic in several cases:

| Problem | Cause | Solution |
|---|---|---|
| Singular matrix | Features are linearly dependent (e.g., feature_3 = 2 × feature_1) | Remove redundant features or use regularization |
| Near-singular matrix | Features are highly correlated (multicollinearity) | Use Ridge regression (L2 regularization) |
| Large dataset | Forming and inverting $X^T X$ costs $O(np^2 + p^3)$ for $n$ samples and $p$ features | Use gradient descent (iterative, $O(np)$ per full pass) |
| More features than samples | $X^T X$ is not full rank when $p > n$ | Use regularization or dimensionality reduction |

Practical rule of thumb: the normal equation works well when the number of features is modest (up to a few thousand); switch to gradient descent when the feature count is large or the dataset is too big to process in memory.
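As an illustration of the singular case from the table above, here is a sketch (made-up data) where the third column is an exact multiple of the second, so $X^T X$ is rank-deficient; the pseudo-inverse still returns a least-squares solution:

🐍 Python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=20)
X = np.c_[np.ones(20), x1, 2 * x1]       # third column = 2 * second column
y = 1.0 + 3.0 * x1 + 0.1 * rng.normal(size=20)

print(np.linalg.matrix_rank(X))          # 2, not 3 -> X^T X cannot be inverted reliably
print(np.linalg.cond(X.T @ X))           # enormous condition number

w_pinv = np.linalg.pinv(X) @ y           # minimum-norm least-squares solution
print(w_pinv)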

Q2: What does a negative coefficient mean?

A negative weight indicates an inverse relationship: as that feature increases, the prediction decreases (holding other features constant).

Examples:

  • House prices: “years since renovation” → negative (older renovation = lower price)
  • Test scores: “hours of distraction” → negative (more distraction = lower score)
  • Fuel efficiency: “vehicle weight” → negative (heavier = less efficient)

Be careful with interpretation! Coefficients show correlation in the model, not necessarily causation. A negative coefficient for “age” in a house price model might reflect other factors correlated with age.

Q3: Linear regression vs. correlation?

These concepts are related but serve different purposes:

| Aspect | Correlation (r) | Linear Regression |
|---|---|---|
| Purpose | Measure relationship strength | Predict values |
| Output | Single number (-1 to 1) | Prediction function $\hat{y} = w^T x$ |
| Directionality | Symmetric ($r(X,Y) = r(Y,X)$) | Asymmetric (X predicts Y) |
| Units | Unitless | Same units as Y |
| Relationship | $R^2 = r^2$ for simple linear regression | |

Key insight: The correlation coefficient $r$ measures how tightly the points cluster around the best-fit straight line. The relationship $R^2 = r^2$ shows the proportion of variance explained; both convey the same information in different forms.
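A quick numeric check of $R^2 = r^2$ for the one-feature case, on synthetic data (slope, intercept, and noise level are arbitrary):

🐍 Python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(size=100)

slope, intercept = np.polyfit(x, y, 1)     # simple (one-feature) linear regression
y_hat = slope * x + intercept

r = np.corrcoef(x, y)[0, 1]                                       # Pearson correlation
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)   # coefficient of determination

print(np.isclose(r ** 2, r2))  # True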

Q4: How to handle categorical features?

Linear regression works with numbers, so categorical features must be encoded:

| Method | Example | When to use |
|---|---|---|
| One-hot encoding | Color: [Red, Blue, Green] → [1,0,0], [0,1,0], [0,0,1] | Nominal categories (no order) |
| Ordinal encoding | Size: [S, M, L] → [1, 2, 3] | Ordinal categories (with order) |

Don’t use ordinal encoding for nominal categories! Assigning Red=1, Blue=2, Green=3 implies Blue is “between” Red and Green, which is meaningless.
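A minimal one-hot encoding sketch with scikit-learn's `OneHotEncoder` (the color values are made up; the encoder orders categories alphabetically):

🐍 Python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["Red"], ["Blue"], ["Green"], ["Blue"]])  # nominal feature

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(colors).toarray()  # dense indicator columns: Blue, Green, Red
print(encoder.categories_)
print(one_hot)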

Summary

| Concept | Formula / Meaning |
|---|---|
| Linear Model | $\hat{y} = \boldsymbol{w}^T \boldsymbol{x}$ |
| Loss Function | $\Vert \boldsymbol{y} - X\boldsymbol{w} \Vert^2$ (squared error) |
| Solution | $\boldsymbol{w}^* = (X^T X)^{-1} X^T \boldsymbol{y}$ |
| R² Score | Proportion of variance explained |
