ML-07: Nonlinear Regression and Regularization
Learning Objectives
- Transform features for nonlinear regression
- Understand why regularization works
- Implement Ridge regression (L2)
- Choose the regularization strength
Theory
Polynomial Features
On its raw input, linear regression can only fit a straight line. To model curves, polynomial features transform the input:
$$x \rightarrow (1, x, x^2, x^3, \ldots, x^d)$$
The model becomes:
$$\hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \cdots + w_d x^d$$
Key insight: Despite the curved output, the model remains linear in the parameters $w$. The same normal equation applies—only the feature matrix changes.
| Original Feature | Polynomial Features (degree 3) |
|---|---|
| $x = 2$ | $(1, 2, 4, 8)$ |
| $x = 3$ | $(1, 3, 9, 27)$ |
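As a quick check of the expansion above, here is a minimal sketch (assuming NumPy and scikit-learn are available) that builds the same degree-3 features with `PolynomialFeatures`:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two samples of a single feature: x = 2 and x = 3
X = np.array([[2.0], [3.0]])

# Expand to (1, x, x^2, x^3)
poly = PolynomialFeatures(degree=3, include_bias=True)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())  # ['1' 'x0' 'x0^2' 'x0^3']
print(X_poly)
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]
```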
The Overfitting Problem
Higher-degree polynomials have more flexibility to fit training data, but this comes at a cost:
| Degree | Behavior | Training Error | Test Error |
|---|---|---|---|
| Too low (1) | Underfitting — cannot capture the pattern | High | High |
| Just right (4) | Good fit — captures trend without noise | Low | Low |
| Too high (15) | Overfitting — memorizes noise | Very low | Very high |

The Bias-Variance Tradeoff:
- Bias = error from overly simple assumptions (underfitting)
- Variance = error from sensitivity to training data fluctuations (overfitting)
- The optimal model is complex enough to capture the signal, yet simple enough to avoid fitting noise.
Ridge Regression (L2 Regularization)
Instead of restricting model complexity by limiting the polynomial degree, regularization adds a penalty term to the loss function:
$$J(\boldsymbol{w}) = \|y - X\boldsymbol{w}\|^2 + \lambda \|\boldsymbol{w}\|^2$$
The first term measures data fit (prediction error), while the second term is the penalty (weight magnitude).
Why does penalizing large weights help?
- Large weights enable sharp oscillations in the fitted curve
- Constraining weights to remain small produces smoother curves
- Smaller weights correspond to simpler, more generalizable models

| $\lambda$ | Effect | Resulting Model |
|---|---|---|
| 0 | No regularization = ordinary least squares | May overfit |
| Small | Light constraint on weights | Slight smoothing |
| Large | Heavy constraint → weights shrink toward 0 | May underfit |
Closed-Form Solution
The Ridge regression solution has a closed form:
$$\boldsymbol{w}^* = (X^T X + \lambda I)^{-1} X^T y$$
Comparing to standard linear regression:
| Method | Solution | Matrix Invertibility |
|---|---|---|
| OLS | $(X^T X)^{-1} X^T y$ | Can fail if $X^T X$ is singular |
| Ridge | $(X^T X + \lambda I)^{-1} X^T y$ | Always invertible for $\lambda > 0$ ($X^T X + \lambda I$ is positive definite) |
💡 Tip: Ridge regression solves two problems at once: it prevents overfitting AND guarantees a solution even when features are collinear.
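This closed form translates directly into NumPy. The sketch below is illustrative only: for simplicity it penalizes every weight, including the bias column, whereas scikit-learn's Ridge leaves the intercept unpenalized by default.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for the ridge weights."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    b = X.T @ y
    # Solving the linear system is more stable than forming the inverse explicitly
    return np.linalg.solve(A, b)

# Tiny example: quadratic features with an explicit bias column
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
X = np.column_stack([np.ones_like(x), x, x**2])
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(x.shape)

print(ridge_closed_form(X, y, lam=0.0))  # lambda = 0: ordinary least squares
print(ridge_closed_form(X, y, lam=1.0))  # lambda = 1: weights shrunk toward zero
```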
Code Practice
Polynomial Overfitting Demo
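A minimal sketch of such a demo, assuming a synthetic sine-plus-noise dataset (the data-generating function, sample size, and random seed are illustrative choices), comparing degrees 1, 4, and 15: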
🐍 Python
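```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a smooth curve plus noise
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + 0.2 * rng.standard_normal(40)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Underfit (1), good fit (4), overfit (15)
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```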

Ridge Regression Comparison
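A sketch along the same lines, fitting a degree-15 polynomial with plain least squares and with Ridge at a few values of alpha (the dataset and the alpha grid are illustrative assumptions):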
🐍 Python
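```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + 0.2 * rng.standard_normal(40)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

def evaluate(regressor, label):
    # Standardize the polynomial features so a single alpha applies to all of them
    model = make_pipeline(
        PolynomialFeatures(degree=15, include_bias=False), StandardScaler(), regressor
    )
    model.fit(X_train, y_train)
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    max_w = np.abs(model[-1].coef_).max()
    print(f"{label:<12} test MSE={test_mse:8.4f}  max |w|={max_w:10.2f}")

evaluate(LinearRegression(), "OLS")
for alpha in (0.001, 0.1, 10.0):
    evaluate(Ridge(alpha=alpha), f"Ridge {alpha}")
```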

Finding Optimal λ with Cross-Validation
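A sketch of the search, assuming a GridSearchCV over a log-spaced alpha grid and a synthetic dataset; the best alpha it reports depends on the data and the random seed: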
🐍 Python
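```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + 0.2 * rng.standard_normal(60)

pipeline = make_pipeline(
    PolynomialFeatures(degree=15, include_bias=False), StandardScaler(), Ridge()
)
param_grid = {"ridge__alpha": np.logspace(-4, 2, 13)}  # 1e-4 ... 1e2

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("best alpha:", search.best_params_["ridge__alpha"])
print("best CV MSE:", -search.best_score_)
```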
Deep Dive
FAQ
Q1: Ridge vs. Lasso — what’s the difference?
| Regularization | Penalty | Effect on Weights | Best For |
|---|---|---|---|
| Ridge (L2) | $\lambda \|w\|^2$ | Shrinks all weights toward zero | When all features contribute |
| Lasso (L1) | $\lambda \|w\|_1$ | Can set weights exactly to zero | Feature selection |
| Elastic Net | Both L1 + L2 | Combines benefits of both | Many correlated features |
Geometric intuition:
- Ridge penalty creates a circular constraint region → weights shrink but never reach exactly zero
- Lasso penalty creates a diamond-shaped region → solutions tend to land on corners (sparse weights), as the sketch below demonstrates
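To see that sparsity difference numerically, here is a small sketch (the synthetic data, in which only 3 of 10 features truly matter, and the alpha values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 10 features, only the first 3 actually matter
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
true_w = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_w + 0.1 * rng.standard_normal(100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks but rarely zeroes; Lasso typically zeroes the irrelevant features
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
```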
Q2: How do I choose between polynomial degrees?
Use cross-validation with the “one standard error rule”:
- Try degrees 1 through 10 (or higher)
- Compute cross-validation score for each degree
- Find the degree with the best average score
- Select the simplest model (lowest degree) within 1 standard error of the best, as in the sketch below
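A minimal sketch of that procedure, assuming cross_val_score with 5-fold CV on an illustrative synthetic dataset:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + 0.2 * rng.standard_normal(60)

degrees = list(range(1, 11))
means, sems = [], []
for d in degrees:
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    means.append(mse.mean())
    sems.append(mse.std(ddof=1) / np.sqrt(len(mse)))

means, sems = np.array(means), np.array(sems)
best = int(means.argmin())
threshold = means[best] + sems[best]  # best CV error plus one standard error
chosen = next(d for d, m in zip(degrees, means) if m <= threshold)
print(f"best degree by CV: {degrees[best]}, chosen by 1-SE rule: {chosen}")
```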
Q3: Why is sklearn’s parameter called alpha instead of λ?
This is purely a naming convention. In sklearn:
- alpha = $\lambda$ (the regularization strength)
- Higher alpha = more regularization = smaller weights = smoother curves
Q4: When should I use polynomial regression vs. other nonlinear methods?
| Method | Pros | Cons |
|---|---|---|
| Polynomial Regression | Simple, interpretable, fast | Can overfit, limited patterns |
| Splines | Smooth, local flexibility | Harder to interpret |
| Decision Trees | Captures complex patterns | Prone to overfitting |
| Neural Networks | Universal approximators | Need lots of data, black box |
Summary
| Concept | Key Points |
|---|---|
| Polynomial Features | Enable nonlinear fitting with linear model |
| L2 Regularization | $\lambda \|w\|^2$ penalty shrinks weights |
| Ridge Solution | $(X^T X + \lambda I)^{-1} X^T y$ |
| λ Selection | Use cross-validation |
References
- Hastie, T., Tibshirani, R., & Friedman, J. "The Elements of Statistical Learning", Chapter 3
- sklearn Ridge Regression
- Hoerl, A. E., & Kennard, R. W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems", Technometrics