ML-07: Nonlinear Regression and Regularization

Summary
From straight lines to curves: master polynomial features, understand the bias-variance tradeoff, and tame overfitting with Ridge regularization and cross-validation.

Learning Objectives

  • Transform features for nonlinear regression
  • Understand why regularization works
  • Implement Ridge regression (L2)
  • Choose the regularization strength

Theory

Polynomial Features

Linear regression can only capture straight lines. To model curves, polynomial features transform the input:

$$x \rightarrow (1, x, x^2, x^3, \ldots, x^d)$$

The model becomes:

$$\hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \cdots + w_d x^d$$

Key insight: Despite the curved output, the model remains linear in the parameters $w$. The same normal equation applies—only the feature matrix changes.

| Original Feature | Polynomial Features (degree 3) |
|---|---|
| $x = 2$ | $(1, 2, 4, 8)$ |
| $x = 3$ | $(1, 3, 9, 27)$ |
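
A quick check of the table above, written as a minimal sketch using scikit-learn's PolynomialFeatures (the same transformer used in the Code Practice section below):

🐍 Python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two samples of a single feature: x = 2 and x = 3
x = np.array([[2.0], [3.0]])

# Degree-3 expansion, keeping the bias column of ones
poly = PolynomialFeatures(degree=3, include_bias=True)
print(poly.fit_transform(x))
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]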

The Overfitting Problem

Higher-degree polynomials have more flexibility to fit training data, but this comes at a cost:

| Degree | Behavior | Training Error | Test Error |
|---|---|---|---|
| Too low (1) | Underfitting: cannot capture the pattern | High | High |
| Just right (4) | Good fit: captures trend without noise | Low | Low |
| Too high (15) | Overfitting: memorizes noise | Very low | Very high |

Figure: Effect of polynomial degree: degree-1 underfits (high bias), degree-4 balances well, degree-15 overfits (high variance).

The Bias-Variance Tradeoff:

  • Bias = error from overly simple assumptions (underfitting)
  • Variance = error from sensitivity to training data fluctuations (overfitting)
  • The optimal model is complex enough to capture the signal, yet simple enough to avoid fitting noise.
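
To see the tradeoff in numbers rather than pictures, the sketch below (an illustration only, assuming noisy sine data similar to the Code Practice section) compares training and held-out error for the three degrees from the table:

🐍 Python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy sine data, split into a training and a test half
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2, size=(60, 1)), axis=0)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for d in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(d), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree {d:2d}:",
          f"train MSE = {mean_squared_error(y_tr, model.predict(X_tr)):.3f},",
          f"test MSE = {mean_squared_error(y_te, model.predict(X_te)):.3f}")
# Typical pattern: degree 1 has high error everywhere (bias);
# degree 15 has low training error but much higher test error (variance).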

Ridge Regression (L2 Regularization)

Instead of restricting model complexity by limiting degree, regularization adds a penalty term to the loss function:

$$J(\boldsymbol{w}) = \lVert y - X\boldsymbol{w} \rVert^2 + \lambda \lVert \boldsymbol{w} \rVert^2$$

The first term measures data fit (prediction error), while the second term is the penalty (weight magnitude).

Why does penalizing large weights help?

  • Large weights enable sharp oscillations in the fitted curve
  • Constraining weights to remain small produces smoother curves
  • Smaller weights correspond to simpler, more generalizable models
Figure: Regularization prevents overfitting: the left curve (degree 15) oscillates wildly, while the right curve smoothly captures the underlying trend.

| $\lambda$ | Effect | Resulting Model |
|---|---|---|
| 0 | No regularization = ordinary least squares | May overfit |
| Small | Light constraint on weights | Slight smoothing |
| Large | Heavy constraint → weights shrink toward 0 | May underfit |
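
The link between weight size and oscillation can be made concrete: the sketch below (again on illustrative noisy sine data) fits the same degree-15 model with and without the penalty and prints the largest coefficient magnitude.

🐍 Python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 2, size=(20, 1)), axis=0)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, size=20)

for name, reg in [("OLS", LinearRegression()), ("Ridge", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(15), reg).fit(X, y)
    print(f"{name:5s}: max |w| = {np.abs(model[-1].coef_).max():.2f}")
# The unregularized fit typically needs far larger coefficients to chase the
# noise; the penalized fit keeps them small and the curve smooth.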

Closed-Form Solution

The Ridge regression solution has a closed form:

$$\boldsymbol{w}^* = (X^T X + \lambda I)^{-1} X^T y$$

Comparing to standard linear regression:

| Method | Solution | Matrix Invertibility |
|---|---|---|
| OLS | $(X^T X)^{-1} X^T y$ | Can fail if $X^T X$ is singular |
| Ridge | $(X^T X + \lambda I)^{-1} X^T y$ | Always invertible for $\lambda > 0$ ($\lambda I$ makes the matrix positive definite) |

💡 Tip: Ridge regression solves two problems at once: it prevents overfitting AND guarantees a solution even when features are collinear.
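
Since the formula is one line of linear algebra, it is easy to verify against scikit-learn. A minimal sketch, assuming fit_intercept=False so that both sides solve exactly the same penalized problem (scikit-learn normally leaves the intercept unpenalized):

🐍 Python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=50)

lam = 1.0
# Closed form: w* = (X^T X + lambda*I)^{-1} X^T y (solve is preferred over inv)
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Ridge penalizes only the weights, so disable the intercept for an exact match
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(w_closed, w_sklearn))  # expected: True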

Code Practice

Polynomial Overfitting Demo

🐍 Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline

# Noisy samples from a sine curve on [0, 2]
np.random.seed(42)
X = np.sort(np.random.rand(20, 1) * 2, axis=0)
y = np.sin(2 * np.pi * X).ravel() + np.random.randn(20) * 0.3

# Dense grid for plotting the fitted curves
X_test = np.linspace(0, 2, 100).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(14, 4))
degrees = [1, 4, 15]

# Fit and plot one polynomial model per degree
for ax, d in zip(axes, degrees):
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X, y)

    ax.scatter(X, y, c='blue')
    ax.plot(X_test, model.predict(X_test), 'r-', linewidth=2)
    ax.set_title(f'Degree {d}')
    ax.set_ylim(-2, 2)

plt.savefig('assets/poly_overfit.png')
Figure: Effect of polynomial degree on model fit: degree-1 underfits, degree-4 fits well, degree-15 overfits.

Ridge Regression Comparison

🐍 Python
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
alphas = [0, 0.1, 10]  # Regularization strength

for ax, alpha in zip(axes, alphas):
    # alpha = 0 is just ordinary least squares, so fit LinearRegression directly
    if alpha == 0:
        model = make_pipeline(PolynomialFeatures(15), LinearRegression())
    else:
        model = make_pipeline(PolynomialFeatures(15), Ridge(alpha=alpha))

    model.fit(X, y)

    ax.scatter(X, y, c='blue')
    ax.plot(X_test, model.predict(X_test), 'r-', linewidth=2)
    ax.set_title(f'λ = {alpha}')
    ax.set_ylim(-2, 2)

plt.savefig('assets/ridge.png')
Figure: Effect of regularization strength: λ=0 overfits, λ=0.1 smooths slightly, λ=10 produces a stable fit.

Finding Optimal λ with Cross-Validation

🐍 Python
from sklearn.linear_model import RidgeCV

# Try multiple alpha values
alphas = np.logspace(-4, 4, 100)
ridge_cv = make_pipeline(
    PolynomialFeatures(10),
    RidgeCV(alphas=alphas, cv=5)
)
ridge_cv.fit(X, y)

# make_pipeline names each step after its lowercased class name
best_alpha = ridge_cv.named_steps['ridgecv'].alpha_
print(f"Best λ: {best_alpha:.4f}")

Output:

Best λ: 31.2572

Deep Dive

FAQ

Q1: Ridge vs. Lasso — what’s the difference?

| Regularization | Penalty | Effect on Weights | Best For |
|---|---|---|---|
| Ridge (L2) | $\lambda \lVert w \rVert^2$ | Shrinks all weights toward zero | When all features contribute |
| Lasso (L1) | $\lambda \lVert w \rVert_1$ | Can set weights exactly to zero | Feature selection |
| Elastic Net | Both L1 + L2 | Combines benefits of both | Many correlated features |

Geometric intuition:

  • Ridge penalty creates a circular constraint region → weights shrink but never reach exactly zero
  • Lasso penalty creates a diamond-shaped region → solutions tend to land on corners (sparse weights)
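
The contrast shows up directly in the fitted coefficients. A sketch (illustrative data; features standardized so the two penalties act on comparable scales) that counts how many weights each method drives exactly to zero:

🐍 Python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2, size=(50, 1)), axis=0)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, size=50)

for name, reg in [("Ridge", Ridge(alpha=0.1)),
                  ("Lasso", Lasso(alpha=0.1, max_iter=100_000))]:
    model = make_pipeline(PolynomialFeatures(10, include_bias=False),
                          StandardScaler(), reg).fit(X, y)
    coefs = model[-1].coef_
    print(f"{name}: {np.sum(coefs == 0)} of {coefs.size} weights are exactly zero")
# Ridge shrinks every weight but leaves it nonzero;
# Lasso typically zeroes out several of the polynomial terms.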

Q2: How do I choose between polynomial degrees?

Use cross-validation with the “one standard error rule”:

  1. Try degrees 1 through 10 (or higher)
  2. Compute cross-validation score for each degree
  3. Find the degree with the best average score
  4. Select the simplest model (lowest degree) within 1 standard error of the best
💡 Tip: Prefer simpler models when performance is similar. A degree-3 polynomial with slightly higher error is often better than a degree-8 that barely improves accuracy.
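
A sketch of that procedure (illustrative data again; scikit-learn's neg_mean_squared_error scoring is used, so higher scores are better):

🐍 Python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2, size=(60, 1)), axis=0)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, size=60)

degrees = list(range(1, 11))
means, sems = [], []
for d in degrees:
    scores = cross_val_score(make_pipeline(PolynomialFeatures(d), LinearRegression()),
                             X, y, cv=5, scoring="neg_mean_squared_error")
    means.append(scores.mean())
    sems.append(scores.std(ddof=1) / np.sqrt(len(scores)))

best = int(np.argmax(means))
threshold = means[best] - sems[best]   # anything scoring within 1 SE of the best
one_se_degree = next(d for d, m in zip(degrees, means) if m >= threshold)
print(f"best-scoring degree: {degrees[best]}, one-SE choice: {one_se_degree}")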

Q3: Why is sklearn’s parameter called alpha instead of λ?

This is purely a naming convention. In sklearn:

  • alpha = $\lambda$ (regularization strength)
  • Higher alpha = more regularization = smaller weights = smoother curves
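
For example (a sketch on illustrative data), sweeping alpha over a degree-10 fit shows the weight norm shrinking as the regularization strength grows:

🐍 Python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2, size=(30, 1)), axis=0)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, size=30)

for alpha in [0.01, 1.0, 100.0]:
    model = make_pipeline(PolynomialFeatures(10), Ridge(alpha=alpha)).fit(X, y)
    print(f"alpha = {alpha:>6}: ||w|| = {np.linalg.norm(model[-1].coef_):.3f}")
# Larger alpha (i.e. larger lambda) -> smaller weights -> smoother curve.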

Q4: When should I use polynomial regression vs. other nonlinear methods?

| Method | Pros | Cons |
|---|---|---|
| Polynomial Regression | Simple, interpretable, fast | Can overfit, limited patterns |
| Splines | Smooth, local flexibility | Harder to interpret |
| Decision Trees | Capture complex patterns | Prone to overfitting |
| Neural Networks | Universal approximators | Need lots of data, black box |
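
For the spline row in the table above, scikit-learn (version 1.0 or newer) provides SplineTransformer, which slots into the same pipeline pattern used throughout this lesson; a minimal sketch on illustrative data:

🐍 Python
import numpy as np
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2, size=(50, 1)), axis=0)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, size=50)

# Piecewise cubic basis (local flexibility) instead of one global polynomial
spline_model = make_pipeline(SplineTransformer(n_knots=8, degree=3), Ridge(alpha=0.1))
spline_model.fit(X, y)
print(spline_model.predict(np.array([[0.5], [1.5]])))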

Summary

| Concept | Key Points |
|---|---|
| Polynomial Features | Enable nonlinear fitting with a linear model |
| L2 Regularization | $\lambda \lVert w \rVert^2$ penalty shrinks weights |
| Ridge Solution | $(X^T X + \lambda I)^{-1} X^T y$ |
| λ Selection | Use cross-validation |

References

  • Hastie, T. et al. “The Elements of Statistical Learning” - Chapter 3
  • sklearn Ridge Regression
  • Hoerl, A. & Kennard, R. (1970). “Ridge Regression”