Ridge Regression - Derivation and Code from Scratch

Tahera Firdose
6 min read · Jun 27, 2023



Ridge Regression is a linear regression technique that incorporates a regularization term, known as the L2 penalty, into the ordinary least squares objective function. It addresses the issues of multicollinearity and overfitting by adding a small amount of bias to the model, resulting in more stable and robust predictions.

In Ridge Regression, the goal is to find the best-fit line or hyperplane that minimizes the sum of squared residuals between the predicted and actual values of the response variable. However, in addition to this objective, Ridge Regression also aims to minimize the sum of squared coefficients, where the coefficients represent the weights assigned to each predictor variable.

The Ridge Regression objective function can be expressed as:

minimize: ||y - Xw||² + alpha * ||w||²

where:

y is the vector of response variable values.

X is the matrix of predictor variables.

w is the vector of coefficient estimates.

alpha is the regularization parameter that controls the strength of the penalty term.

The penalty term, alpha * ||w||², is the squared L2 norm of the coefficient vector w, multiplied by alpha. It adds a penalty based on the square of the magnitude of the coefficients. By doing so, Ridge Regression shrinks the coefficient estimates towards zero, effectively reducing their impact on the model.

The regularization parameter alpha determines the amount of shrinkage applied to the coefficients. A higher value of alpha increases the amount of shrinkage, leading to smaller coefficient estimates. On the other hand, an alpha value of zero eliminates the penalty term, reverting Ridge Regression to ordinary least squares regression.
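
As a quick illustration, the objective can be evaluated directly with NumPy. The tiny arrays below are made-up examples, not the article's data:

```python
import numpy as np

def ridge_objective(X, y, w, alpha):
    """Ridge objective: squared residuals plus the L2 penalty on the weights."""
    residual = y - X @ w
    return residual @ residual + alpha * (w @ w)

# Tiny made-up example: 3 samples, 2 features
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([3.0, 2.5, 5.0])
w = np.array([1.0, 0.5])

print(ridge_objective(X, y, w, alpha=0.0))  # pure OLS loss
print(ridge_objective(X, y, w, alpha=1.0))  # larger, because the penalty is added
```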

Derivation of Ridge Regression using OLS

To derive the Ridge Regression coefficients, we will use the Ordinary Least Squares (OLS) method as a starting point. The OLS method aims to minimize the sum of squared residuals (SSR) without any regularization.

In OLS, the coefficient vector w is estimated by minimizing the SSR, which can be expressed as:

minimize: SSR = ||y - Xw||²

To derive the Ridge Regression coefficients, we need to modify the OLS objective function by adding the regularization term alpha * ||w||². The modified objective function becomes:

minimize: SSR + alpha * ||w||²

To find the coefficients that minimize this modified objective function, we differentiate it with respect to w and set the derivative equal to zero. Let’s go through the derivation step by step.

Start with the modified objective function:

minimize: SSR + alpha * ||w||²

Expand the SSR term:

SSR = (y - Xw)^T(y - Xw)

We can expand the equation using the properties of matrix multiplication and transpose:

SSR = (y^T - (Xw)^T)(y - Xw)

= y^Ty - y^TXw - (Xw)^Ty + (Xw)^TXw

Since (Xw)^T = w^TX^T (transpose of a product is the product of transposes in reverse order):

SSR = y^Ty - y^TXw - w^TX^Ty + w^TX^TXw

Since w^TX^Ty is a scalar, it equals its own transpose, (w^TX^Ty)^T = y^TXw, so the two cross terms combine:

SSR = y^Ty - 2y^TXw + w^TX^TXw

Differentiate the objective function with respect to w:

d/dw (SSR + alpha * ||w||²) = 0

d/dw ((y^Ty - 2y^TXw + w^TX^TXw) + (alpha * ||w||²))

Differentiate the first term SSR with respect to w. Note that SSR is y^Ty - 2y^TXw + w^TX^TXw.

· Differentiating y^Ty with respect to w gives us 0 because it does not depend on w.

· Differentiating -2y^TXw with respect to w gives us -2X^Ty (writing the gradient as a column vector).

· Differentiating w^TX^TXw with respect to w uses the rule for quadratic forms. The derivative is (X^TX + (X^TX)^T)w = 2X^TXw (since X^TX is symmetric).

Differentiate the second term alpha * ||w||² with respect to w. Here, ||w||² represents the L2 norm squared of w.

· Differentiating alpha * ||w||² with respect to w gives us 2alpha * w.

Combine the derivatives and set the equation equal to zero:

-2X^Ty + 2X^TXw + 2alpha * w = 0

Rearrange the equation:

2X^TXw + 2alpha * w = 2X^Ty

Divide both sides by 2 and factor out w (writing alpha * w as alpha * I * w, where I is the identity matrix):

(X^TX + alpha * I)w = X^Ty

w = inv(X^TX + alpha * I) * X^Ty

This equation provides the Ridge Regression coefficients w that minimize the modified objective function.
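
This closed-form solution translates into a few lines of NumPy. The sketch below penalizes every coefficient equally and does not treat an intercept separately; that refinement appears in the custom class later on:

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for the Ridge coefficients."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    b = X.T @ y
    # Solving the linear system is numerically safer than explicitly inverting the matrix
    return np.linalg.solve(A, b)
```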

Advantages and Limitations of Ridge Regression

Ridge Regression offers several advantages and can be beneficial in certain scenarios:

1. Reducing multicollinearity: Ridge Regression handles multicollinearity effectively by shrinking the coefficients of highly correlated variables. This helps to stabilize the model and reduce the impact of multicollinearity on coefficient estimates.

2. Preventing overfitting: The regularization term in Ridge Regression introduces a penalty on the magnitude of coefficients. This helps to prevent overfitting, especially when dealing with high-dimensional data or when the number of predictors exceeds the number of observations.

3. Bias-variance trade-off: Ridge Regression strikes a balance between bias and variance. By introducing the regularization term, it increases the bias slightly but reduces the variance. This can lead to better generalization performance in some cases.

However, Ridge Regression also has some limitations:

1. Lack of coefficient interpretability: The regularization in Ridge Regression tends to shrink the coefficients towards zero, making them less interpretable. The emphasis is on the relative importance of variables rather than the specific magnitude of coefficients.

2. Choosing the regularization parameter: The selection of the regularization parameter alpha is crucial in Ridge Regression. It determines the amount of shrinkage applied to the coefficients. Choosing an optimal value for alpha requires careful consideration and often involves cross-validation techniques.

3. Sensitivity to feature scaling: Ridge Regression is not invariant to feature scaling. Therefore, it is essential to scale the predictor variables before applying Ridge Regression to avoid biased coefficient estimates.

Load the Boston dataset
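
A sketch of this step, assuming the Boston housing data is available as a local CSV file (boston.csv is a hypothetical path) with the usual MEDV target column:

```python
import pandas as pd

# Hypothetical path; point this at wherever your copy of the Boston housing data lives
df = pd.read_csv("boston.csv")
print(df.shape)
print(df.head())
```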

We can check for missing values; as shown below, the data does not have any nulls
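
```python
# Count missing values per column; all zeros confirms there are no nulls
print(df.isnull().sum())
```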

Separate the independent and dependent columns
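
Assuming MEDV (median house value) is the target column, the split could be:

```python
# MEDV is the dependent variable; every other column is a predictor
X = df.drop(columns=["MEDV"])
y = df["MEDV"]
```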

Apply StandardScaler to scale the independent features so that each has mean 0 and standard deviation 1
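
A sketch of the scaling step:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each feature now has mean 0 and standard deviation 1
```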

Apply train_test_split to divide the data into 70% training data and 30% testing data
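
For example (the random_state value here is an arbitrary choice):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42  # random_state chosen arbitrarily
)
```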

Apply the Linear Regression model and check the intercept, coefficients, and R2 score
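
A sketch of this step using scikit-learn:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lr = LinearRegression()
lr.fit(X_train, y_train)

print("Intercept:", lr.intercept_)
print("Coefficients:", lr.coef_)
print("Train R2:", r2_score(y_train, lr.predict(X_train)))
print("Test R2:", r2_score(y_test, lr.predict(X_test)))
```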

Check the adjusted R2 score
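
scikit-learn has no built-in adjusted R2 metric, so a small helper (adjusted_r2 below is my own convenience function, not a library call) can compute it from the test R2 score:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

test_r2 = r2_score(y_test, lr.predict(X_test))
print("Adjusted R2 (test):", adjusted_r2(test_r2, X_test.shape[0], X_test.shape[1]))
```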

The model's R2 score is lower on the test data, so there is a chance of overfitting; let's address this using regularization

Apply cross-validation to find the best alpha value
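
One way to do this is with GridSearchCV over a grid of alpha values; the grid below is an assumption, not necessarily the exact one used in the article's notebook:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

param_grid = {"alpha": np.arange(0.1, 100, 0.05)}  # assumed search range
grid = GridSearchCV(Ridge(), param_grid, scoring="r2", cv=5)
grid.fit(X_train, y_train)

print("Best alpha:", grid.best_params_["alpha"])
```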

We can see the best alpha value came out to be 24.35

Now let's apply Ridge from sklearn.linear_model and pass the best alpha value found above
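
A sketch of this step, plugging in the 24.35 value reported above:

```python
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=24.35)  # the best alpha found by cross-validation above
ridge.fit(X_train, y_train)

print("Intercept:", ridge.intercept_)
print("Coefficients:", ridge.coef_)
print("Test R2:", r2_score(y_test, ridge.predict(X_test)))
```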

We can see that adding the penalty term slightly shifts the coefficients

Now let's build a custom Ridge class from scratch that behaves the same as the Ridge class from sklearn
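
A minimal sketch of such a class is shown below. It implements the closed-form solution derived earlier, appends a column of ones for the intercept, and leaves the intercept out of the penalty (as scikit-learn's Ridge does); the class and attribute names are my own choices, not necessarily those in the article's notebook:

```python
import numpy as np

class CustomRidge:
    """Ridge regression via the closed-form solution w = inv(X^T X + alpha * I) X^T y."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.intercept_ = None
        self.coef_ = None

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Prepend a column of ones so the first weight plays the role of the intercept
        Xb = np.c_[np.ones(X.shape[0]), X]
        I = np.eye(Xb.shape[1])
        I[0, 0] = 0.0  # do not penalize the intercept, matching sklearn's Ridge
        w = np.linalg.solve(Xb.T @ Xb + self.alpha * I, Xb.T @ y)
        self.intercept_ = w[0]
        self.coef_ = w[1:]
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.coef_ + self.intercept_
```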

We can compare the intercept, coefficients, and adjusted R-squared values obtained from both Ridge Regression using scikit-learn’s Ridge class and our custom implementation. The results from both approaches are expected to be identical. By comparing these metrics, we can assess the consistency and accuracy of our custom Ridge Regression function.
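
A quick comparison, reusing the fitted sklearn ridge model and the CustomRidge class from above, might look like this:

```python
custom = CustomRidge(alpha=24.35).fit(X_train, y_train)

print("sklearn intercept:", ridge.intercept_, "| custom intercept:", custom.intercept_)
print("Max coefficient difference:", np.max(np.abs(ridge.coef_ - custom.coef_)))
print("Custom test R2:", r2_score(y_test, custom.predict(X_test)))
```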

Conclusion

Ridge Regression is a powerful regularization technique that addresses the challenges of multicollinearity and overfitting in linear regression models. By introducing a penalty term based on the L2 norm of the coefficient estimates, Ridge Regression strikes a balance between bias and variance, resulting in more robust and stable models.

Follow me on https://www.linkedin.com/in/tahera-firdose/

Github: https://github.com/taherafirdose/100-days-of-Machine-Learning
