2024-07-13
Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated with one another.
Ridge regression introduces a regularization term that penalizes large coefficients, helping to stabilize the model and prevent overfitting. This regularization term, also known as the L2 penalty, adds a constraint to the optimization process, influencing the model to choose smaller coefficients for the predictors. By striking a balance between fitting the data well and keeping the coefficients in check, ridge regression proves valuable in improving the robustness and performance of linear regression models, especially in situations with multicollinearity.

Feature standardization is a preprocessing step in machine learning where the input features are transformed to have a mean of 0 and a standard deviation of 1. This is typically achieved by subtracting the mean of each feature from its values and then dividing by the standard deviation.
If the population mean and population standard deviation are known, a raw variable \(x\) is converted into a standard score by
$$z=\frac{x-\mu}{\sigma}$$
Let's try to understand this with an example:
If one variable is the price of an apartment (in hundreds of thousands) and another is the number of rooms in that apartment (in units), the two quantities are difficult to compare directly. After standardization, both variables take values on a similar (if somewhat abstract) scale, while the shape of their distributions remains unchanged.
The primary reasons for feature standardization are:
• Magnitude Consistency:
Machine learning models that rely on distances or gradients, such as gradient descent-based optimization algorithms, are sensitive to the scale of the input features. Standardizing features ensures that all features contribute equally to the model, preventing some features from dominating others based solely on their scale.
• Convergence Speed:
Standardizing features can lead to faster convergence during the training of models. Optimization algorithms often converge more quickly when the features are on a similar scale, as it helps the algorithm navigate the parameter space more efficiently.
• Numerical Stability:
Standardization can enhance the numerical stability of computations. Large-scale differences in the ranges of features may lead to numerical precision issues, especially in models that involve matrix operations or exponentiation.
• Regularization Effectiveness:
In models that involve regularization, such as ridge regression or lasso regression, feature standardization ensures that the regularization term applies uniformly to all features. This helps prevent the model from assigning disproportionately large weights to certain features.
While feature standardization is beneficial in many cases, it may not be necessary for all machine learning algorithms. For instance, decision tree-based models are generally insensitive to the scale of features. However, for algorithms like support vector machines, k-nearest neighbors, and linear models, feature standardization is often recommended.
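As a quick illustration, standardization takes only a few lines of NumPy. This is a minimal sketch; the small matrix below is made up purely for demonstration (prices in hundreds of thousands in the first column, room counts in the second):

```python
import numpy as np

# Toy feature matrix with two columns on wildly different scales
# (hypothetical values, just for illustration)
X = np.array([[3.5, 2.0],
              [5.1, 4.0],
              [2.8, 1.0],
              [4.6, 3.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # each column mean is now ~0
print(X_std.std(axis=0))   # each column standard deviation is now 1
```

Note that `np.std` uses the population standard deviation (`ddof=0`) by default, which matches the \(z\)-score formula above.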

Suppose we have a dataset with genetic markers (A, B, C, D, E) as predictors and a trait (T) as the response variable. Here, the genetic markers represent variations at specific genomic locations, and the trait is a quantitative measure associated with a particular individual.
| Marker A | Marker B | Marker C | Marker D | Marker E | Trait |
|---|---|---|---|---|---|
| 0.8 | 1.2 | 0.5 | -0.7 | 1.0 | 3.2 |
| 1.0 | 0.8 | -0.4 | 0.5 | -1.2 | 2.5 |
| -0.5 | 0.3 | 1.2 | 0.9 | -0.1 | 1.8 |
| 0.2 | -0.9 | -0.7 | 1.1 | 0.5 | 2.9 |
First, we need to choose a value for the method's hyperparameter \(\lambda\).
Let \(\lambda=2\).
Taking data from the example, the matrix \(\mathbf{X}_{\operatorname{RAW}}\) takes the form:
$$
\mathbf{X}_{\operatorname{RAW}} = \left( \begin{array}{ccccc}
0.8 & 1.2 & 0.5 & -0.7 & 1.0 \\
1.0 & 0.8 & -0.4 & 0.5 & -1.2 \\
-0.5 & 0.3 & 1.2 & 0.9 & -0.1 \\
0.2 & -0.9 & -0.7 & 1.1 & 0.5 \\
\end{array} \right)
$$
Recall that before building a ridge regression model, we need to standardize the predictors so that each has a mean of zero and a standard deviation of one, using the \(z\)-score formula:
$$z=\frac{x-\mu}{\sigma}$$
We obtain a matrix of data after standardizing the predictors.
$$
\mathbf{X} = \left( \begin{array}{ccccc}
0.72686751 & 1.07733123 & 0.46666667 & -1.64706421 & 1.15845045 \\
1.06892281 & 0.57035183 & -0.73333333 & 0.07161149 & -1.5242769 \\
-1.49649194 & -0.06337243 & 1.4 & 0.64450339 & -0.18291323 \\
-0.29929839 & -1.58431063 & -1.13333333 & 0.93094934 & 0.54873968 \\
\end{array} \right)
$$
Unfortunately, after standardization, the numbers may not be convenient for calculations. Usually, the computer handles the computations for us, but if we want to trace step by step how ridge regression works, we have to deal with inconvenient numbers.
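As a sanity check, the first entry of the standardized matrix can be reproduced by hand: column A has mean \(0.375\) and population standard deviation \(\approx 0.5847\), so \(z=(0.8-0.375)/0.5847\approx 0.7269\). The same standardization in NumPy:

```python
import numpy as np

# Raw predictor matrix from the table above
X_raw = np.array([[0.8, 1.2, 0.5, -0.7, 1.0],
                  [1.0, 0.8, -0.4, 0.5, -1.2],
                  [-0.5, 0.3, 1.2, 0.9, -0.1],
                  [0.2, -0.9, -0.7, 1.1, 0.5]])

# z-score each column (population standard deviation, ddof=0)
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

print(X[0, 0])  # ~0.72686751, matching the standardized matrix above
```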
To build a ridge regression model essentially means to find the coefficient vector \(\beta=(\beta_1,\dots,\beta_p)\). From our previous considerations, we already know the recipe to obtain it:
$$\beta=(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{Y}$$
So, first, we need to multiply the transposed matrix \(\mathbf{X}^T\) by the matrix \(\mathbf{X}\), and then add the identity matrix \(\mathbf{I}\) scaled by the hyperparameter \(\lambda\) (in our case \(\lambda=2\)).
$$
\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I} = \left( \begin{array}{ccccc}
6. & 1.96175708 & -2.20055576 & -2.36377607 & -0.67780309 \\
1.96175708 & 6. & 1.79132721 & -3.24934663 & -0.47912173 \\
-2.20055576 & 1.79132721 & 6. & -0.97391623 & 0.78042977 \\
-2.36377607 & -3.24934663 & -0.97391623 & 6. & -1.62423735 \\
-0.67780309 & -0.47912173 & 0.78042977 & -1.62423735 & 6. \\
\end{array} \right)
$$
Then we need to invert it.
$$
(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1} = \left( \begin{array}{ccccc}
0.28919553 & -0.07688096 & 0.14132356 & 0.10514749 & 0.03661227 \\
-0.07688096 & 0.30018168 & -0.1046914 & 0.13284326 & 0.06486445 \\
0.14132356 & -0.1046914 & 0.25776258 & 0.03647517 & -0.01604861 \\
0.10514749 & 0.13284326 & 0.03647517 & 0.3137487 & 0.10267557 \\
0.03661227 & 0.06486445 & -0.01604861 & 0.10267557 & 0.2058647 \\
\end{array} \right)
$$
And finally, we multiply the inverse by \(\mathbf{X}^T\) and then by \(\mathbf{Y}\), obtaining:
$$
\beta = \left( \begin{array}{c}
0.1593035 \\
-0.03361389 \\
-0.16299572 \\
-0.13492445 \\
0.19306318 \\
\end{array} \right)
$$
Notice that the results depend on the value of \(\lambda\): if we change \(\lambda\), we get a different coefficient vector. Plotting the estimated coefficients \(\beta\) against \(\lambda\) (the so-called ridge path) shows them shrinking toward zero as \(\lambda\) grows.
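This dependence on \(\lambda\) can be traced numerically by recomputing the closed-form solution for several values. A small sketch, reusing the standardized data from the worked example:

```python
import numpy as np

X_raw = np.array([[0.8, 1.2, 0.5, -0.7, 1.0],
                  [1.0, 0.8, -0.4, 0.5, -1.2],
                  [-0.5, 0.3, 1.2, 0.9, -0.1],
                  [0.2, -0.9, -0.7, 1.1, 0.5]])
y = np.array([3.2, 2.5, 1.8, 2.9])
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def ridge_coef(lam):
    """Closed-form ridge solution of (X^T X + lam*I) beta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.identity(X.shape[1]), X.T @ y)

# The coefficients shrink toward zero as lambda grows
for lam in [0.1, 2, 10, 100]:
    print(lam, np.round(ridge_coef(lam), 4))
```

For \(\lambda=2\) this reproduces the coefficient vector computed above; for very large \(\lambda\) the penalty dominates and all coefficients approach zero.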

We estimated coefficients \(\beta_1, ..., \beta_p\), but what about the intercept term \(\beta_0\)?
We mentioned that, assuming the predictors have been centered to have a mean of zero before conducting ridge regression, the estimated intercept takes the form:
$$\beta_0=\frac{1}{n}\sum_{i=1}^{n}y_i$$
In our case
$$\beta_0=\frac{3.2+2.5+1.8+2.9}{4}=2.6$$
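In code, this intercept is simply the mean of the response vector:

```python
import numpy as np

y = np.array([3.2, 2.5, 1.8, 2.9])
beta_0 = y.mean()
print(beta_0)  # 2.6
```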
Now, to calculate predictions, we need to apply the formula:
$$\mathbf{\hat{Y}} = \mathbf{X} \beta +\beta_0$$
If we wanted to predict the values of the target variable using our model, we would obtain the following values:
$$
\mathbf{\hat{Y}} = \mathbf{X} \beta +\beta_0=$$
$$=\left( \begin{array}{ccccc}
0.72686751 & 1.07733123 & 0.46666667 & -1.64706421 & 1.15845045 \\
1.06892281 & 0.57035183 & -0.73333333 & 0.07161149 & -1.5242769 \\
-1.49649194 & -0.06337243 & 1.4 & 0.64450339 & -0.18291323 \\
-0.29929839 & -1.58431063 & -1.13333333 & 0.93094934 & 0.54873968 \\
\end{array} \right)
\left( \begin{array}{c}
0.1593035 \\
-0.03361389 \\
-0.16299572 \\
-0.13492445 \\
0.19306318 \\
\end{array} \right)
+
\left( \begin{array}{c}
2.6 \\
2.6 \\
2.6 \\
2.6 \\
2.6 \\
\end{array} \right)
=$$
$$=
\left( \begin{array}{c}
3.04939794 \\
2.56669771 \\
2.01326671 \\
2.77063765
\end{array} \right)
$$
We can now compare our predictions with the actual values to demonstrate that the calculations lead to good results.
$$
\mathbf{\hat{Y}}=
\left( \begin{array}{c}
3.04939794 \\
2.56669771 \\
2.01326671 \\
2.77063765
\end{array} \right)
\hspace{10mm}
\mathbf{Y}=
\left( \begin{array}{c}
3.2 \\
2.5 \\
1.8 \\
2.9
\end{array} \right)
$$
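One simple way to quantify this agreement is the mean squared error between predictions and actual values. A quick check, reusing the numbers above:

```python
import numpy as np

# Predicted and actual trait values from the worked example
y_hat = np.array([3.04939794, 2.56669771, 2.01326671, 2.77063765])
y = np.array([3.2, 2.5, 1.8, 2.9])

mse = np.mean((y - y_hat) ** 2)
print(mse)  # ~0.022: predictions are within a few tenths of the actual values
```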
import numpy as np

LAMBDA = 2  # shrinkage parameter lambda

# Define the dataset (X, y)
X = np.array([[0.8, 1.2, 0.5, -0.7, 1.0],
              [1.0, 0.8, -0.4, 0.5, -1.2],
              [-0.5, 0.3, 1.2, 0.9, -0.1],
              [0.2, -0.9, -0.7, 1.1, 0.5]])
y = np.array([3.2, 2.5, 1.8, 2.9])

# Standardize the predictors (mean 0, standard deviation 1)
X_scale = (X - X.mean(axis=0)) / X.std(axis=0)

# RIDGE REGRESSION MODEL - coefficient estimation
# X^T X + LAMBDA*I
x1 = np.matmul(X_scale.T, X_scale) + LAMBDA * np.identity(5)
# Invert the obtained matrix: (X^T X + LAMBDA*I)^{-1}
x1_inv = np.linalg.inv(x1)
# (X^T X + LAMBDA*I)^{-1} X^T
x2 = np.matmul(x1_inv, X_scale.T)
# ((X^T X + LAMBDA*I)^{-1} X^T) y
coef = np.matmul(x2, y)

# Estimated coefficients
print(coef)
# [0.1593035 -0.03361389 -0.16299572 -0.13492445 0.19306318]

# Predictions: X beta + beta_0, with beta_0 = mean(y)
print(np.matmul(X_scale, coef) + y.mean())
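As a side note, explicitly inverting \(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}\) is fine for a hand-traced example, but in practice it is numerically safer to solve the linear system directly. The sketch below performs the same estimation with `np.linalg.solve` and should reproduce the coefficients above:

```python
import numpy as np

LAMBDA = 2
X = np.array([[0.8, 1.2, 0.5, -0.7, 1.0],
              [1.0, 0.8, -0.4, 0.5, -1.2],
              [-0.5, 0.3, 1.2, 0.9, -0.1],
              [0.2, -0.9, -0.7, 1.1, 0.5]])
y = np.array([3.2, 2.5, 1.8, 2.9])
X_scale = (X - X.mean(axis=0)) / X.std(axis=0)

# Solve (X^T X + LAMBDA*I) beta = X^T y instead of forming the inverse
coef = np.linalg.solve(X_scale.T @ X_scale + LAMBDA * np.identity(5),
                       X_scale.T @ y)
print(coef)  # same coefficients as the inverse-based computation
```

Avoiding the explicit inverse is the standard recommendation for linear systems: it is both faster and less prone to round-off error, which matters once the number of predictors grows beyond a toy example.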
In this blog post, we explored the concept of ridge regression, a valuable technique in regression analysis, particularly useful when dealing with multicollinearity or situations where the number of predictors exceeds the number of observations. By introducing a regularization term controlled by the hyperparameter \(\lambda\), ridge regression strikes a balance between variance and bias, leading to more stable and reliable models.
We discussed the key steps involved in ridge regression, from standardizing predictors to estimating coefficients and making predictions. Through simple examples and explanations, we demonstrated how ridge regression works, emphasizing its ability to handle challenging scenarios such as high-dimensional data or correlated predictors.
Furthermore, we highlighted the importance of selecting an appropriate value for the hyperparameter \(\lambda\) and showcased how different values of \(\lambda\) influence the estimated coefficients and, consequently, the model's predictions.
Overall, ridge regression offers a powerful tool for improving the robustness and performance of linear regression models, making it a valuable technique in the toolkit of data scientists and analysts working with regression problems. By understanding its principles and applications, practitioners can leverage ridge regression to build more accurate and reliable predictive models in various domains.
Happy learning! 😁