
Ridge Regression


This PR implements the Ridge regression model as part of the RustQuant_ml crate.

Ridge regression extends linear regression by adding an L2 regularisation term to the loss function, penalising large coefficient values to reduce overfitting.

This implementation is designed to align closely with Scikit-Learn's linear_model.Ridge. The equivalent Scikit-Learn implementation, run on the same data as the unit tests in this PR, is available here.

Take a feature matrix $X$, a response vector $\mathbf{y}$ and a regularisation parameter $\lambda>0$.

The loss function for a Ridge regression model is:

$$ C:=\lVert \mathbf{y} - X\beta \rVert^2_2 + \lambda\lVert\beta\rVert^2_2 $$
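As an illustration, this loss can be evaluated directly with a linear algebra crate. Below is a minimal sketch assuming nalgebra types; the function name `ridge_loss` is purely illustrative and is not the API added in this PR.

```rust
use nalgebra::{DMatrix, DVector};

// Evaluate C = ||y - X beta||_2^2 + lambda * ||beta||_2^2.
// Illustrative only; not the API added in this PR.
fn ridge_loss(x: &DMatrix<f64>, y: &DVector<f64>, beta: &DVector<f64>, lambda: f64) -> f64 {
    let residual = y - x * beta;
    residual.norm_squared() + lambda * beta.norm_squared()
}
```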

The optimal value of $\beta$ has a closed-form solution. The loss function above can be written as

$$ \left(\mathbf{y}-X\beta\right)^T\left(\mathbf{y}-X\beta\right) + \lambda\beta^T\beta $$

Expanding gives

$$ \mathbf{y}^T\mathbf{y} -\beta^TX^T\mathbf{y} - \underbrace{\mathbf{y}^TX\beta}_{*}+\underbrace{\beta^TX^TX\beta+\beta^T \lambda \cdot I_{\text{d}} \ \beta}_{**} $$

where $I_{\text{d}}$ is the identity matrix.

Note that $*$ is a scalar value, therefore

$$ \mathbf{y}^TX\beta = \left(\mathbf{y}^TX\beta\right)^T = \beta^TX^T\mathbf{y} $$

We can also combine the terms in ** to give:

$$ \beta^TX^TX\beta+\beta^T \lambda \cdot \ I_\text{d} \ \beta = \beta^T\left(X^TX + \lambda \cdot I_\text{d} \right)\beta $$

Now we can further simplify the loss function:

$$ \mathbf{y}^T\mathbf{y} -\beta^TX^T\mathbf{y} - \beta^TX^T\mathbf{y}+\beta^T\left(X^TX + \lambda \cdot I_\text{d} \right)\beta $$

$$ \Rightarrow \mathbf{y}^T\mathbf{y} -2\beta^TX^T\mathbf{y} + \beta^T\left(X^TX + \lambda \cdot I_\text{d} \right)\beta $$

Now calculate the derivative with respect to $\beta$ and set to $0$ to find the optimal values of $\beta$:

$$ \left.\frac{\partial C}{\partial \beta} \right\vert_{\beta=\hat{\beta}}= -2 X^T \mathbf{y} + \underbrace{2\left(X^TX + \lambda I_{\text{d}}\right)\hat{\beta}}_{***} = 0 $$

Note that *** was derived using the fact that

$$ \frac{\partial}{\partial \mathbf{x}}\left[\mathbf{x}^TA\mathbf{x}\right]=\left(A + A^T\right)\mathbf{x} $$

and, if $A$ is symmetric, this simplifies to $2A\mathbf{x}$. Here $A = X^TX + \lambda I_{\text{d}}$, which is symmetric, giving the term $***$.

Solving for $\hat{\beta}$:

$$ \left(X^TX + \lambda I_{\text{d}}\right)\hat{\beta} = X^T \mathbf{y} $$

$$ \Rightarrow \hat{\beta} = \left(X^TX + \lambda I_{\text{d}}\right)^{-1}X^T \mathbf{y} $$
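For reference, below is a minimal sketch of computing $\hat{\beta}$ from this closed form. It solves the symmetric positive-definite system $\left(X^TX + \lambda I_{\text{d}}\right)\hat{\beta} = X^T\mathbf{y}$ with a Cholesky factorisation rather than forming the inverse explicitly. The use of nalgebra and the function name `ridge_fit` are assumptions made for illustration, not the PR's actual implementation.

```rust
use nalgebra::{DMatrix, DVector};

// beta_hat = (X^T X + lambda * I)^{-1} X^T y, computed by solving
// (X^T X + lambda * I) beta_hat = X^T y with a Cholesky factorisation.
// Illustrative only; not the PR's actual implementation.
fn ridge_fit(x: &DMatrix<f64>, y: &DVector<f64>, lambda: f64) -> Option<DVector<f64>> {
    let d = x.ncols();
    let gram = x.transpose() * x + DMatrix::<f64>::identity(d, d) * lambda;
    let rhs = x.transpose() * y;
    // For lambda > 0, X^T X + lambda * I is symmetric positive definite,
    // so the Cholesky factorisation exists and the system has a unique solution.
    gram.cholesky().map(|chol| chol.solve(&rhs))
}

fn main() {
    // Toy data with two features; y is roughly x1 + 2 * x2.
    let x = DMatrix::from_row_slice(4, 2, &[
        1.0, 0.0,
        0.0, 1.0,
        1.0, 1.0,
        2.0, 1.0,
    ]);
    let y = DVector::from_vec(vec![1.0, 2.0, 3.0, 4.0]);

    if let Some(beta_hat) = ridge_fit(&x, &y, 0.1) {
        println!("beta_hat = {}", beta_hat);
    }
}
```

Solving the linear system rather than inverting $X^TX + \lambda I_{\text{d}}$ is the usual numerical choice, and for $\lambda > 0$ the matrix is guaranteed to be invertible.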
