Lasso Regression
Implementation of Lasso Regression
This pull request introduces an implementation of the Lasso regression model as part of the RustQuant_ml crate.
Lasso regression extends linear regression by adding an L1 regularisation term, which encourages sparsity in the coefficient vector, resulting in some coefficients being driven to zero. This in turn enables feature selection and reduces overfitting.
This implementation is designed to align closely with Scikit-Learn’s linear_model.Lasso model. A comparison run of Scikit-Learn’s Lasso on the same data as the unit tests in this PR is available here.
Coordinate Descent Algorithm
The loss function for Lasso regression is given by:
$$ L := \frac{1}{2n}\left\lVert\textbf{y} - X\beta\right\rVert_2^2 + \lambda\left\lVert\beta\right\rVert_1 $$
where $n$ is the number of observations.
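For reference, this objective is cheap to evaluate directly. A minimal sketch in Rust, assuming nalgebra for the linear algebra (the function name and types are illustrative, not necessarily the crate's API):

```rust
use nalgebra::{DMatrix, DVector};

/// Lasso objective: (1 / 2n) * ||y - X beta||_2^2 + lambda * ||beta||_1.
fn lasso_loss(x: &DMatrix<f64>, y: &DVector<f64>, beta: &DVector<f64>, lambda: f64) -> f64 {
    let n = y.len() as f64;
    let residual = y - x * beta;
    let l1: f64 = beta.iter().map(|b| b.abs()).sum();
    residual.norm_squared() / (2.0 * n) + lambda * l1
}
```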
We iterate through the columns one at a time, so we can isolate the update for a single column $j$.
Define the partial residual for column $j$ as:
$$ r_j = \textbf{y} - X\beta + X_j\beta_j $$
where $X_j$ is the $j^{\text{th}}$ column of $X$, and $\beta_j$ is the coefficient corresponding to the $j^{\text{th}}$ feature.
This effectively computes the residuals without the $j^{\text{th}}$ column, allowing us to assess the contribution of feature $j$ to the model.
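The partial residual can be computed directly from the current coefficients. A sketch, under the same nalgebra assumption (names illustrative):

```rust
use nalgebra::{DMatrix, DVector};

/// Partial residual for feature j: r_j = y - X beta + X_j beta_j,
/// i.e. the full residual with feature j's contribution added back.
fn partial_residual(
    x: &DMatrix<f64>,
    y: &DVector<f64>,
    beta: &DVector<f64>,
    j: usize,
) -> DVector<f64> {
    let full_residual = y - x * beta;
    full_residual + x.column(j) * beta[j]
}
```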
The objective function becomes:
$$ \frac{1}{2n}\left\lVert r_j - X_j\beta_j\right\rVert_2^2 + \lambda\left|\beta_j\right| $$
which can be expanded as:
$$ \frac{1}{2n}\left(r_j - X_j\beta_j\right)^T\left(r_j - X_j\beta_j\right) + \lambda\left|\beta_j\right| $$
$$ \Rightarrow \frac{1}{2n}\left(r_j^Tr_j - 2\beta_jX_j^Tr_j + \beta_j^2X_j^TX_j\right) + \lambda\left|\beta_j\right| $$
Noting that $\beta_j \in \mathbb{R}$ (and thus $\beta_j^T = \beta_j$), we define:
$$ \rho_j := \frac{1}{n}X_j^Tr_j \quad \text{and} \quad z_j := \frac{1}{n}X_j^TX_j $$
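Both quantities are plain dot products involving column $j$; for example (nalgebra assumed, names illustrative):

```rust
use nalgebra::{DMatrix, DVector};

/// rho_j = (1/n) X_j^T r_j  and  z_j = (1/n) X_j^T X_j  for feature j.
fn rho_and_z(x: &DMatrix<f64>, r_j: &DVector<f64>, j: usize) -> (f64, f64) {
    let n = x.nrows() as f64;
    let col = x.column(j);
    let rho_j = col.dot(r_j) / n;
    let z_j = col.norm_squared() / n;
    (rho_j, z_j)
}
```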
We can now minimise with respect to $\beta_j$:
$$ \min_{\beta_j} \left[\frac{1}{2}z_j\beta_j^2 - \rho_j\beta_j\right] + \lambda\left|\beta_j\right| $$
Differentiating with respect to $\beta_j$ and setting the derivative to zero at the minimiser $\hat{\beta}_j$, we obtain:
$$ \left.\frac{\partial L}{\partial \beta_j}\right\vert_{\beta_j=\hat{\beta}_j} = z_j\hat{\beta}_j - \rho_j + \lambda\,\text{sgn}\left(\hat{\beta}_j\right) = 0 $$
where
$$ \text{sgn}(x) = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases} $$
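One implementation note: Rust's built-in f64::signum returns 1.0 for +0.0, so it does not match this definition at zero. A small helper that does (illustrative only):

```rust
/// Sign function with sgn(0) = 0 (f64::signum would return 1.0 at +0.0).
fn sgn(x: f64) -> f64 {
    if x > 0.0 {
        1.0
    } else if x < 0.0 {
        -1.0
    } else {
        0.0
    }
}
```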
We now consider three cases:
Case 1: $\hat{\beta}_j > 0$
$$ z_j\hat{\beta}_j - \rho_j + \lambda = 0 $$
$$ \Rightarrow \hat{\beta}_j = \frac{\rho_j - \lambda}{z_j}, \quad \text{valid when } \rho_j > \lambda $$
Case 2: $\hat{\beta}_j < 0$
$$ z_j\hat{\beta}_j - \rho_j - \lambda = 0 $$
$$ \Rightarrow \hat{\beta}_j = \frac{\rho_j + \lambda}{z_j}, \quad \text{valid when } \rho_j < -\lambda $$
Case 3: $\hat{\beta}_j = 0$
Since $\left|\beta_j\right|$ is not differentiable at $\beta_j = 0$, we use its subdifferential at zero, which replaces $\text{sgn}(0)$ with any value in the interval:
$$ \text{sgn}(0) \in [-1, 1] $$
Thus, for $\hat{\beta}_j = 0$ to hold, the term $z_j\hat{\beta}_j$ vanishes and we require:
$$ 0 = -\rho_j + \lambda x \quad \text{for some} \quad x \in [-1, 1] $$
$$ \Rightarrow \rho_j \in [-\lambda, \lambda] $$
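The three cases collapse into the familiar soft-thresholding update $\hat{\beta}_j = S(\rho_j, \lambda) / z_j$, where $S(\rho, \lambda) = \text{sgn}(\rho)\max(\left|\rho\right| - \lambda, 0)$. A sketch (function names are illustrative):

```rust
/// Soft-thresholding operator: S(rho, lambda) = sgn(rho) * max(|rho| - lambda, 0).
fn soft_threshold(rho: f64, lambda: f64) -> f64 {
    if rho > lambda {
        rho - lambda
    } else if rho < -lambda {
        rho + lambda
    } else {
        0.0
    }
}

/// Coordinate update combining the three cases: beta_j = S(rho_j, lambda) / z_j.
fn update_coefficient(rho_j: f64, z_j: f64, lambda: f64) -> f64 {
    soft_threshold(rho_j, lambda) / z_j
}
```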
Since we iterate through each column while holding other coefficients fixed, the update for each coefficient must be repeated until convergence or until a maximum number of iterations is reached.
Define the change in the coefficient for feature $h$ at iteration $k$:
$$ \delta^{(k+1)} := \hat{\beta}^{(k+1)}_h - \hat{\beta}^{(k)}_h $$
For the $i^{\text{th}}$ observation, define the residual:
$$ \varepsilon_i := y_i - \sum_{j=1}^{p} X_{ij}\hat{\beta}_j $$
where $p$ is the number of features.
The residuals are updated after each coefficient adjustment as:
$$ \varepsilon^{(k+1)}_i = \varepsilon^{(k)}_i - \delta^{(k+1)} \cdot X_{ih} $$
This ensures that at each iteration, the residuals reflect the most recent coefficient updates:
$$ \varepsilon^{(k+1)}_i = y_i - \sum_{j \neq h} X_{ij}\hat{\beta}_j - \hat{\beta}^{(k+1)}_h X_{ih} $$
as required.
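Putting the pieces together, the sketch below shows one possible coordinate descent loop with in-place residual updates, again assuming nalgebra; the signature (lambda, max_iter, tol) is illustrative and not necessarily the crate's API. Updating the residuals incrementally, as derived above, avoids recomputing $\textbf{y} - X\beta$ from scratch for every coefficient, so each full sweep over the columns costs $O(np)$.

```rust
use nalgebra::{DMatrix, DVector};

/// Coordinate descent for Lasso: minimises (1/2n)||y - X beta||_2^2 + lambda ||beta||_1.
/// Sweeps over the columns, updating one coefficient at a time while keeping the
/// residuals in sync, until the largest coefficient change falls below `tol`
/// or `max_iter` sweeps have been performed.
fn lasso_coordinate_descent(
    x: &DMatrix<f64>,
    y: &DVector<f64>,
    lambda: f64,
    max_iter: usize,
    tol: f64,
) -> DVector<f64> {
    let (n, p) = (x.nrows(), x.ncols());
    let n_f = n as f64;
    let mut beta = DVector::<f64>::zeros(p);
    // beta starts at zero, so the initial residuals are just y.
    let mut residual = y.clone();

    for _ in 0..max_iter {
        let mut max_delta: f64 = 0.0;
        for j in 0..p {
            let col = x.column(j);
            let z_j = col.norm_squared() / n_f;
            if z_j == 0.0 {
                // All-zero column: its coefficient stays at zero.
                continue;
            }
            // rho_j = (1/n) X_j^T r_j, with r_j = residual + X_j * beta_j,
            // computed without forming the partial residual explicitly.
            let rho_j = (col.dot(&residual) + col.norm_squared() * beta[j]) / n_f;
            // Soft-thresholding update for beta_j.
            let new_beta_j = if rho_j > lambda {
                (rho_j - lambda) / z_j
            } else if rho_j < -lambda {
                (rho_j + lambda) / z_j
            } else {
                0.0
            };
            let delta = new_beta_j - beta[j];
            if delta != 0.0 {
                // epsilon^{(k+1)}_i = epsilon^{(k)}_i - delta * X_{ih}.
                residual -= col * delta;
                beta[j] = new_beta_j;
            }
            max_delta = max_delta.max(delta.abs());
        }
        if max_delta < tol {
            break;
        }
    }
    beta
}
```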