[Phase 2 Parity] Implement Ridge and Lasso Regression
Problem: No L1/L2 regularized linear regression.
Missing: Ridge (HIGH), Lasso (HIGH), ElasticNet (HIGH), RidgeCV, LassoCV, ElasticNetCV.
Use Cases: Feature selection, multicollinearity handling, high-dimensional data.
Goal: Efficient coordinate descent solver, alpha tuning, cross-validation integration.
Junior Developer Implementation Guide: Issue #386
Overview
Issue: Ridge and Lasso Regression
Goal: Implement regularized linear regression models (L2 and L1 regularization)
Difficulty: Intermediate-Advanced
Estimated Time: 10-14 hours
What You'll Be Building
You'll implement regularized linear regression models:
- IRidgeRegression Interface - Specific to Ridge regression
- ILassoRegression Interface - Specific to Lasso regression
- RidgeRegression - L2 regularization (closed-form solution)
- LassoRegression - L1 regularization (coordinate descent algorithm)
- ElasticNetRegression - Combines L1 and L2 (bonus)
- Comprehensive Unit Tests - 80%+ coverage
Understanding Regularized Regression
What is Regularization?
Regularization adds a penalty term to prevent overfitting by keeping model coefficients small.
Problem with Ordinary Least Squares (OLS):
- Can overfit on training data
- Large coefficients for correlated features
- Unstable when features > samples
- Poor generalization to new data
Solution: Add penalty for large coefficients
Ridge Regression (L2 Regularization)
Ridge adds the sum of squared coefficients as penalty.
Mathematical Formula:
Minimize: ||y - Xw||² + α * ||w||²
where:
y = target values
X = feature matrix
w = coefficient weights
α = regularization parameter (alpha)
||w||² = sum of squared coefficients (L2 norm)
Expanded:
Loss = ∑(yᵢ - ŷᵢ)² + α * ∑wⱼ²
First term: Prediction error (MSE)
Second term: Penalty for large coefficients
Closed-Form Solution:
w = (X^T X + α I)^(-1) X^T y
where:
I = identity matrix
Adding α I makes the matrix invertible even when X^T X is singular. This works because X^T X is positive semi-definite (all eigenvalues are >= 0, with some exactly 0 when it is singular); adding α I shifts every eigenvalue up by α, so for any α > 0 the resulting matrix has strictly positive eigenvalues and can always be inverted.
Properties:
- Shrinks coefficients toward zero but never exactly to zero
- All features retained (no feature selection)
- Handles multicollinearity well
- Computationally efficient (closed-form solution)
When to Use:
- Many correlated features
- Preventing overfitting
- Regularization without feature selection
- When you want all features in the model
Real-World Example: Predicting house prices with 100 correlated features (size, rooms, location metrics). Ridge keeps all features but prevents any single feature from dominating.
Lasso Regression (L1 Regularization)
Lasso (Least Absolute Shrinkage and Selection Operator) adds the sum of absolute coefficients as penalty.
Mathematical Formula:
Minimize: ||y - Xw||² + α * ||w||₁
where:
||w||₁ = sum of absolute values of coefficients (L1 norm)
Expanded:
Loss = ∑(yᵢ - ŷᵢ)² + α * ∑|wⱼ|
Properties:
- Shrinks coefficients toward zero
- Sets some coefficients exactly to zero (feature selection!)
- Produces sparse models
- Requires iterative optimization (no closed-form solution)
- Common algorithm: Coordinate Descent (the single-coordinate update rule is sketched below)
When to Use:
- Feature selection needed (sparse models)
- Many irrelevant features
- Interpretable models desired
- You suspect only a few features are truly important
Real-World Example: Gene expression analysis with 10,000 genes but only ~100 are relevant. Lasso automatically identifies and keeps only the important genes.
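To make the coordinate descent idea concrete: holding every other coefficient fixed, the best value of a single coefficient has a closed form. For the ½-scaled objective ½ * ∑(yᵢ - ŷᵢ)² + α * ∑|wⱼ|, that update is
w_j = S(X_j^T r_j, α) / (X_j^T X_j)
where r_j is the residual computed with feature j excluded and S is the soft-thresholding operator: S(z, α) = z - α if z > α, z + α if z < -α, and 0 otherwise. (If the squared error is not scaled by ½, the threshold becomes α/2 instead.) This is exactly the update implemented in Step 2 below.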
ElasticNet (L1 + L2)
ElasticNet combines Ridge and Lasso penalties.
Mathematical Formula:
Minimize: ||y - Xw||² + α * (ρ * ||w||₁ + (1-ρ) * ||w||²)
where:
ρ = mixing parameter (0 to 1)
ρ = 0: Ridge regression
ρ = 1: Lasso regression
0 < ρ < 1: Elastic Net
When to Use:
- Features are correlated AND you want feature selection
- Lasso is too aggressive (eliminates too many features)
- Best of both worlds: feature selection + handling multicollinearity
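Under the analogous ½-scaled objective, with the L2 penalty also halved (½ * ∑(yᵢ - ŷᵢ)² + α * ρ * ∑|wⱼ| + ½ * α * (1-ρ) * ∑wⱼ²), the coordinate descent update changes only slightly from Lasso's:
w_j = S(X_j^T r_j, α * ρ) / (X_j^T X_j + α * (1-ρ))
The L1 part still soft-thresholds the correlation; the L2 part simply adds a constant to the denominator. This is why the bonus ElasticNetRegression can be written as a small extension of the Lasso solver in Step 2.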
Comparison Table
| Aspect | Ridge (L2) | Lasso (L1) | ElasticNet |
|---|---|---|---|
| Penalty | Sum of squares | Sum of absolute | Both |
| Feature Selection | No (keeps all) | Yes (sets to 0) | Yes |
| Solution | Closed-form | Iterative | Iterative |
| Speed | Fast | Slower | Slower |
| Multicollinearity | Excellent | Poor | Good |
| Sparsity | No | Yes | Yes |
Understanding the Codebase
Architecture Pattern
IRegression<T> (Base regression interface)
↓
RegressionBase<T> (Common regression logic - may exist)
↓
┌──────────────┬──────────────┬────────────────┐
RidgeRegression LassoRegression ElasticNetRegression
(L2 - closed) (L1 - iterative) (L1+L2 - iterative)
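The classes in this guide implement an IRegression<T> interface with Fit, Predict, and Score. If the repository already defines IRegression<T> (or the IRidgeRegression/ILassoRegression interfaces listed in the overview), use those instead; the following is only a minimal sketch of the contract this guide assumes:
using AiDotNet.LinearAlgebra;
namespace AiDotNet.Interfaces;
/// <summary>
/// Minimal regression contract assumed by this guide (sketch only - reconcile
/// with whatever interfaces already exist in the codebase).
/// </summary>
public interface IRegression<T>
{
    /// <summary>Trains the model on a feature matrix and target vector.</summary>
    void Fit(Matrix<T> X, Vector<T> y);
    /// <summary>Predicts target values for new samples.</summary>
    Vector<T> Predict(Matrix<T> X);
    /// <summary>Returns the R² score on the given data.</summary>
    T Score(Matrix<T> X, Vector<T> y);
}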
Step-by-Step Implementation Guide
Step 1: Create RidgeRegression
Create file: C:\Users\cheat\source\repos\AiDotNet\src\Regression\RidgeRegression.cs
using System;
using AiDotNet.Interfaces;
using AiDotNet.LinearAlgebra;
using AiDotNet.Helpers;
namespace AiDotNet.Regression;
/// <summary>
/// Implements Ridge Regression with L2 regularization.
/// </summary>
/// <typeparam name="T">The numeric type used for calculations.</typeparam>
/// <remarks>
/// <para>
/// Ridge Regression adds an L2 penalty (sum of squared coefficients) to the ordinary
/// least squares objective function. This regularization prevents overfitting by
/// shrinking coefficient estimates toward zero.
/// </para>
/// <para><b>For Beginners:</b> Ridge Regression is like ordinary linear regression
/// but with a constraint that prevents coefficients from getting too large.
///
/// Think of it like this:
/// - Ordinary regression: Find the line that fits data best (no matter how)
/// - Ridge regression: Find the line that fits data well AND keeps coefficients reasonable
///
/// Why is this useful?
/// - Prevents overfitting (model too complex for training data)
/// - Handles correlated features better
/// - More stable predictions on new data
/// - Works even when you have more features than samples
///
/// The alpha parameter controls the trade-off:
/// - alpha = 0: Ordinary least squares (no regularization)
/// - Small alpha (0.01): Light regularization
/// - Large alpha (100): Heavy regularization, coefficients shrink toward zero
///
/// Real-world example: Predicting house prices with many correlated features
/// (square footage, number of rooms, lot size). Ridge prevents any single
/// feature from dominating the prediction.
/// </para>
/// </remarks>
public class RidgeRegression<T> : IRegression<T>
{
private static readonly INumericOperations<T> NumOps = MathHelper.GetNumericOperations<T>();
/// <summary>
/// The regularization parameter (alpha).
/// </summary>
/// <remarks>
/// <para><b>For Beginners:</b> Alpha controls how much regularization to apply.
///
/// - alpha = 0: No regularization (standard linear regression)
/// - Small alpha (0.01 - 1.0): Light regularization
/// - Medium alpha (1.0 - 10.0): Moderate regularization
/// - Large alpha (10.0 - 100.0): Heavy regularization
///
/// Higher alpha = smaller coefficients = simpler model = less overfitting
/// But too high an alpha can cause underfitting (model too simple).
///
/// Typical approach: Try values [0.01, 0.1, 1, 10, 100] and use cross-validation
/// to find the best alpha for your data.
/// </para>
/// </remarks>
public T Alpha { get; set; }
/// <summary>
/// Whether to fit an intercept term.
/// </summary>
public bool FitIntercept { get; set; }
/// <summary>
/// The learned coefficient weights.
/// </summary>
public Vector<T>? Coefficients { get; private set; }
/// <summary>
/// The intercept term (bias).
/// </summary>
public T Intercept { get; private set; }
/// <summary>
/// Creates a new Ridge Regression model.
/// </summary>
/// <param name="alpha">The regularization parameter (default: 1.0).</param>
/// <param name="fitIntercept">Whether to fit an intercept (default: true).</param>
public RidgeRegression(double alpha = 1.0, bool fitIntercept = true)
{
Alpha = NumOps.FromDouble(alpha);
FitIntercept = fitIntercept;
Intercept = NumOps.Zero;
}
/// <summary>
/// Trains the Ridge Regression model using the closed-form solution.
/// </summary>
/// <param name="X">The feature matrix where each row is a sample.</param>
/// <param name="y">The target values.</param>
/// <remarks>
/// <para><b>For Beginners:</b> Training finds the best coefficients using a mathematical formula.
///
/// The algorithm:
/// 1. If fit_intercept is true, center the data (subtract means)
/// 2. Compute: w = (X^T X + α I)^(-1) X^T y
/// 3. If fit_intercept is true, compute intercept from means
///
/// The key insight: Adding α I to X^T X ensures the matrix is invertible,
/// even when X has more columns than rows or has perfectly correlated features.
///
/// This is a "closed-form" solution, meaning we can compute it directly
/// without iterative optimization. This makes Ridge very fast to train.
/// </para>
/// </remarks>
public void Fit(Matrix<T> X, Vector<T> y)
{
if (X.Rows != y.Length)
throw new ArgumentException("Number of samples in X must match length of y");
int n = X.Rows;
int p = X.Columns;
Matrix<T> XCentered = X;
Vector<T> yCentered = y;
Vector<T>? xMeans = null;
T yMean = NumOps.Zero;
// Center data if fitting intercept
if (FitIntercept)
{
xMeans = new Vector<T>(p);
XCentered = new Matrix<T>(n, p);
// Calculate feature means
for (int j = 0; j < p; j++)
{
T sum = NumOps.Zero;
for (int i = 0; i < n; i++)
{
sum = NumOps.Add(sum, X[i, j]);
}
xMeans[j] = NumOps.Divide(sum, NumOps.FromInt(n));
}
// Calculate target mean
T sumY = NumOps.Zero;
for (int i = 0; i < n; i++)
{
sumY = NumOps.Add(sumY, y[i]);
}
yMean = NumOps.Divide(sumY, NumOps.FromInt(n));
// Center X and y
for (int i = 0; i < n; i++)
{
for (int j = 0; j < p; j++)
{
XCentered[i, j] = NumOps.Subtract(X[i, j], xMeans[j]);
}
}
yCentered = new Vector<T>(n);
for (int i = 0; i < n; i++)
{
yCentered[i] = NumOps.Subtract(y[i], yMean);
}
}
// Compute X^T X
Matrix<T> XtX = MatrixHelper<T>.Multiply(
MatrixHelper<T>.Transpose(XCentered),
XCentered
);
// Add α I to diagonal (regularization)
for (int i = 0; i < p; i++)
{
XtX[i, i] = NumOps.Add(XtX[i, i], Alpha);
}
// Compute X^T y
Vector<T> Xty = new Vector<T>(p);
for (int j = 0; j < p; j++)
{
T sum = NumOps.Zero;
for (int i = 0; i < n; i++)
{
sum = NumOps.Add(sum, NumOps.Multiply(XCentered[i, j], yCentered[i]));
}
Xty[j] = sum;
}
// Solve (X^T X + α I) w = X^T y
// Using explicit matrix inversion for clarity; since (X^T X + α I) is symmetric
// positive definite, a Cholesky-based solver would be more efficient in practice
Matrix<T> XtXInv = MatrixHelper<T>.Invert(XtX);
Coefficients = MatrixHelper<T>.MultiplyVector(XtXInv, Xty);
// Compute intercept if needed
if (FitIntercept)
{
// intercept = y_mean - w^T x_means
T intercept = yMean;
for (int j = 0; j < p; j++)
{
intercept = NumOps.Subtract(intercept,
NumOps.Multiply(Coefficients[j], xMeans![j]));
}
Intercept = intercept;
}
else
{
Intercept = NumOps.Zero;
}
}
/// <summary>
/// Predicts target values for new data.
/// </summary>
/// <param name="X">The feature matrix to predict for.</param>
/// <returns>A vector of predicted values.</returns>
public Vector<T> Predict(Matrix<T> X)
{
if (Coefficients == null)
throw new InvalidOperationException("Model must be trained before prediction");
if (X.Columns != Coefficients.Length)
throw new ArgumentException($"Expected {Coefficients.Length} features, got {X.Columns}");
var predictions = new Vector<T>(X.Rows);
for (int i = 0; i < X.Rows; i++)
{
T pred = Intercept;
for (int j = 0; j < X.Columns; j++)
{
pred = NumOps.Add(pred, NumOps.Multiply(Coefficients[j], X[i, j]));
}
predictions[i] = pred;
}
return predictions;
}
/// <summary>
/// Calculates the R² score (coefficient of determination).
/// </summary>
public T Score(Matrix<T> X, Vector<T> y)
{
var predictions = Predict(X);
// Calculate mean of y
T yMean = NumOps.Zero;
for (int i = 0; i < y.Length; i++)
{
yMean = NumOps.Add(yMean, y[i]);
}
yMean = NumOps.Divide(yMean, NumOps.FromInt(y.Length));
// Calculate SS_tot and SS_res
T ssTot = NumOps.Zero;
T ssRes = NumOps.Zero;
for (int i = 0; i < y.Length; i++)
{
T residual = NumOps.Subtract(y[i], predictions[i]);
ssRes = NumOps.Add(ssRes, NumOps.Multiply(residual, residual));
T deviation = NumOps.Subtract(y[i], yMean);
ssTot = NumOps.Add(ssTot, NumOps.Multiply(deviation, deviation));
}
// R² = 1 - (SS_res / SS_tot)
return NumOps.Subtract(NumOps.One, NumOps.Divide(ssRes, ssTot));
}
}
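A quick usage sketch for the class above, assuming double as the numeric type and the Matrix<T>/Vector<T> constructors and indexers already used in this guide (adjust if the repo's API differs):
// Hypothetical smoke test: y = 2*x0 + 3*x1, with x1 loosely correlated with x0.
var X = new Matrix<double>(4, 2);
var y = new Vector<double>(4);
double[,] data = { { 1, 1 }, { 2, 1 }, { 3, 2 }, { 4, 3 } };
for (int i = 0; i < 4; i++)
{
    X[i, 0] = data[i, 0];
    X[i, 1] = data[i, 1];
    y[i] = 2 * data[i, 0] + 3 * data[i, 1];
}
var ridge = new RidgeRegression<double>(alpha: 0.1);
ridge.Fit(X, y);
var predictions = ridge.Predict(X);
Console.WriteLine($"R² = {ridge.Score(X, y)}");   // should be close to 1.0 for a small alpha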
Step 2: Create LassoRegression with Coordinate Descent
Create file: C:\Users\cheat\source\repos\AiDotNet\src\Regression\LassoRegression.cs
using System;
using AiDotNet.Interfaces;
using AiDotNet.LinearAlgebra;
using AiDotNet.Helpers;
namespace AiDotNet.Regression;
/// <summary>
/// Implements Lasso Regression with L1 regularization using Coordinate Descent.
/// </summary>
/// <typeparam name="T">The numeric type used for calculations.</typeparam>
/// <remarks>
/// <para>
/// Lasso Regression adds an L1 penalty (sum of absolute coefficient values) to the
/// ordinary least squares objective. This penalty encourages sparsity, setting many
/// coefficients exactly to zero, thus performing automatic feature selection.
/// </para>
/// <para><b>For Beginners:</b> Lasso is like Ridge but with a key difference:
/// it can set coefficients exactly to zero, effectively removing features.
///
/// Think of it like cleaning out a closet:
/// - Ridge: Makes all items smaller (shrinks everything toward zero)
/// - Lasso: Throws away items you don't need (sets some coefficients to zero)
///
/// Why is this useful?
/// - Automatic feature selection (identifies important features)
/// - Creates simple, interpretable models
/// - Works well when many features are irrelevant
/// - Produces sparse models (many zero coefficients)
///
/// Example: Predicting disease from 10,000 genes
/// - Most genes are probably irrelevant
/// - Lasso automatically identifies the ~100 important genes
/// - Sets coefficients of irrelevant genes to exactly zero
///
/// The alpha parameter controls sparsity:
/// - Small alpha: Keeps more features (less sparse)
/// - Large alpha: Removes more features (more sparse)
/// </para>
/// </remarks>
public class LassoRegression<T> : IRegression<T>
{
private static readonly INumericOperations<T> NumOps = MathHelper.GetNumericOperations<T>();
/// <summary>
/// The regularization parameter (alpha).
/// </summary>
public T Alpha { get; set; }
/// <summary>
/// Whether to fit an intercept term.
/// </summary>
public bool FitIntercept { get; set; }
/// <summary>
/// Maximum number of iterations for coordinate descent.
/// </summary>
public int MaxIterations { get; set; }
/// <summary>
/// Convergence tolerance for coordinate descent.
/// </summary>
public T Tolerance { get; set; }
/// <summary>
/// The learned coefficient weights.
/// </summary>
public Vector<T>? Coefficients { get; private set; }
/// <summary>
/// The intercept term (bias).
/// </summary>
public T Intercept { get; private set; }
/// <summary>
/// Creates a new Lasso Regression model.
/// </summary>
/// <param name="alpha">The regularization parameter (default: 1.0).</param>
/// <param name="fitIntercept">Whether to fit an intercept (default: true).</param>
/// <param name="maxIterations">Maximum iterations (default: 1000).</param>
/// <param name="tolerance">Convergence tolerance (default: 1e-4).</param>
public LassoRegression(double alpha = 1.0, bool fitIntercept = true,
int maxIterations = 1000, double tolerance = 1e-4)
{
Alpha = NumOps.FromDouble(alpha);
FitIntercept = fitIntercept;
MaxIterations = maxIterations;
Tolerance = NumOps.FromDouble(tolerance);
Intercept = NumOps.Zero;
}
/// <summary>
/// Trains the Lasso Regression model using Coordinate Descent.
/// </summary>
/// <remarks>
/// <para><b>For Beginners:</b> Coordinate Descent is an iterative optimization algorithm.
///
/// Unlike Ridge (which has a closed-form solution), Lasso requires iteration because
/// the L1 penalty makes the optimization problem non-differentiable at zero.
///
/// How Coordinate Descent works:
/// 1. Initialize all coefficients to zero (or random values)
/// 2. Repeat until convergence:
/// a. For each coefficient:
/// - Fix all other coefficients
/// - Update this coefficient using soft-thresholding
/// b. Check if changes are smaller than tolerance
/// 3. Stop when converged or max iterations reached
///
/// Soft-thresholding is the key operation, applied to z = X_j^T r (the correlation
/// between feature j and the current residual) before normalizing by X_j^T X_j:
/// - If |z| <= α: Set coefficient to 0 (feature removed)
/// - If z > α: Set coefficient to (z - α)
/// - If z < -α: Set coefficient to (z + α)
///
/// This is what creates sparsity (zero coefficients).
/// </para>
/// </remarks>
public void Fit(Matrix<T> X, Vector<T> y)
{
if (X.Rows != y.Length)
throw new ArgumentException("Number of samples in X must match length of y");
int n = X.Rows;
int p = X.Columns;
Matrix<T> XCentered = X;
Vector<T> yCentered = y;
Vector<T>? xMeans = null;
T yMean = NumOps.Zero;
// Center and normalize data if fitting intercept
if (FitIntercept)
{
xMeans = new Vector<T>(p);
XCentered = new Matrix<T>(n, p);
// Calculate feature means
for (int j = 0; j < p; j++)
{
T sum = NumOps.Zero;
for (int i = 0; i < n; i++)
{
sum = NumOps.Add(sum, X[i, j]);
}
xMeans[j] = NumOps.Divide(sum, NumOps.FromInt(n));
}
// Calculate target mean
T sumY = NumOps.Zero;
for (int i = 0; i < n; i++)
{
sumY = NumOps.Add(sumY, y[i]);
}
yMean = NumOps.Divide(sumY, NumOps.FromInt(n));
// Center X and y
for (int i = 0; i < n; i++)
{
for (int j = 0; j < p; j++)
{
XCentered[i, j] = NumOps.Subtract(X[i, j], xMeans[j]);
}
}
yCentered = new Vector<T>(n);
for (int i = 0; i < n; i++)
{
yCentered[i] = NumOps.Subtract(y[i], yMean);
}
}
// Initialize coefficients to zero
Coefficients = new Vector<T>(p);
// Precompute X column norms squared
var xNormsSquared = new Vector<T>(p);
for (int j = 0; j < p; j++)
{
T sum = NumOps.Zero;
for (int i = 0; i < n; i++)
{
sum = NumOps.Add(sum, NumOps.Multiply(XCentered[i, j], XCentered[i, j]));
}
xNormsSquared[j] = sum;
}
// Coordinate Descent
for (int iter = 0; iter < MaxIterations; iter++)
{
T maxChange = NumOps.Zero;
for (int j = 0; j < p; j++)
{
T oldCoef = Coefficients[j];
// Compute residual excluding feature j
// r = y - X_(-j) * w_(-j)
var residual = new Vector<T>(n);
for (int i = 0; i < n; i++)
{
T pred = NumOps.Zero;
for (int k = 0; k < p; k++)
{
if (k != j)
{
pred = NumOps.Add(pred,
NumOps.Multiply(XCentered[i, k], Coefficients[k]));
}
}
residual[i] = NumOps.Subtract(yCentered[i], pred);
}
// Compute correlation: X_j^T * residual
T correlation = NumOps.Zero;
for (int i = 0; i < n; i++)
{
correlation = NumOps.Add(correlation,
NumOps.Multiply(XCentered[i, j], residual[i]));
}
// Soft-thresholding update
Coefficients[j] = SoftThreshold(correlation, Alpha);
// Normalize by X_j^T X_j
if (NumOps.GreaterThan(xNormsSquared[j], NumOps.Zero))
{
Coefficients[j] = NumOps.Divide(Coefficients[j], xNormsSquared[j]);
}
// Track maximum change for convergence
T change = NumOps.Abs(NumOps.Subtract(Coefficients[j], oldCoef));
if (NumOps.GreaterThan(change, maxChange))
{
maxChange = change;
}
}
// Check convergence
if (NumOps.LessThan(maxChange, Tolerance))
{
break;
}
}
// Compute intercept if needed
if (FitIntercept)
{
T intercept = yMean;
for (int j = 0; j < p; j++)
{
intercept = NumOps.Subtract(intercept,
NumOps.Multiply(Coefficients[j], xMeans![j]));
}
Intercept = intercept;
}
else
{
Intercept = NumOps.Zero;
}
}
/// <summary>
/// Applies soft-thresholding operator for L1 penalty.
/// </summary>
/// <remarks>
/// <para><b>For Beginners:</b> Soft-thresholding is the magic behind Lasso's sparsity.
///
/// Formula:
/// - If x > α: return (x - α)
/// - If x < -α: return (x + α)
/// - If |x| <= α: return 0
///
/// This creates a "dead zone" around zero where coefficients are set to exactly zero.
/// This is how Lasso performs feature selection - coefficients below the threshold
/// become zero and those features are removed from the model.
/// </para>
/// </remarks>
private T SoftThreshold(T x, T threshold)
{
if (NumOps.GreaterThan(x, threshold))
{
return NumOps.Subtract(x, threshold);
}
else if (NumOps.LessThan(x, NumOps.Negate(threshold)))
{
return NumOps.Add(x, threshold);
}
else
{
return NumOps.Zero;
}
}
/// <summary>
/// Predicts target values for new data.
/// </summary>
public Vector<T> Predict(Matrix<T> X)
{
if (Coefficients == null)
throw new InvalidOperationException("Model must be trained before prediction");
if (X.Columns != Coefficients.Length)
throw new ArgumentException($"Expected {Coefficients.Length} features, got {X.Columns}");
var predictions = new Vector<T>(X.Rows);
for (int i = 0; i < X.Rows; i++)
{
T pred = Intercept;
for (int j = 0; j < X.Columns; j++)
{
pred = NumOps.Add(pred, NumOps.Multiply(Coefficients[j], X[i, j]));
}
predictions[i] = pred;
}
return predictions;
}
/// <summary>
/// Calculates the R² score.
/// </summary>
public T Score(Matrix<T> X, Vector<T> y)
{
var predictions = Predict(X);
T yMean = NumOps.Zero;
for (int i = 0; i < y.Length; i++)
{
yMean = NumOps.Add(yMean, y[i]);
}
yMean = NumOps.Divide(yMean, NumOps.FromInt(y.Length));
T ssTot = NumOps.Zero;
T ssRes = NumOps.Zero;
for (int i = 0; i < y.Length; i++)
{
T residual = NumOps.Subtract(y[i], predictions[i]);
ssRes = NumOps.Add(ssRes, NumOps.Multiply(residual, residual));
T deviation = NumOps.Subtract(y[i], yMean);
ssTot = NumOps.Add(ssTot, NumOps.Multiply(deviation, deviation));
}
return NumOps.Subtract(NumOps.One, NumOps.Divide(ssRes, ssTot));
}
/// <summary>
/// Gets the number of non-zero coefficients (sparsity indicator).
/// </summary>
public int NonZeroCoefficients
{
get
{
if (Coefficients == null) return 0;
int count = 0;
for (int i = 0; i < Coefficients.Length; i++)
{
if (!NumOps.Equals(Coefficients[i], NumOps.Zero))
{
count++;
}
}
return count;
}
}
}
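A usage sketch that shows the sparsity behaviour, under the same assumptions as the Ridge example (double numeric type, Matrix<T>/Vector<T> API from this guide):
// Hypothetical example: 5 features, but only feature 0 actually drives the target.
int n = 20, p = 5;
var rng = new Random(42);
var X = new Matrix<double>(n, p);
var y = new Vector<double>(n);
for (int i = 0; i < n; i++)
{
    for (int j = 0; j < p; j++)
    {
        X[i, j] = rng.NextDouble();
    }
    y[i] = 5.0 * X[i, 0] + 0.01 * rng.NextDouble();   // only feature 0 matters
}
var lasso = new LassoRegression<double>(alpha: 0.1, maxIterations: 5000);
lasso.Fit(X, y);
Console.WriteLine($"Non-zero coefficients: {lasso.NonZeroCoefficients} of {p}");
// Expect most of the irrelevant coefficients to be driven to exactly zero.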
Test Coverage Checklist
Ridge Regression:
- [ ] Perfect fit on simple linear data
- [ ] Coefficient shrinkage (compare to OLS)
- [ ] Handles multicollinearity (correlated features)
- [ ] Works with more features than samples
- [ ] Alpha = 0 gives OLS solution
- [ ] Large alpha shrinks coefficients toward zero
- [ ] R² score calculation
Lasso Regression:
- [ ] Sets some coefficients exactly to zero
- [ ] Feature selection (sparsity)
- [ ] Convergence of coordinate descent
- [ ] Soft-thresholding behavior
- [ ] Handles irrelevant features
- [ ] Compare sparsity at different alpha values
- [ ] NonZeroCoefficients property
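A minimal starting point for the checklists above, assuming the test project uses xUnit (swap the attributes and asserts if the repository uses a different test framework):
using System;
using Xunit;
using AiDotNet.LinearAlgebra;
using AiDotNet.Regression;
public class RegularizedRegressionTests
{
    [Fact]
    public void Ridge_LargeAlpha_ShrinksCoefficientsTowardZero()
    {
        // Simple line y = 2x; heavier regularization should give a smaller slope.
        var X = new Matrix<double>(3, 1);
        var y = new Vector<double>(3);
        for (int i = 0; i < 3; i++) { X[i, 0] = i + 1; y[i] = 2.0 * (i + 1); }
        var light = new RidgeRegression<double>(alpha: 0.01);
        var heavy = new RidgeRegression<double>(alpha: 1000.0);
        light.Fit(X, y);
        heavy.Fit(X, y);
        Assert.True(Math.Abs(heavy.Coefficients![0]) < Math.Abs(light.Coefficients![0]));
    }
    [Fact]
    public void Lasso_LargeAlpha_ZeroesIrrelevantCoefficients()
    {
        // One informative feature plus one noise feature; Lasso should drop the noise.
        var X = new Matrix<double>(10, 2);
        var y = new Vector<double>(10);
        var rng = new Random(0);
        for (int i = 0; i < 10; i++)
        {
            X[i, 0] = i;                      // informative feature
            X[i, 1] = rng.NextDouble();       // noise feature
            y[i] = 3.0 * i;
        }
        var lasso = new LassoRegression<double>(alpha: 1.0);
        lasso.Fit(X, y);
        Assert.True(lasso.NonZeroCoefficients < 2);   // noise feature should be zeroed
    }
}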
Common Mistakes to Avoid
- Not centering data: Can cause intercept issues
- Forgetting to normalize features: Lasso is scale-sensitive
- Wrong alpha range: Try a logarithmic scale (0.001, 0.01, 0.1, 1, 10, 100); see the alpha-sweep sketch after this list
- Not checking convergence: Coordinate descent needs sufficient iterations
- Comparing coefficients between Ridge and Lasso directly: Different scales
- Using Lasso alone when features are highly correlated: Lasso tends to pick one feature from a correlated group arbitrarily; Ridge or ElasticNet handles correlated groups better
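A sketch of the logarithmic alpha sweep mentioned above, using a simple holdout split rather than full k-fold cross-validation. It assumes X and y are already loaded as Matrix<double> and Vector<double>:
// Hypothetical alpha sweep: train on the first 80% of rows, score on the rest.
double[] alphas = { 0.001, 0.01, 0.1, 1, 10, 100 };
int nTrain = (int)(X.Rows * 0.8);
var XTrain = new Matrix<double>(nTrain, X.Columns);
var yTrain = new Vector<double>(nTrain);
var XTest = new Matrix<double>(X.Rows - nTrain, X.Columns);
var yTest = new Vector<double>(X.Rows - nTrain);
for (int i = 0; i < X.Rows; i++)
{
    for (int j = 0; j < X.Columns; j++)
    {
        if (i < nTrain) XTrain[i, j] = X[i, j];
        else XTest[i - nTrain, j] = X[i, j];
    }
    if (i < nTrain) yTrain[i] = y[i];
    else yTest[i - nTrain] = y[i];
}
double bestAlpha = alphas[0];
double bestScore = double.NegativeInfinity;
foreach (double alpha in alphas)
{
    var model = new RidgeRegression<double>(alpha: alpha);
    model.Fit(XTrain, yTrain);
    double score = model.Score(XTest, yTest);
    if (score > bestScore) { bestScore = score; bestAlpha = alpha; }
}
Console.WriteLine($"Best alpha: {bestAlpha} (holdout R² = {bestScore})");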
Learning Resources
- Ridge Regression: https://en.wikipedia.org/wiki/Ridge_regression
- Lasso Regression: https://en.wikipedia.org/wiki/Lasso_(statistics)
- Coordinate Descent: https://en.wikipedia.org/wiki/Coordinate_descent
- Regularization: https://scikit-learn.org/stable/modules/linear_model.html
Validation Criteria
- Ridge has closed-form solution (fast)
- Lasso uses coordinate descent (iterative)
- Lasso produces sparse solutions (zero coefficients)
- Both handle regularization parameter alpha correctly
- Test coverage 80%+
- Proper intercept handling
Good luck! Ridge and Lasso are fundamental regularization techniques used in almost every ML pipeline.