Estimated coefficients seem wrong for some scalers
creme version: 0.6.1
Python version: 3.7.7
Operating system: Ubuntu 20.04.1 LTS
I am currently running simulations and the estimated weights seem wrong for a feature generated from a normal distribution when using scalers such as `MaxAbsScaler` or `MinMaxScaler`.
Also:
- When the feature is generated from a uniform distribution, the weight converges to the appropriate value (for both the MaxAbs and MinMax scalers).
- With `StandardScaler`, the estimated weight does converge for a feature generated from a normal distribution.
Below is a graph summarizing the issue (using `MinMaxScaler`).
Reading the graph:
- x-axis: the number of iterations
- y-axis: the value of the coefficient/weight
- coeff_uniform: the value of the coefficient when the feature is generated from a uniform distribution
- coeff_normal: the value of the coefficient when the feature is generated from a normal distribution
Steps/code to reproduce
```python
# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from creme import (
    compose,
    linear_model,
    metrics,
    optim,
    preprocessing,
)


# Helpers
def generate_pipeline(lr=0.1):
    """Build a MinMaxScaler + LinearRegression pipeline trained with SGD."""
    scaler = preprocessing.MinMaxScaler()
    pipeline = compose.Pipeline(scaler)
    pipeline |= linear_model.LinearRegression(optimizer=optim.SGD(lr=lr))
    return pipeline


def generate_experiment(N, beta, random_fun, seed, model=None):
    """Train the pipeline online on y = beta * x + noise and record the weight on x."""
    if model is None:
        model = generate_pipeline()
    N = int(N)
    np.random.seed(seed)
    x_arr = random_fun(size=N)
    noise = np.random.normal(size=N, scale=0.1)
    y_arr = beta * x_arr + noise
    # Store the estimated coefficient after each update
    coeff_arr = np.empty(N)
    metric = metrics.MSE()
    for i, (x, y) in enumerate(zip(x_arr, y_arr)):
        x = dict(x=x)
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.fit_one(x=x, y=y)
        coeff_arr[i] = model["LinearRegression"].weights["x"]
    print(metric, "for", " ".join(str(random_fun).split()[1:5]))
    return model, coeff_arr


# Script
beta = 2
seed = 123
N = 1000
_, coeff_unif = generate_experiment(random_fun=np.random.uniform, N=N, beta=beta, seed=seed)
_, coeff_normal = generate_experiment(random_fun=np.random.normal, N=N, beta=beta, seed=seed)

# Plotting
df_coeff = pd.DataFrame({"coeff_uniform": coeff_unif, "coeff_normal": coeff_normal, "true_coeff": beta})
fig, ax = plt.subplots(figsize=(8, 5))
df_coeff.plot(ax=ax, style=["b-", "r-", "k--"])
# fig.savefig("coeff_anomaly.png")
```
Do you think this is a bug or just a downside of min-max scaling? It's rarely recommended to use min-max scaling for linear models anyway.
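As a rough illustration of that downside (plain numpy, no creme involved, and only back-of-the-envelope reasoning): min-max scaling rewrites the regression the model has to solve, since `y = beta * x + noise` becomes `y = beta * min + beta * (max - min) * x_scaled + noise` once `x` is replaced by `(x - min) / (max - min)`. For a normal feature the observed range is much wider than for a uniform one, so the target coefficient and intercept on the scaled feature are correspondingly larger:

```python
import numpy as np

np.random.seed(123)
beta = 2

for name, draw in [("uniform", np.random.uniform), ("normal", np.random.normal)]:
    x = draw(size=1000)
    lo, hi = x.min(), x.max()
    # On the scaled feature x_scaled = (x - lo) / (hi - lo), the data-generating
    # process becomes y = beta * lo + beta * (hi - lo) * x_scaled + noise, so
    # these are the values the linear model has to reach.
    print(
        f"{name:8s} observed range=({lo:+.2f}, {hi:+.2f})  "
        f"target coefficient={beta * (hi - lo):.2f}  "
        f"target intercept={beta * lo:+.2f}"
    )
```

With the same learning rate, the weight for the normal feature therefore has much further to travel, which would at least be consistent with the slow convergence described below.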
I couldn't say whether it's a bug or a downside of these scalers, since I haven't delved into the internals of `fit_one`.
After thinking about it, the problem is more the slow convergence with `MinMaxScaler` than the final estimated coefficients (they do converge after 5000+ iterations).
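For reference, this is roughly how I checked it, re-using `generate_experiment` from the snippet above (the 10,000-iteration figure is arbitrary):

```python
# Same experiment as above with ~10x more iterations, to see where the
# coefficient for the normal feature ends up with MinMaxScaler.
_, coeff_normal_long = generate_experiment(
    random_fun=np.random.normal, N=10_000, beta=beta, seed=seed
)
print(coeff_normal_long[-1])  # value of the coefficient after the last update
```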
The estimated coefficients for some scalers will indeed be different, since the transformed feature range differs from the original one. In this regard, there is no problem with `MaxAbsScaler`: the coefficients do converge for the normally distributed feature (not shown here).
However, the situation seems different for `MinMaxScaler`: the coefficients converge too slowly for the normally distributed feature, and the MSE is an order of magnitude higher in this case.
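For reference, the `MaxAbsScaler` comparison only needs the scaler swapped in the pipeline; a quick sketch (`generate_pipeline_with` is just an ad-hoc helper for this comment, the rest re-uses `generate_experiment` as-is):

```python
# Same pipeline as in the reproduction snippet, but with the scaler passed in,
# so MinMaxScaler and MaxAbsScaler can be compared on identical data.
def generate_pipeline_with(scaler, lr=0.1):
    pipeline = compose.Pipeline(scaler)
    pipeline |= linear_model.LinearRegression(optimizer=optim.SGD(lr=lr))
    return pipeline

_, coeff_normal_maxabs = generate_experiment(
    random_fun=np.random.normal,
    N=N,
    beta=beta,
    seed=seed,
    model=generate_pipeline_with(preprocessing.MaxAbsScaler()),
)
```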
Below is a plot of both the estimated intercept and the coefficient for `MinMaxScaler`. What is happening is that they drift away in opposite directions. It seems that the estimated coefficient/weight makes up for the wrong negative updates of the intercept (if the coefficient is updated after the intercept), which is a bit strange to me.
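For completeness, here is roughly how the intercept can be tracked next to the weight. This is a small variation on `generate_experiment` above, and it assumes the linear model exposes its intercept as `model["LinearRegression"].intercept`:

```python
# Variation of the training loop above that records both the weight on "x"
# and the intercept, to visualise them drifting in opposite directions.
def track_weight_and_intercept(N, beta, random_fun, seed, model=None):
    if model is None:
        model = generate_pipeline()
    N = int(N)
    np.random.seed(seed)
    x_arr = random_fun(size=N)
    y_arr = beta * x_arr + np.random.normal(size=N, scale=0.1)
    weights, intercepts = np.empty(N), np.empty(N)
    for i, (x, y) in enumerate(zip(x_arr, y_arr)):
        model.fit_one(x={"x": x}, y=y)
        weights[i] = model["LinearRegression"].weights["x"]
        intercepts[i] = model["LinearRegression"].intercept
    return weights, intercepts

w_arr, b_arr = track_weight_and_intercept(N=N, beta=beta, random_fun=np.random.normal, seed=seed)
pd.DataFrame({"coefficient": w_arr, "intercept": b_arr}).plot(figsize=(8, 5), style=["b-", "r-"])
```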