Estimated coefficients seem wrong for some scalers
creme version: 0.6.1
Python version: 3.7.7
Operating system: Ubuntu 20.04.1 LTS
I am currently running simulations and the estimated weights seem wrong for a feature generated from a normal distribution when using scalers such as `MaxAbsScaler` or `MinMaxScaler`.
Also:
- When the feature is generated from a uniform distribution, the weight converges to the appropriate value (for both the MaxAbs and MinMax scalers).
- With `StandardScaler`, the estimated weight does converge for a feature generated from a normal distribution.
Below is a graph summarizing the issue (using `MinMaxScaler`).
Reading the graph:
- x-axis: the number of iterations
- y-axis: the value of the coefficient/weight
- coeff_uniform: the value of the coefficient when the feature is generated from a uniform distribution
- coeff_normal: the value of the coefficient when the feature is generated from a normal distribution
Steps/code to reproduce
```python
# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from creme import (
    compose,
    linear_model,
    metrics,
    optim,
    preprocessing,
)


# Helpers
def generate_pipeline(lr=0.1):
    """Build a MinMaxScaler + LinearRegression pipeline trained with SGD."""
    scaler = preprocessing.MinMaxScaler()
    pipeline = compose.Pipeline(scaler)
    pipeline |= linear_model.LinearRegression(optimizer=optim.SGD(lr=lr))
    return pipeline


def generate_experiment(N, beta, random_fun, seed, model=None):
    """Train the pipeline online on y = beta * x + noise and record the weight on x."""
    if model is None:
        model = generate_pipeline()
    N = int(N)
    np.random.seed(seed)
    x_arr = random_fun(size=N)
    noise = np.random.normal(size=N, scale=0.1)
    y_arr = beta * x_arr + noise
    # Store the estimated coefficient after each update
    coeff_arr = np.empty(N)
    metric = metrics.MSE()
    for i, (x, y) in enumerate(zip(x_arr, y_arr)):
        x = dict(x=x)
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        model.fit_one(x=x, y=y)
        coeff_arr[i] = model["LinearRegression"].weights["x"]
    print(metric, "for", " ".join(str(random_fun).split()[1:5]))
    return model, coeff_arr


# Script
beta = 2
seed = 123
N = 1000
_, coeff_unif = generate_experiment(random_fun=np.random.uniform, N=N, beta=beta, seed=seed)
_, coeff_normal = generate_experiment(random_fun=np.random.normal, N=N, beta=beta, seed=seed)

# Plotting
df_coeff = pd.DataFrame({"coeff_uniform": coeff_unif, "coeff_normal": coeff_normal, "true_coeff": beta})
fig, ax = plt.subplots(figsize=(8, 5))
df_coeff.plot(ax=ax, style=["b-", "r-", "k--"])
# fig.savefig("coeff_anomaly.png")
```
Do you think this is a bug or just a downside of min-max scaling? It's rarely recommended to use min-max scaling for linear models anyway.
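As a rough illustration of that downside (plain numpy, no creme involved, and only back-of-the-envelope reasoning): min-max scaling rewrites the regression the model has to solve, since `y = beta * x + noise` becomes `y = beta * min + beta * (max - min) * x_scaled + noise` once `x` is replaced by `(x - min) / (max - min)`. For a normal feature the observed range is much wider than for a uniform one, so the target coefficient and intercept on the scaled feature are correspondingly larger:

```python
import numpy as np

np.random.seed(123)
beta = 2

for name, draw in [("uniform", np.random.uniform), ("normal", np.random.normal)]:
    x = draw(size=1000)
    lo, hi = x.min(), x.max()
    # On the scaled feature x_scaled = (x - lo) / (hi - lo), the data-generating
    # process becomes y = beta * lo + beta * (hi - lo) * x_scaled + noise, so
    # these are the values the linear model has to reach.
    print(
        f"{name:8s} observed range=({lo:+.2f}, {hi:+.2f})  "
        f"target coefficient={beta * (hi - lo):.2f}  "
        f"target intercept={beta * lo:+.2f}"
    )
```

With the same learning rate, the weight for the normal feature therefore has much further to travel, which would at least be consistent with the slow convergence described below.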
I couldn't say whether it's a bug or a downside of these scalers, since I haven't delved into the internals of `fit_one`.
After thinking about it, the problem is more the slow convergence with `MinMaxScaler` than the final estimated coefficients (they do converge after 5000+ iterations).
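For reference, this is roughly how I checked it, re-using `generate_experiment` from the snippet above (the 10,000-iteration figure is arbitrary):

```python
# Same experiment as above with ~10x more iterations, to see where the
# coefficient for the normal feature ends up with MinMaxScaler.
_, coeff_normal_long = generate_experiment(
    random_fun=np.random.normal, N=10_000, beta=beta, seed=seed
)
print(coeff_normal_long[-1])  # value of the coefficient after the last update
```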
The estimated coefficients for some scalers will indeed be different, since the transformed feature range differs from the original one. In this regard, there is no problem with `MaxAbsScaler`: the coefficients do converge for the normally distributed feature (not shown here).
However, the situation seems different for `MinMaxScaler`: the coefficients converge too slowly for the normally distributed feature, and the MSE is an order of magnitude higher in this case.
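For reference, the `MaxAbsScaler` comparison only needs the scaler swapped in the pipeline; a quick sketch (`generate_pipeline_with` is just an ad-hoc helper for this comment, the rest re-uses `generate_experiment` as-is):

```python
# Same pipeline as in the reproduction snippet, but with the scaler passed in,
# so MinMaxScaler and MaxAbsScaler can be compared on identical data.
def generate_pipeline_with(scaler, lr=0.1):
    pipeline = compose.Pipeline(scaler)
    pipeline |= linear_model.LinearRegression(optimizer=optim.SGD(lr=lr))
    return pipeline

_, coeff_normal_maxabs = generate_experiment(
    random_fun=np.random.normal,
    N=N,
    beta=beta,
    seed=seed,
    model=generate_pipeline_with(preprocessing.MaxAbsScaler()),
)
```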
Below is a plot of both the estimated intercept and the coefficient for `MinMaxScaler`. What is happening is that they drift away in opposite directions. It seems that the estimated coefficient/weight makes up for the wrong negative updates of the intercept (if the coefficient is updated after the intercept), which is a bit strange to me.
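For completeness, here is roughly how the intercept can be tracked next to the weight. This is a small variation on `generate_experiment` above, and it assumes the linear model exposes its intercept as `model["LinearRegression"].intercept`:

```python
# Variation of the training loop above that records both the weight on "x"
# and the intercept, to visualise them drifting in opposite directions.
def track_weight_and_intercept(N, beta, random_fun, seed, model=None):
    if model is None:
        model = generate_pipeline()
    N = int(N)
    np.random.seed(seed)
    x_arr = random_fun(size=N)
    y_arr = beta * x_arr + np.random.normal(size=N, scale=0.1)
    weights, intercepts = np.empty(N), np.empty(N)
    for i, (x, y) in enumerate(zip(x_arr, y_arr)):
        model.fit_one(x={"x": x}, y=y)
        weights[i] = model["LinearRegression"].weights["x"]
        intercepts[i] = model["LinearRegression"].intercept
    return weights, intercepts

w_arr, b_arr = track_weight_and_intercept(N=N, beta=beta, random_fun=np.random.normal, seed=seed)
pd.DataFrame({"coefficient": w_arr, "intercept": b_arr}).plot(figsize=(8, 5), style=["b-", "r-"])
```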