[BUG] `NGBoostRegressor` failing when `dist="TDistribution"`

Open ShreeshaM07 opened this issue 10 months ago • 4 comments

Describe the bug

In skpro's gradient_boosting module, NGBoostRegressor (the interface to ngboost's NGBRegressor) fails to run when dist="TDistribution". It raises errors like:

    raise LinAlgError("Singular matrix")
numpy.linalg.LinAlgError: Singular matrix
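
For context, here is a minimal sketch of how NumPy produces this exact message. It assumes the error originates from a linear solve on a rank-deficient matrix. The report does not confirm where in the fitting code the solve happens. The helper name `solve_or_report` and the example matrices are hypothetical and for illustration only.

```python
import numpy as np


def solve_or_report(matrix, rhs):
    """Solve matrix @ x = rhs; return the error message if the matrix is singular.

    Hypothetical helper, not part of skpro or ngboost.
    """
    try:
        return np.linalg.solve(matrix, rhs)
    except np.linalg.LinAlgError as err:
        return str(err)


rhs = np.array([1.0, 1.0])

# The second row is exactly twice the first, so the matrix has rank 1.
singular = np.array([[1.0, 2.0], [2.0, 4.0]])
result_singular = solve_or_report(singular, rhs)

# A full-rank diagonal matrix solves without error.
regular = np.array([[2.0, 0.0], [0.0, 4.0]])
result_regular = solve_or_report(regular, rhs)

print(result_singular)
print(result_regular)
```

If the failure does come from such a solve, the matrix being inverted loses rank for some parameter values, which would point at the distribution's numerics rather than at the data.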

To Reproduce

The same Singular matrix error occurs with both sklearn's diabetes dataset and its breast_cancer dataset. To reproduce:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from skpro.regression.gradient_boosting import NGBoostRegressor


# step 1: data specification
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, y)
ngb = NGBoostRegressor(dist="TDistribution")._fit(X_train, Y_train)
Y_preds = ngb._predict(X_test)

Y_dists = ngb._pred_dist(X_test)

print(Y_dists)
Y_pred_proba = ngb.predict_proba(X_test)
print(Y_pred_proba)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_test)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)

Expected behavior

The expected output should look something like this:

[iter 0] loss=5.7260 val_loss=0.0000 scale=1.0000 norm=62.6096
[iter 100] loss=5.3862 val_loss=0.0000 scale=1.0000 norm=44.7994
[iter 200] loss=5.1347 val_loss=0.0000 scale=2.0000 norm=70.8354
[iter 300] loss=4.9709 val_loss=0.0000 scale=1.0000 norm=31.4283
[iter 400] loss=4.8448 val_loss=0.0000 scale=2.0000 norm=57.8725
<ngboost.distns.t.TDistribution object at 0x7a306649f010>
TDistribution(columns=Index(['target'], dtype='object'),
       index=Index([394,  76, 398, 154, 164, 409,  86,  57, 248, 252,
       ...
       337,  16, 115, 134, 158, 256, 315,   7, 292, 119],
      dtype='int64', length=111),
       mu=              0
0    204.242902
1    159.767290
2    180.299182
3    157.156834
4    132.029658
..          ...
106  207.598136
107  111.282266
108  142.690431
109   82.266164
110  144.789344

[111 rows x 1 columns],
       sigma=             0
0    22.784403
1    26.722443
2    41.334656
3    32.130065
4    23.862477
..         ...
106  31.425179
107  33.441920
108  24.632183
109  26.791969
110  34.908296

[111 rows x 1 columns])
Test MSE 4077.414567879142
Test NLL 6.473540253400317

Environment

Python 3.11.8, ngboost 0.5.1

Additional context

The open question is whether the problem lies in the interfacing, i.e. the skpro API, or is genuinely a bug in ngboost's TDistribution itself.

ShreeshaM07, May 02 '24 11:05