
pyglmnet returns only nan as coefficients (beta_ and beta0_)

Open erik-jan-climate opened this issue 5 years ago • 23 comments

Python 3.7.3 (default, Mar 27 2019, 22:11:17)

Erik Phone +49one51twotwo887293

All code is run in ipython

I followed this: https://glm-tools.github.io/pyglmnet/auto_examples/plot_group_lasso.html#sphx-glr-auto-examples-plot-group-lasso-py

In bash:
./pip install pyglmnet

# in ipython
from pyglmnet import GLMCV

# Input data X and y are stored in pandas DataFrames which I then convert (see below):

X=
    T850__mean  T_2M__mean  TD_2M__mean
0   275.155975  277.692291   276.955231
1   278.189697  284.367920   281.938110
2   270.829437  276.190979   272.350861
3   270.744324  277.318665   273.140411
4   273.041565  277.693665   274.393768
5   268.639587  274.262177   272.898468
6   269.230347  274.948334   271.464417
7   268.686707  276.265198   274.291565
8   267.068115  275.395508   273.623657
9   268.885742  277.743134   276.271698
10  277.509430  277.701263   275.212555
11  275.660004  279.134888   278.243286
12  266.246857  275.520111   273.073242
13  266.267181  274.186127   272.268036
14  265.423065  272.023621   269.310791
15  267.223267  271.640320   267.889038

y=
    obs
0   1.0
1   1.0
2   1.0
3   1.0
4   2.5
5   1.8
6   1.0
7   1.1
8   1.0
9   1.0
10  1.0
11  1.0
12  1.3
13  1.0
14  1.0
15  1.0

glm = GLMCV(distr="binomial", tol=1e-3, score_metric="pseudo_R2", alpha=1.0, learning_rate=3, max_iter=100, cv=3, verbose=True)

X.to_numpy() =
array([[275.15598, 277.6923 , 276.95523],
       [278.1897 , 284.36792, 281.9381 ],
       [270.82944, 276.19098, 272.35086],
       [270.74432, 277.31866, 273.1404 ],
       [273.04156, 277.69366, 274.39377],
       [268.6396 , 274.26218, 272.89847],
       [269.23035, 274.94833, 271.46442],
       [268.6867 , 276.2652 , 274.29156],
       [267.0681 , 275.3955 , 273.62366],
       [268.88574, 277.74313, 276.2717 ],
       [277.50943, 277.70126, 275.21255],
       [275.66   , 279.1349 , 278.2433 ],
       [266.24686, 275.5201 , 273.07324],
       [266.26718, 274.18613, 272.26804],
       [265.42307, 272.02362, 269.3108 ],
       [267.22327, 271.64032, 267.88904]], dtype=float32)

y.to_numpy().flatten()= array([1. , 1. , 1. , 1. , 2.5, 1.8, 1. , 1.1, 1. , 1. , 1. , 1. , 1.3, 1. , 1. , 1. ])

g=glm.fit(X.to_numpy(),y.to_numpy().flatten())

In ipython:

In [35]: g.beta_
Out[35]: array([nan, nan, nan])

Problem:
nan everywhere. What is my mistake?

So far, I have increased the input data up to 7000 x 10 values for X and 7000 for y. Nothing changed. Still nan everywhere.

erik-jan-climate avatar Feb 21 '20 20:02 erik-jan-climate

It's a bit hard to help based on this. Could you perhaps provide a full script that reproduces the problem and share a small part of your data along with it? thanks

jasmainak avatar Feb 21 '20 20:02 jasmainak

To jasmainak: Can I call you? I have shared a small part of the data already. It is stored in MB-to-GB-large binary files (netCDFs).

erik-jan-climate avatar Feb 21 '20 20:02 erik-jan-climate

Here is a more minimal version of what causes the nans: X and y are pandas DataFrames, X.shape = (9, 2), y.shape = (9, 1)

X=   
  T850__mean  T_2M__mean
0  275.155975  277.692291
1  278.189697  284.367920
2  270.829437  276.190979
3  270.744324  277.318665
4  273.041565  277.693665
5  268.639587  274.262177
6  269.230347  274.948334
7  268.686707  276.265198
8  267.068115  275.395508

y=
   obs
0  0.0
1  0.0
2  0.0
3  0.0
4  1.5
5  0.8
6  0.0
7  0.1
8  0.0

I copied the glm setup (next line) from: https://glm-tools.github.io/pyglmnet/auto_examples/plot_group_lasso.html#sphx-glr-auto-examples-plot-group-lasso-py

glm = GLMCV(distr="gamma", tol=1e-3, 
            score_metric="pseudo_R2", alpha=1.0,  
            learning_rate=3, max_iter=100, cv=3, verbose=True)

g = glm.fit(X.to_numpy(),y.to_numpy().flatten())

g.beta_ 
returns: 
array([nan, nan])

erik-jan-climate avatar Feb 21 '20 20:02 erik-jan-climate

In fact, I am rather sure that one can simply make up any 10 numbers and store them in X, and any random numbers in y. After fitting, the coefficients will be nans.

erik-jan-climate avatar Feb 21 '20 20:02 erik-jan-climate

As I said above, please provide a full reproducible script that I can copy-paste. Happy to take it from there :) Thanks!

jasmainak avatar Feb 21 '20 20:02 jasmainak

Here you are:

import numpy as np 
import pandas as pd 
from pyglmnet import GLMCV
glm = GLMCV(distr="gamma", tol=1e-3, 
            score_metric="pseudo_R2",
            alpha=1.0, learning_rate=3, max_iter=100, cv=3, verbose=True)

X = pd.DataFrame(np.random.randint(0, 100, size=(6, 2)), columns=list('AB'))
y = pd.DataFrame(np.random.randint(0, 10, size=(6, 1)), columns=list('y'))
dum2 = glm.fit(X.to_numpy(), y.to_numpy().flatten())
dum2.beta_

erik-jan-climate avatar Feb 21 '20 21:02 erik-jan-climate

To jasmainak: Is this script now something you can work with?

erik-jan-climate avatar Feb 21 '20 21:02 erik-jan-climate

You have set the learning rate too high; set it to 1e-8 and it should work.
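For example, a minimal sketch of that change applied to the reproduction script above (everything other than learning_rate is kept as posted):

from pyglmnet import GLMCV

# same settings as in the script above, but with a much smaller learning rate
glm = GLMCV(distr="gamma", tol=1e-3,
            score_metric="pseudo_R2",
            alpha=1.0, learning_rate=1e-8, max_iter=100, cv=3, verbose=True)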

jasmainak avatar Feb 21 '20 21:02 jasmainak

I did the same with my real data (4 sample predictors) and got:

print(g.beta_) 
[0. 0. 0. 0.]
print(g.beta0_)                                                                      
0.06673483228915175

My ytrain values are always small and either equal to zero (90%) or larger than zero (10%), e.g. [0., 0., 0., 0., 0., 0., 0.01, 0., 0.3, 0, 0, 0.15, 0, 0, 0, ...]

The result beta_ = [0. 0. 0. 0.] seems to be wrong. I expect them to be very small but different from zero, for example [0.0000247, 0.004, 0.001, 0.006]

I played a bit by providing more predictors or more values per predictor and still got these return values.

Before preparing my next minimal example with real data: is it possible that the beta_ are truncated when printing them? Once, I also got [-0. -0. -0. -0.] which is why I wonder how I can display all (possibly cut off?) digits of these coefficients?

erik-jan-climate avatar Feb 22 '20 09:02 erik-jan-climate

Download this file first and extract it in your python working dir: py.zip

import sys
import numpy as np
import pickle
import pandas as pd
import datetime # Python standard library datetime module
import random 
import copy
from pyglmnet import GLMCV
from pyglmnet.datasets import fetch_group_lasso_datasets
from IPython.core.interactiveshell import InteractiveShell


pd.set_option('display.max_rows', 1000) #_pandas: Print 1000 rows, do not truncate
np.set_printoptions(threshold=np.inf)
np.set_printoptions(precision=16)
np.set_printoptions(suppress=False)
#np.set_printoptions(edgeitems=3,infstr='inf', linewidth=75, nanstr='nan', precision=8, suppress=False, threshold=1000, formatter=None)

InteractiveShell.ast_node_interactivity = "all"  #_output all called vars, not only the one that was called as the last


X = pickle.load( open( "X.pickle", "rb" ) )
y = pickle.load( open( "y.pickle", "rb" ) )


glm = GLMCV(distr="gamma", tol=1e-3, score_metric="pseudo_R2", alpha=1.0, learning_rate=1e-8, max_iter=100, cv=3, verbose=True)

g = glm.fit(X.to_numpy(),y.to_numpy().flatten())
g.beta_
g.beta0_

erik-jan-climate avatar Feb 22 '20 11:02 erik-jan-climate

The shape of beta depends on the shape of your input data. There is no truncation per se, but if you have too many values in an array, it may not display them. This does not appear to be the case for you.
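For reference, that display behavior is controlled by numpy's print options; a small standalone sketch (plain numpy, nothing pyglmnet-specific, with made-up example values):

import numpy as np

coefs = np.array([2.47e-05, -0.0, 3.2e-09, 0.0])

# print every element at full precision instead of the abbreviated repr
np.set_printoptions(threshold=np.inf, precision=16, suppress=False)
print(coefs)

# or format a single coefficient explicitly
print(f"{coefs[0]:.16e}")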

I edited your post above so it's formatted as a python script. However, I can't run it by copy-pasting. If you can do that, I can help you further :)

jasmainak avatar Feb 22 '20 16:02 jasmainak

I can run it. Up to which line can you run it by copy-pasting? You need to download the zip, extract it, and then pickle.load the input values.
As this input data is large (and will be larger soon), a pickle file is the best way I know to get the data to you. If there is a better way to automatically save the data contained in a pandas DataFrame and send it to you, let me know.
I edited the above Python script to make it even clearer that the download and extraction are not part of the script but need to be done first. I thought this was obvious. :)

erik-jan-climate avatar Feb 22 '20 18:02 erik-jan-climate

great, I can replicate now :) I edited your script to remove the netcdf part as it's not needed.

I need to ponder this a bit. In the meantime, do you have an intuition as to why you'd expect the betas to be different from 0.?

jasmainak avatar Feb 22 '20 23:02 jasmainak

Because betas of all 0. mean the y data is best predicted by beta0_ alone, which is a simple constant. But the y values vary between many different values, so they cannot be explained by a simple constant. Also, when providing 10 times the amount of X and y values, all betas are still 0. I ran the model 10 times, always with different dimensions (number of records/samples) for X and y, and GLMCV always returned betas = 0. As the input data X and y result from independent measurements and predictions, the data cannot by chance be so "perfect, strange" that all betas can only be zero. That would be like a one-in-a-trillion chance of hitting exactly this beta configuration with this input data.

erik-jan-climate avatar Feb 23 '20 01:02 erik-jan-climate

@erik-jan-climate i had a brief look at your data and code. a few comments to help you model this the right way.

  • 15/19 samples in y are zeros so i wouldn't be too surprised if a well tuned model tries to capture all the variance in the intercept term.
  • the columns in X are of vastly different scales. i would recommend z-scaling the data. you can do this with sklearn.preprocessing.StandardScaler before fitting a model.
  • you are using alpha=1.0 which is equivalent to Lasso regression (L1 penalty). i would not advise this with only 4 predictors in X. Lasso should be used when you have a large number of predictors and you expect many of them not to have any predictive power.

i did a sanity check that if i model your data with linear regression, GLM and sklearn.linear_model.LinearRegression give equivalent results.

try this:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from pyglmnet import GLM

# Scale the data
sc = StandardScaler()
Xsc = sc.fit_transform(X.to_numpy())
ysc = y.to_numpy().flatten()

# Fit a Gaussian GLM without regularization
glm = GLM(distr="gaussian",
          tol=1e-5,
          score_metric="pseudo_R2",
          alpha=0.,
          reg_lambda=0.,
          learning_rate=2e-1,
          max_iter=1000, verbose=True)
glm.fit(Xsc, ysc)

# Fit a linear regression model (should produce equivalent results as above)
model = LinearRegression()
model.fit(Xsc, ysc)

# Print the model coefficients
print(glm.beta0_, glm.beta_)
print(model.intercept_, model.coef_)

if you'd like to model your data with a Gamma distribution, feel free to try but make sure you play around with reg_lambda, alpha, max_iter, tol, learning_rate. thanks for your question and happy data science!
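For illustration, a rough starting point for such a Gamma fit might look like the sketch below. The specific values for reg_lambda, learning_rate, tol and max_iter are made-up guesses that will need tuning, and note that a Gamma GLM assumes strictly positive targets, so the zeros in y would need to be handled first:

from pyglmnet import GLM

# hypothetical starting values -- tune reg_lambda, alpha, tol, learning_rate, max_iter for your data
glm_gamma = GLM(distr="gamma",
                alpha=0.5,          # elastic-net mix between L1 and L2
                reg_lambda=0.01,    # regularization strength
                tol=1e-5,
                learning_rate=1e-4,
                max_iter=5000,
                score_metric="pseudo_R2",
                verbose=True)
glm_gamma.fit(Xsc, ysc)             # Xsc, ysc as prepared above; ysc must be > 0 for a Gamma model
print(glm_gamma.beta0_, glm_gamma.beta_)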

pavanramkumar avatar Feb 23 '20 03:02 pavanramkumar

@jasmainak one learning for us from here is that i had to increase max_iter and decrease tol relative to the defaults in order to get results equivalent to sklearn's LinearRegression. perhaps we can revisit the defaults and incorporate those into a test.
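A rough sketch of what such a test could look like (the simulated data, names and tolerances here are made up for illustration; this is not an existing pyglmnet test):

import numpy as np
from sklearn.linear_model import LinearRegression
from pyglmnet import GLM

def test_gaussian_glm_matches_linear_regression():
    # simulated linear data with a little noise
    rng = np.random.RandomState(42)
    X = rng.randn(100, 4)
    y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.randn(100)

    # unregularized Gaussian GLM with tighter tol / more iterations than the defaults
    glm = GLM(distr="gaussian", alpha=0., reg_lambda=0.,
              tol=1e-5, learning_rate=2e-1, max_iter=1000)
    glm.fit(X, y)

    lr = LinearRegression().fit(X, y)

    np.testing.assert_allclose(glm.beta_, lr.coef_, rtol=1e-2)
    np.testing.assert_allclose(glm.beta0_, lr.intercept_, rtol=1e-2)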

pavanramkumar avatar Feb 23 '20 03:02 pavanramkumar

> you are using alpha=1.0 which is equivalent to Lasso regression (L1 penalty). i would not advise this with only 4 predictors in X. Lasso should be used when you have a large number of predictors and you expect many of them not to have any predictive power.

This is an important point. When I tried with alpha=0.0, I did not get betas which are all 0. @pavanramkumar I think we should document clearly what alpha is. I always have to look at the code to know if alpha=0.0 is L1 or L2 ...
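For reference, a tiny illustrative sketch of the convention described above (alpha is the mixing weight between the two penalties; this snippet is for illustration and is not from the original thread):

from pyglmnet import GLM

# alpha mixes the penalties: alpha=1.0 -> pure L1 (lasso-like), alpha=0.0 -> pure L2 (ridge-like)
lasso_like = GLM(distr="gaussian", alpha=1.0, reg_lambda=0.1)
ridge_like = GLM(distr="gaussian", alpha=0.0, reg_lambda=0.1)
elastic_net = GLM(distr="gaussian", alpha=0.5, reg_lambda=0.1)  # an even mix of both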

jasmainak avatar Feb 23 '20 05:02 jasmainak

First of all: Thanks to both of you.

Concerning points 1) and 3) of pavanramkumar's answer:
It was just an extremely minimal sample. In fact, I would like to run it on 150 predictors and about 100 MB+ of input data.

I will consider your point 2).

Just so we are on the same page, I would like to give you more data now, because it seems relevant not to stay minimal anymore:

Download this file first and extract it in your python working dir: glmnetPy.zip

It is still just 10 KB (10 predictors, 218 steps, still very minimal). If your answers depend on the amount of input data, I can give you even more.
I still get nans for this (or zeros).

Can you reproduce? Any ideas? (apart from pavanramkumar's point (2), which I will still try)

erik-jan-climate avatar Feb 23 '20 12:02 erik-jan-climate

@pavanramkumar @jasmainak All 3 points are now incorporated in the script: more data, standardization, fewer zeros in y. However, it still returns nans or 0.

sc = StandardScaler()
Xsc = sc.fit_transform(X.to_numpy())
ysc = y.to_numpy().flatten()

Any idea is appreciated. If you think this can be tested/improved better with more input data, please let me know.

erik-jan-climate avatar Feb 24 '20 13:02 erik-jan-climate

@titipata do you have time to investigate what is going on?

jasmainak avatar Feb 25 '20 05:02 jasmainak

@jasmainak I will try to investigate by the end of this week.

titipata avatar Feb 25 '20 19:02 titipata

hi everyone! I'm trying out your library and having similar issues trying to fit a regularized Gamma GLM. My predictions are all nan most of the time. I'm trying out different settings for the learning_rate, tol and iterations, but it seems tough to get something that makes sense. Any tips on this one? I'm going to check whether similar libraries in R also suffer from this issue with my dataset. Thanks for any advice!

blah-crusader avatar Oct 16 '20 12:10 blah-crusader

ummm ... do you mind sharing your data and a small snippet of your code?

jasmainak avatar Oct 19 '20 14:10 jasmainak