ngboost icon indicating copy to clipboard operation
ngboost copied to clipboard

Probelms with the output for regression with LogNormal

Open maximilianpfau opened this issue 4 years ago • 7 comments

Dear @alejandroschuler !

Thank you for the ngboost and you time earlier today.

Unfortunately, the output of regression with a LogNormal distribution appears to be incorrect.

https://colab.research.google.com/drive/1qr1l6h0NrD5PGYfF-FtiTnojb_FKEFAZ?usp=sharing

Specifically: Error 1: Retrival of the shape parameter "s" does not produce any output (cf. below) Error 2: Retrival of the scale parameter "scale" produces the wrong values (i.e., the "s" value from Y_dists.params) Error 3: The expression "np.exp(Y_dists.loc)" produces the "scale" values from Y_dists.params

Thank you for your input in advance, many thanks, Maximilian

maximilianpfau avatar Aug 12 '20 22:08 maximilianpfau

Thanks for the question and the clear demonstration of the issue!

I looked into this and it turns out this is actually the intended behavior! The problem is that you are accessing the parameter values directly from the predicted distributions instead of from the provided dictionary. In other words, if you want the scale parameter s for the first 10 observations you should do Y_dists[0:10].params['s'], not Y_dists[1:10].s. The class properties that encode the parameters are internal to those classes and should not be accessed by the user.

This is clearly a little confusing, so I do think this points towards doing one or two things:

  1. rename all of the internal parameters in all distributions with a leading underscore (e.g. you would have to type Y_dists._scale`) to make it clear (by python convention) that these are supposed to be internal and not accessed by the user
  2. consider using a sigma, mu parametrization of the lognormal, instead of the s = sigma, scale = exp(mu) parametrization that scipy.stats uses. This seems to be more commonly understood?

@ryan-wolbeck @avati @tonyduan Thoughts?

alejandroschuler avatar Aug 13 '20 21:08 alejandroschuler

@alejandroschuler I really like the idea of doing your first suggestion, it would then follow PEP 8. I don't really have an opinion on the second piece.

ryan-wolbeck avatar Aug 14 '20 19:08 ryan-wolbeck

Thank you for the help!

maximilianpfau avatar Aug 15 '20 00:08 maximilianpfau

@ryan-wolbeck are you interested in taking this on?

alejandroschuler avatar Aug 18 '20 17:08 alejandroschuler

@alejandroschuler yeah I can do this, not sure when I'll get to it but I'll put it on the backlog

ryan-wolbeck avatar Aug 19 '20 14:08 ryan-wolbeck

Thank you @ryan-wolbeck!

tonyduan avatar Aug 19 '20 18:08 tonyduan

I also think that there is right now some confusion ... in particular, to clarifiy maybe what is was said by @maximilianpfau

.params['scale']!= .scale ,

which is indeed a bit annoying (one has to figure that out by printing them )

Other than that, I also tested that

  • .dist.median ( = exp(.loc) )
  • .dist.mean ( = exp( .loc + .params['s'] **2 / 2 )
  • (similarly for the variance)

are indeed correct

ggrrll avatar Jan 21 '21 15:01 ggrrll