
Improve calibration

Open el-hult opened this issue 8 months ago • 1 comment

The gfit function is brittle and fails in some cases. Instead of giving error messages or warnings, the calibration routine just spits out nonsense like infinite, NaN, or zero variance. Examples of failure modes are listed in #93 .

This PR suggests changes to help identify these failures and, to some degree, mitigate them. In the process of making these updates, I also made minor changes to improve readability and maintainability.

My ambition was to:

  1. Make it correct
  2. Make it throw warnings and errors when applicable

Minimal reproduction

The simplest way to reproduce some aspects of the problem is to run the example https://github.com/scikit-learn-contrib/forest-confidence-interval/blob/master/examples/plot_mpg.py with 100 trees (instead of 1000). This gives rise to overflow errors and more.

I could isolate the call to gfit as in the snippet below.

```python
from forestci.calibration import gfit

X = [ 1.28250709e+00, -9.62199807e-01, -2.04701193e+00, -1.02443034e+00,
      6.38396847e-01,  6.93655605e-01,  2.02769335e-01,  2.28704442e+00,
      1.28821259e+00, -2.07737984e+00,  2.45556492e-02,  2.89656660e-01,
      4.53925023e-01,  3.35682503e+00, -4.54257770e-01, -1.23714287e-01,
      3.77470001e+00, -4.23614898e-01,  7.46533342e-01, -8.81119202e-01,
      2.08394853e+00, -1.05411916e-01,  3.31927296e+00,  7.17650800e-01,
      2.25908057e+00, -7.53290424e-01, -2.21822602e-01, -6.27804326e-01,
      9.58914997e-02, -1.14970276e+00,  1.70686211e+00,  1.47393397e-01,
      1.29608780e+00,  2.94860970e-01,  9.31508065e-01,  9.68559862e+00,
      3.47850459e+00,  5.20463355e-01,  1.07368578e-02,  6.57683255e-01,
      1.45860815e+00,  5.14249866e+00,  2.57982131e+01,  3.19180999e-01,
     -3.30013368e+00,  3.49159227e-01,  4.65446974e-02, -2.08369433e+00,
      8.59651030e-01,  5.86331704e-01,  3.32865616e+00,  1.82593891e-01,
      7.18148878e-01,  5.89549645e-02,  5.39916639e-01,  1.02031244e+00,
      2.64314510e+00,  1.10181223e-01,  1.05365282e+00,  1.84280809e-01,
      2.04459181e+00, -6.69517305e-01,  4.63034132e+00, -2.03630094e+00,
     -4.16557994e-01,  6.05384570e-01, -7.41193944e-01,  3.28397015e-01,
     -2.02724573e-01,  2.81343215e-01, -4.52882044e-02, -5.11185513e-01,
      1.45608437e-02, -1.81530128e+00, -2.56268846e+00, -3.21805107e-01,
      8.91628062e+00,  9.19464777e-01, -1.45044659e+00, -4.56431159e+00,
      7.72050142e-01,  5.69059211e+00, -7.16028668e-01,  1.48268265e+00,
      6.25840167e-02,  1.59672239e-01]
sigma = 4.545817017854692
gfit(X, sigma)
```

Analysis

The "calibration" routine is not well documented. There is a reference in the Python code to the Wager/Athey paper, but that article does not mention calibration. The original R code by Wager does the calibration, but it has no reference to any article motivating it. So I worked through the code to understand it.

The calibrateEB function is a denoising function. It assumes that the inputs are noisy observations and tries to recover denoised values. The modelling approach is mu ~ G, X ~ N(mu, sigma^2). If the prior distribution G were known, then for each observed variance x we would take the expectation over the posterior distribution: x_calib = E[mu | x]. Since G is not known, we model it with empirical Bayes. The function gfit does this fitting for us.
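The posterior-expectation step can be sketched on a discrete grid. This is a standalone illustration, not the library's actual implementation: the names posterior_mean, mu_grid and g_est are made up, and a standard-normal toy prior stands in for the fitted G.

```python
import numpy as np
from scipy.stats import norm

def posterior_mean(x, mu_grid, g_est, sigma):
    """Return E[mu | x] for mu ~ G (discretized as weights g_est on mu_grid)
    and x ~ N(mu, sigma^2)."""
    likelihood = norm.pdf(x, loc=mu_grid, scale=sigma)
    post = likelihood * g_est
    post /= post.sum()               # normalize the posterior over the grid
    return np.sum(mu_grid * post)    # posterior expectation = calibrated value

mu_grid = np.linspace(-5, 5, 201)
g_est = norm.pdf(mu_grid)            # toy prior: standard normal
g_est /= g_est.sum()

x_calib = posterior_mean(4.0, mu_grid, g_est, sigma=1.0)
# the noisy observation 4.0 is shrunk toward the prior mean (about 2.0 here)
```

With a normal prior and normal noise this reproduces the textbook normal-normal shrinkage, which makes it easy to sanity-check the grid approximation.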

The things I found while analyzing:

  1. In calibrateEB, there is a use of list() and map() with a functools.partial. A list comprehension is typically faster, needs fewer imports, and is arguably easier to read.
  2. The docstring for gbayes had a mistake in the description of the g_est parameter.

Now we can focus on gfit.

  1. The limits in parameter space are unnecessarily low. The lower limit is currently chosen as min(min(...),0), but all nonpositive values are useless, since the mask variable effectively cancels them out in calculations further down in the code. I suppose this was a mistake in the original code, and it should be max(min(...),0).

https://github.com/scikit-learn-contrib/forest-confidence-interval/blob/ffa722702ed51f13b464fe8228cf72f456dd3db5/forestci/calibration.py#L61-L63
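As a standalone illustration of the proposed clamp (the value below is made up), min keeps a useless negative limit while max clamps it to zero:

```python
candidate = -3.7                  # a data-driven lower limit that came out negative
lo_current = min(candidate, 0)    # current code: stays at -3.7, wasting grid points
lo_suggested = max(candidate, 0)  # suggested fix: clamped to 0
```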

  1. The model for G is a mixture model: a uniform 'slab' with probability 0.1, plus, with probability 0.9, a density proportional to exp(eta[0]*x + eta[1]*x**2 + ... + eta[p]*x**p). The linear combination of the slab and the power expansion is not weighted properly, so the posterior does not normalize to 1. It is done correctly in the neg_loglik function, but the parentheses are wrong in the following snippet. The problem can be simplified by introducing a g_slab density separately, reducing the number of parentheses later in the code.

https://github.com/scikit-learn-contrib/forest-confidence-interval/blob/ffa722702ed51f13b464fe8228cf72f456dd3db5/forestci/calibration.py#L102-L104
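A sketch of the properly weighted mixture, with hypothetical names (g_slab, g_eta) and a made-up grid and parameter vector; normalizing each component separately before mixing guarantees the mixture itself sums to 1 on the grid:

```python
import numpy as np

xvals = np.linspace(0.0, 10.0, 200)
p = 2
XX = np.column_stack([xvals ** k for k in range(1, p + 1)])
eta = np.array([0.5, -0.1])

g_slab = np.full_like(xvals, 1.0 / len(xvals))  # uniform slab, sums to 1

g_eta = np.exp(XX @ eta)
g_eta /= g_eta.sum()                            # power-expansion part, sums to 1

g_mix = 0.1 * g_slab + 0.9 * g_eta              # mixture also sums to 1
```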

  1. The convolution kernel used to produce the data distribution f_eta is called noise_rotate. If xvals is not symmetric around 0, it will differ from the noise kernel in gbayes, making the whole procedure invalid. I think a better definition of the noise kernel would be along the lines of

```python
noise_rotate = norm(scale=sigma, loc=xvals.mean()).pdf(xvals)
noise_rotate /= noise_rotate.sum()
```
  1. The expansion of the parametric model is numerically unstable. It is currently done as in https://github.com/scikit-learn-contrib/forest-confidence-interval/blob/ffa722702ed51f13b464fe8228cf72f456dd3db5/forestci/calibration.py#L76-L81, but this causes numerical issues in the later optimization over eta. With p=5, which is the default in this library, you get a matrix whose columns span vastly different orders of magnitude. Normalizing the columns improves the situation somewhat, and I think the code should do this. If xvals >= 0, we can also skip the multiplication with the mask.

```python
XX = np.column_stack([pow(xvals, exp) for exp in range(1, p + 1)])
XX /= np.sum(XX, axis=0, keepdims=True)  # normalize each feature column for better numerical stability
```
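To see the magnitude spread that this normalization addresses, here is a standalone sketch with a made-up grid:

```python
import numpy as np

xvals = np.linspace(0.0, 30.0, 200)
p = 5
XX = np.column_stack([xvals ** k for k in range(1, p + 1)])

col_sums = XX.sum(axis=0)
spread = col_sums.max() / col_sums.min()  # ratio of largest to smallest column

XXn = XX / col_sums                       # afterwards, every column sums to 1
```

On this grid the raw spread is on the order of 10^5, which is the kind of ill-conditioning that destabilizes the optimizer.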

It does not solve the whole problem, and relaxing the tolerance a bit on the solver might be required as well. I relaxed the tolerance until the MPG example completed successfully.

  1. The defaults in this code differ from the R code, without motivation. The R code uses p=2 and nbins=1000, while this code uses p=5 and nbins=200. I think the R code defaults are more sensible.
  1. The code warning on arithmetic overflow is harmless and should be ignored.
  1. The optimization routine fails quite often, and I think the user should get a warning in those cases.
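The suggested warning on optimizer failure could look like this. It is only a sketch: the name fit_eta is hypothetical, and the tiny iteration budget here just forces a non-convergence for demonstration.

```python
import warnings
import numpy as np
from scipy.optimize import minimize

def fit_eta(neg_loglik, eta0):
    res = minimize(neg_loglik, eta0, method="Nelder-Mead",
                   options={"maxiter": 5})  # tiny budget forces non-convergence
    if not res.success:
        warnings.warn("gfit: optimization of eta did not converge: "
                      + str(res.message), RuntimeWarning)
    return res.x

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    eta = fit_eta(lambda e: float(np.sum((e - 3.0) ** 2)), np.zeros(4))
# caught now holds a RuntimeWarning explaining the failed fit
```

Checking `res.success` on scipy's OptimizeResult is enough to surface the failure without changing the return value, so existing callers keep working.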

Impact of the PR

Impact on the minimal example

I implemented the suggestions above, and the previously failing call to gfit now succeeds, with no warnings. The following diagnostic plots, using the same X and sigma, help in understanding the problem.

(figure: diagnostic plots)

It is clear that the slab is in large part responsible for the inflated variance in the calibration compared to the observed data. The EB model gives us a model of the variances that is not really good for this dataset, but it is the best it can do. The outlier at ca. 27 is likely the culprit; it shows up more clearly in the scatter plot below.

(figure: scatter plot)

Impact on the library examples

With my updated gfit, the MPG example looks as follows.

(figure: MPG example after the changes)

The same plot before my changes was

(figure: MPG example before the changes, sphx_glr_plot_mpg_001)

The results are clearly very similar. The different empirical Bayes priors had no dramatic impact here.

el-hult · Jun 19 '24 11:06