forest-confidence-interval
Improve calibration
The gfit function is brittle and fails in some cases. Instead of raising errors or warnings, the calibration routine silently produces nonsense such as infinite, NaN, or zero variance. Examples of failure modes are listed in #93.
This PR suggests changes to help identify these failures and, to some degree, mitigate them. While making these updates, I also made minor changes to improve readability and maintainability.
My ambition was to:
- Make it correct
- Make it throw warnings and errors when applicable
Minimal reproduction
The simplest way to reproduce some aspects of the problem is to run the example https://github.com/scikit-learn-contrib/forest-confidence-interval/blob/master/examples/plot_mpg.py with 100 trees (instead of 1000). This triggers the overflow errors, among other issues.
I isolated the call to gfit in the snippet below.
```python
X = [1.28250709e+00, -9.62199807e-01, -2.04701193e+00, -1.02443034e+00, 6.38396847e-01, 6.93655605e-01,
     2.02769335e-01, 2.28704442e+00, 1.28821259e+00, -2.07737984e+00, 2.45556492e-02, 2.89656660e-01,
     4.53925023e-01, 3.35682503e+00, -4.54257770e-01, -1.23714287e-01, 3.77470001e+00, -4.23614898e-01,
     7.46533342e-01, -8.81119202e-01, 2.08394853e+00, -1.05411916e-01, 3.31927296e+00, 7.17650800e-01,
     2.25908057e+00, -7.53290424e-01, -2.21822602e-01, -6.27804326e-01, 9.58914997e-02, -1.14970276e+00,
     1.70686211e+00, 1.47393397e-01, 1.29608780e+00, 2.94860970e-01, 9.31508065e-01, 9.68559862e+00,
     3.47850459e+00, 5.20463355e-01, 1.07368578e-02, 6.57683255e-01, 1.45860815e+00, 5.14249866e+00,
     2.57982131e+01, 3.19180999e-01, -3.30013368e+00, 3.49159227e-01, 4.65446974e-02, -2.08369433e+00,
     8.59651030e-01, 5.86331704e-01, 3.32865616e+00, 1.82593891e-01, 7.18148878e-01, 5.89549645e-02,
     5.39916639e-01, 1.02031244e+00, 2.64314510e+00, 1.10181223e-01, 1.05365282e+00, 1.84280809e-01,
     2.04459181e+00, -6.69517305e-01, 4.63034132e+00, -2.03630094e+00, -4.16557994e-01, 6.05384570e-01,
     -7.41193944e-01, 3.28397015e-01, -2.02724573e-01, 2.81343215e-01, -4.52882044e-02, -5.11185513e-01,
     1.45608437e-02, -1.81530128e+00, -2.56268846e+00, -3.21805107e-01, 8.91628062e+00, 9.19464777e-01,
     -1.45044659e+00, -4.56431159e+00, 7.72050142e-01, 5.69059211e+00, -7.16028668e-01, 1.48268265e+00,
     6.25840167e-02, 1.59672239e-01]
sigma = 4.545817017854692

from forestci.calibration import gfit
gfit(X, sigma)
```
Analysis
The "calibration" routine is not well documented. The Python code references the Wager/Athey paper, but that article does not mention calibration. The original R code by Wager performs the calibration, but cites no article motivating it. So I worked through the code to understand it.
The calibrateEB function is a denoising function: it assumes its inputs are noisy observations and tries to recover denoised values. The modelling approach is mu ~ G, X ~ N(mu, sigma^2). If the prior distribution G were known, then for each observed variance x we would take the expectation over the posterior distribution: x_calib = E[mu | x]. Since G is not known, we model it with empirical Bayes. The function gfit does this fitting for us.
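To make the posterior-mean step concrete, here is a minimal sketch of that calculation with a discretized prior on a grid. The function and variable names are hypothetical, not the library's; with a conjugate normal prior the result can be checked in closed form.

```python
import numpy as np
from scipy.stats import norm

def posterior_mean(x, mu_grid, g, sigma):
    """E[mu | x] under a discretized prior g and N(mu, sigma^2) noise."""
    lik = norm.pdf(x, loc=mu_grid, scale=sigma)  # likelihood of x at each grid point
    post = lik * g
    post /= post.sum()                           # normalize to a proper posterior
    return float(np.dot(post, mu_grid))

# Toy prior: a discretized N(4, 1)
mu_grid = np.linspace(-2.0, 14.0, 801)
g = norm.pdf(mu_grid, loc=4.0, scale=1.0)
g /= g.sum()

# Conjugate check: with prior N(4, 1) and noise sd 1, E[mu | x=9] = (4 + 9) / 2 = 6.5
x_calib = posterior_mean(9.0, mu_grid, g, sigma=1.0)
```

gfit's job is to produce the discretized g above from the data; calibrateEB then applies the posterior expectation to each observation.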
The things I found while analyzing:
- in calibrateEB, there is a use of list() and map() with a functools.partial. List comprehensions are typically faster, need fewer imports, and are arguably easier to read.
- the docstring for gbayes had a mistake in the g_est parameter
Now we can focus on gfit:
- the limits in parameter space are unnecessarily low. The lower limit is currently chosen as min(min(...), 0), but all nonpositive values are useless, since the mask variable effectively cancels them out in calculations further down in the code. I suppose this was a mistake in the original code, and it should be max(min(...), 0).
https://github.com/scikit-learn-contrib/forest-confidence-interval/blob/ffa722702ed51f13b464fe8228cf72f456dd3db5/forestci/calibration.py#L61-L63
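A tiny sketch of the difference, using a hypothetical observation vector (the actual limit expression in calibration.py is more involved, but the min-vs-max point is the same):

```python
# Hypothetical variance estimates; one is negative due to noise.
X = [0.5, 1.2, -0.3, 4.0]

lo_old = min(min(X), 0)  # current code: always <= 0
lo_new = max(min(X), 0)  # proposed: never below 0, since the mask
                         # discards nonpositive bins anyway
```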
- The model for G is a mixture: a uniform 'slab' with probability 0.1, plus probability 0.9 proportional to exp(eta[0]*x + eta[1]*x**2 + ... + eta[p]*x**p). The linear combination of the slab and the power expansion is not weighted properly, so the posterior does not normalize to 1. It is done correctly in the neg_loglik function, but the parentheses are wrong in the following snippet. The problem can be simplified by introducing a g_slab density separately, reducing the number of parentheses later in the code.
https://github.com/scikit-learn-contrib/forest-confidence-interval/blob/ffa722702ed51f13b464fe8228cf72f456dd3db5/forestci/calibration.py#L102-L104
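A sketch of the properly weighted mixture with a separate g_slab density. The variable names here are hypothetical, not the library's; the 0.1/0.9 weights are the ones described above.

```python
import numpy as np

def g_eta(eta, XX, mask):
    # Uniform slab over the admissible (mask > 0) bins
    g_slab = mask / mask.sum()
    # Exponential-family part, normalized over the same bins
    g_param = mask * np.exp(XX @ eta)
    g_param /= g_param.sum()
    # Convex combination: weights sum to 1, so the density normalizes exactly
    return 0.1 * g_slab + 0.9 * g_param

xvals = np.linspace(0.0, 5.0, 50)
mask = (xvals > 0).astype(float)
XX = np.column_stack([xvals, xvals ** 2])
g = g_eta(np.array([0.1, -0.05]), XX, mask)
```

Because each component is normalized before mixing, g.sum() is exactly 1 regardless of eta, which is the property the current snippet loses.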
- The convolution kernel used to produce the data distribution f_eta is called noise_rotate. If xvals is not symmetric around 0, it will differ from the noise kernel in gbayes, making the whole procedure invalid. I think a better definition of the noise kernel would be along the lines of

```python
noise_rotate = norm(scale=sigma, loc=xvals.mean()).pdf(xvals)
noise_rotate /= noise_rotate.sum()
```
- The expansion of the parametric model is numerically unstable. It is currently done like
https://github.com/scikit-learn-contrib/forest-confidence-interval/blob/ffa722702ed51f13b464fe8228cf72f456dd3db5/forestci/calibration.py#L76-L81
but this causes numerical issues in the later optimization over eta. With p=5, which is the setting in this library, the design matrix has columns of vastly different orders of magnitude. Normalizing the columns improves the situation somewhat, and I think the code should do this. If xvals >= 0 we can also skip the multiplication with the mask.

```python
XX = np.column_stack([xvals ** exp for exp in range(1, p + 1)])
# normalize each feature column for better numerical stability
XX /= np.sum(XX, axis=0, keepdims=True)
```

This does not solve the whole problem; relaxing the tolerance on the solver may be required as well. I relaxed the tolerance just enough for the MPG example to complete successfully.
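A quick illustration of the magnitude spread and the effect of column normalization. The grid here is hypothetical, but it spans roughly the range of values seen in the minimal example above.

```python
import numpy as np

xvals = np.linspace(0.01, 30.0, 200)
p = 5
XX = np.column_stack([xvals ** e for e in range(1, p + 1)])

# Raw columns span many orders of magnitude: compare max(x) with max(x**5)
spread_raw = XX.max(axis=0)[-1] / XX.max(axis=0)[0]

# After normalizing each column to unit sum, column scales become comparable
XXn = XX / XX.sum(axis=0, keepdims=True)
spread_norm = XXn.max(axis=0)[-1] / XXn.max(axis=0)[0]
```

On this grid spread_raw is on the order of 30**4, while spread_norm is a single-digit ratio, which is far friendlier to a gradient-based solver.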
- The defaults in this code differ from those in the R code, without motivation. The R code uses p=2 and nbins=1000, while this code uses p=5 and nbins=200. I think the R code defaults are more sensible.
- The warning the code emits on arithmetic overflow is harmless and should be ignored.
- The optimization routine fails quite often, and I think the user should get a warning in those cases.
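A sketch of the proposed warning, wrapping a hypothetical solver call. The wrapper name and the objective are illustrative only; the tiny iteration budget is there just to force a failure so the warning path is exercised.

```python
import warnings
import numpy as np
from scipy.optimize import minimize

def fit_eta(neg_loglik, eta0):
    """Run the solver and warn, instead of failing silently, on non-convergence."""
    res = minimize(neg_loglik, eta0, method="Nelder-Mead",
                   options={"maxiter": 2})  # deliberately too small, to force failure
    if not res.success:
        warnings.warn("gfit: calibration optimizer did not converge; "
                      "the calibrated variances may be unreliable.")
    return res.x

eta = fit_eta(lambda e: float(np.sum((e - 1.0) ** 2)), np.zeros(5))
```

Surfacing res.success this way costs nothing in the happy path and gives the user a clear signal in exactly the failure modes described above.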
Impact of the PR
Impact on the minimal example
I implemented the suggestions above, and the previously failing call to gfit now succeeds, with no warnings. The following diagnostic plots, using the same X and sigma, are helpful for understanding the problem better.
It is clear that the slab is in large part responsible for the inflated variance in the calibration compared to the observed data. The EB model gives us a model of the variances that is not really good for this observed dataset, but it is the best it can do. The outlier at ca. 27 is the likely culprit; it shows up more clearly in the scatter plot below.
Impact on the library examples
With my updated gfit, the MPG example looks like below.
The same plot before my changes was:
The results are clearly very similar; the different empirical Bayes priors had no dramatic effect here.