forest-confidence-interval

NaNs in V_IJ

Open ericmjl opened this issue 7 years ago • 4 comments

On a toy problem, in which I am using Random Forests + ForestCI to prototype some ideas, I will randomly get NaNs in the V_IJ estimates.

This toy problem is using a random forest to fit a 1D curve. Code to try reproducing the problem is below.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from forestci import random_forest_error

def func(x):
    # Quadratic target curve to be fitted.
    return x**2 + 3*x - 3

# 20 training points in two clusters, leaving a gap on (-3, 3).
x_train = np.hstack([np.linspace(-10, -3, 10), np.linspace(3, 10, 10)])
x_test = np.linspace(-10, 10, 1000)

fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(20, 8))

# Refit 10 times without a fixed seed to show run-to-run variation.
for i in range(10):
    rfr = RandomForestRegressor(n_estimators=100, n_jobs=1)
    rfr.fit(x_train.reshape(-1, 1), func(x_train))
    preds = rfr.predict(x_test.reshape(-1, 1))
    # V_IJ variance estimate for each test point.
    var_est = random_forest_error(rfr,
                                  x_train.reshape(-1, 1),
                                  x_test.reshape(-1, 1),
                                  calibrate=True)
    # The variance itself (not its square root) is used as the error bar.
    axes.flatten()[i].errorbar(x_test, preds, yerr=var_est)

plt.show()

I noticed two things. Firstly, the estimated errors are unstable from one run to the next. Please see the image below.

[Image: unstable-errors]

Is this a result of having few training samples (only 20 observations on the curve)? Or is there something else I'm missing conceptually?

Secondly, I occasionally get NaNs in var_est (the estimate of V_IJ), so the error bars cannot be plotted.

[Image: occasional-nans]

I'm not quite sure how to diagnose what is happening here. Would you guys be able to provide some input on where I might be doing something wrong?
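
As a first diagnostic step, it may help to count how many entries of var_est are NaN or negative and where they occur. The following is a minimal sketch using only NumPy; the helper name summarize_variance_estimates and the example array are hypothetical and stand in for the real output of random_forest_error.

```python
import numpy as np

def summarize_variance_estimates(var_est):
    """Count NaN and negative entries in an array of variance estimates."""
    var_est = np.asarray(var_est, dtype=float)
    nan_mask = np.isnan(var_est)
    # Comparisons against NaN evaluate to False, so NaNs are not counted as negative.
    with np.errstate(invalid="ignore"):
        negative_mask = var_est < 0
    return {
        "n_total": int(var_est.size),
        "n_nan": int(nan_mask.sum()),
        "n_negative": int(negative_mask.sum()),
        "nan_indices": np.flatnonzero(nan_mask),
    }

# Hypothetical values standing in for the output of random_forest_error.
example = np.array([0.8, np.nan, -0.1, 2.5, np.nan])
summary = summarize_variance_estimates(example)
print(summary)
```

Running this on var_est inside the loop, once per refit, would show whether the NaNs cluster at particular test points or affect whole runs.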

ericmjl, Nov 15 '18 13:11

I have faced a similar issue. I have a target variable with a long right tail. If I cap my target variable, the forest error does not have missing values. As I raise the cap, the number of missing values increases. Any insight as to why this issue occurs would be useful! Thanks!
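
The capping experiment described here can be sketched as a sweep over cap levels. This is a minimal illustration on synthetic data, not the code used in this comment: the helper cap_upper_tail, the lognormal sample, and the chosen quantiles are all assumptions. At each cap level one would refit the forest on the capped target and rerun random_forest_error (not shown) to see where NaNs start to appear.

```python
import numpy as np

def cap_upper_tail(y, quantile=0.99):
    """Clip values above the given quantile; return the capped array and the cap."""
    y = np.asarray(y, dtype=float)
    cap = np.quantile(y, quantile)
    return np.minimum(y, cap), cap

# Synthetic target with a long right tail (assumption for illustration only).
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.5, size=1000)

for q in (0.90, 0.95, 0.99, 1.0):
    y_capped, cap = cap_upper_tail(y, q)
    print(f"quantile={q:.2f} cap={cap:.2f} max after capping={y_capped.max():.2f}")
```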

MillHaus33, May 23 '19 15:05

We are having the same problem. Has anyone found a solution to this?

hermanc1, Jun 27 '19 07:06

Is there any solution for this problem? My NaNs seem to derive from eb_prior = gfit(variances, sigma) in calibrateEB. Thanks in advance!

lcol90, Feb 18 '20 14:02

Similar to @MillHaus33, I see this problem with a target variable that has a long-tailed distribution (or more specifically, quite a few large-value outliers). The problem appears only for certain test/train splits, so it's not the case that the distribution of the target variable always causes the problem. From a practical point of view, there may be some kind of data transformation that doesn't impair the model performance but also avoids this problem. I have not found evidence to support this yet though.
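
One candidate transformation of this kind is a log transform of the target. The sketch below assumes a non-negative target and uses np.log1p with np.expm1 as its inverse; the function names are hypothetical, and whether this actually avoids the NaNs is untested here. Note that variance estimates from a forest fitted on the transformed target would be on the transformed scale, not the original one.

```python
import numpy as np

def to_model_scale(y):
    """Compress a non-negative, long-tailed target with log(1 + y)."""
    return np.log1p(np.asarray(y, dtype=float))

def to_original_scale(z):
    """Invert to_model_scale, mapping predictions back to the original units."""
    return np.expm1(np.asarray(z, dtype=float))

# Hypothetical target with a few large outliers.
y_raw = np.array([0.0, 1.0, 2.0, 5.0, 250.0, 4000.0])
y_model = to_model_scale(y_raw)
y_back = to_original_scale(y_model)

print("range before:", y_raw.max() - y_raw.min())
print("range after: ", y_model.max() - y_model.min())
```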

DavidLloydNGP, Jan 25 '22 15:01