
forest errors are all NaN

newTypeGeek opened this issue 5 years ago • 9 comments

I have encountered this warning message when executing forestci.random_forest_error(regressor, x_train, x_test)

RuntimeWarning: invalid value encountered in true_divide g_eta_main = g_eta_raw / sum(g_eta_raw)

and the results are

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan])

Is this related to a division by zero?
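
For reference, numpy emits exactly this RuntimeWarning when the denominator of that normalization is zero or non-finite, so a division problem of that kind seems plausible. A minimal illustration with plain numpy (not forestci's actual code path):

import numpy as np

# Case 1: every weight underflowed to zero, so the normalization is 0/0 -> nan
g_eta_raw = np.zeros(3)
print(g_eta_raw / np.sum(g_eta_raw))           # RuntimeWarning: invalid value encountered -> [nan nan nan]

# Case 2: np.exp overflowed to inf, so the normalization is inf/inf -> nan
g_eta_raw = np.exp(np.array([800.0, 900.0]))   # RuntimeWarning: overflow encountered in exp
print(g_eta_raw / np.sum(g_eta_raw))           # [nan nan]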

newTypeGeek avatar Sep 04 '20 10:09 newTypeGeek

I encounter the same problem.

haijunli0629 avatar Dec 24 '20 07:12 haijunli0629

I am having the same problem, as well.

markvmiller avatar Dec 31 '20 19:12 markvmiller

I am having the same problem, as well.

EyalSel avatar Feb 06 '21 15:02 EyalSel

FWIW, my data has categorical features; from a cursory perusal of the code, I think they might have played a role in this error.

EyalSel avatar Feb 06 '21 16:02 EyalSel

I had the same issue with a RandomForestRegressor, but applying StandardScaler to the data SOLVED it, so this looks like a numerical problem.
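
For anyone who wants to try the same workaround, a minimal sketch could look like the following (x_train, x_test, y_train are the arrays from the original post; scaling the target as well as the features, and the back-transform of the variances, are my own assumptions, not something the library requires):

import numpy as np
import forestci as fci
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Standardize the features and (my addition) the target before fitting
x_scaler = StandardScaler().fit(x_train)
y_scaler = StandardScaler().fit(np.asarray(y_train).reshape(-1, 1))
x_train_s = x_scaler.transform(x_train)
x_test_s = x_scaler.transform(x_test)
y_train_s = y_scaler.transform(np.asarray(y_train).reshape(-1, 1)).ravel()

regressor = RandomForestRegressor().fit(x_train_s, y_train_s)

# Older forestci versions take X_train here; newer ones take its shape instead
var_s = fci.random_forest_error(regressor, x_train_s, x_test_s)

# The variances refer to the scaled target; multiply by scale_**2
# to express them in the original target units again
var = var_s * y_scaler.scale_[0] ** 2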

danieleongari avatar Sep 08 '22 11:09 danieleongari

Hey @danieleongari: do you want to add a note about this to the docs?

arokem avatar Sep 08 '22 19:09 arokem

Sure, but first I would like to reproduce and isolate the problem, to see where unscaled inputs make it appear. Do you have an idea of which line of code could overflow? PS: I'll have to wait for some free time to dedicate to this, so don't expect it soon :-)

danieleongari avatar Sep 08 '22 19:09 danieleongari

In my testing, if I use the following sample generator

import numpy as np

def func(x, noise=0.1, factor=1):
    return (np.sqrt(x[0]) + x[0]*x[1] + 2*x[1] + sum(x) + noise*np.random.normal()) * factor

in a script, e.g.,

import forestci as fci
from sklearn.ensemble import RandomForestRegressor

n_features = 3
n_train_samples = 21
factor = 100_000  # any value > 10000 reproduces the warnings below
X_train = np.random.rand(n_train_samples, n_features)
y_train = np.apply_along_axis(func1d=lambda x: func(x, factor=factor), axis=1, arr=X_train)
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
fci.random_forest_error(
    forest=rf,
    X_train_shape=X_train.shape,
    X_test=X_train,
    inbag=None,
    calibrate=True,
    memory_constrained=False,
    memory_limit=None,
    y_output=None
)

when factor > 10000, I always get

forest-confidence-interval\forestci\calibration.py:86: RuntimeWarning: overflow encountered in exp
  g_eta_raw = np.exp(np.dot(XX, eta)) * mask
forest-confidence-interval\forestci\calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
  g_eta_main = g_eta_raw / sum(g_eta_raw)
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan])

This makes it clear that large absolute values in the samples lead to overflow: StandardScaler by definition helps mitigate the problem, but we should look for a more numerically stable solution that avoids this ceiling on large values.
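
To sketch what a more stable alternative could look like: shifting the exponent by its maximum before calling np.exp (the usual log-sum-exp trick) leaves the normalized ratios unchanged but keeps the exponentials finite. This is only an illustration of the idea, not a tested patch for calibration.py:

import numpy as np

def normalize_naive(logits, mask):
    g_eta_raw = np.exp(logits) * mask        # overflows to inf for large logits
    return g_eta_raw / np.sum(g_eta_raw)     # inf / inf -> nan

def normalize_stable(logits, mask):
    shifted = logits - np.max(logits)        # same ratios, exponent now <= 0
    g_eta_raw = np.exp(shifted) * mask
    return g_eta_raw / np.sum(g_eta_raw)

logits = np.array([1000.0, 1001.0, 999.0])   # large values, as with unscaled targets
mask = np.ones_like(logits)
print(normalize_naive(logits, mask))    # [nan nan nan] plus the RuntimeWarnings above
print(normalize_stable(logits, mask))   # finite weights that sum to 1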

danieleongari avatar Sep 13 '22 08:09 danieleongari