forest-confidence-interval
forest errors are all NaN
I have encountered this warning message when executing forestci.random_forest_error(regressor, x_train, x_test):

RuntimeWarning: invalid value encountered in true_divide
    g_eta_main = g_eta_raw / sum(g_eta_raw)
and the results are
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan])
Is this related to a division by zero?
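For context: this NaN does not require a literal division by zero. If any entry of the numerator overflows to inf, the sum in the denominator is also inf, and inf / inf is NaN. A minimal NumPy demonstration of the same pair of warnings (the variable names here are illustrative, not forestci's):

import numpy as np

# float64 overflows in exp for arguments above ~709
logits = np.array([10.0, 500.0, 800.0])
weights = np.exp(logits)         # RuntimeWarning: overflow encountered in exp
print(weights)                   # [2.2e+04  1.4e+217  inf]

# normalizing divides inf by inf, which is NaN, not a zero division
print(weights / weights.sum())   # RuntimeWarning: invalid value encountered in true_divide
                                 # [0.  0.  nan]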
I encounter the same problem.
I am having the same problem as well.
FWIW, my data has categorical features, which, from a cursory perusal of the code, I think might have played a role in this error.
I had the same issue with a RandomForestRegressor, but applying StandardScaler to the data SOLVED it, so this looks like a numerical problem.
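A minimal sketch of that workaround, calling random_forest_error the same way as in the original post. X_train, y_train, and X_test are placeholders for your own data; scaling the target as well (and un-scaling the returned variances) is my own addition, prompted by the reproduction below in which large target values drive the overflow:

import forestci as fci
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# standardize the features
x_scaler = StandardScaler().fit(X_train)
X_train_s = x_scaler.transform(X_train)
X_test_s = x_scaler.transform(X_test)

# standardize the target too, in case large y values are the culprit
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))
y_train_s = y_scaler.transform(y_train.reshape(-1, 1)).ravel()

rf = RandomForestRegressor()
rf.fit(X_train_s, y_train_s)

# variance estimates come back in the scaled units of y
errors_s = fci.random_forest_error(rf, X_train_s, X_test_s)

# map the variances back to the original units of y
errors = errors_s * y_scaler.scale_[0] ** 2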
Hey @danieleongari : do you want to add a note about this to the docs?
Sure, but first I would like to reproduce and isolate the problem, to see where non-scaled inputs make it arise. Do you have an idea of which line of code could overflow? PS: I have to wait for some free time to dedicate to this, so don't expect it soon :-)
In my testing, if I use the following sample generator

def func(x, noise=0.1, factor=1):
    return (np.sqrt(x[0]) + x[0]*x[1] + 2*x[1] + sum(x) + noise*np.random.normal()) * factor
in a script, e.g.,

import numpy as np
import forestci as fci
from sklearn.ensemble import RandomForestRegressor

n_features = 3
n_train_samples = 21

X_train = np.random.rand(n_train_samples, n_features)
y_train = np.apply_along_axis(func1d=func, axis=1, arr=X_train)
# to test other values of factor, wrap func, e.g.:
# y_train = np.apply_along_axis(lambda x: func(x, factor=10**5), axis=1, arr=X_train)

rf = RandomForestRegressor()
rf.fit(X_train, y_train)

fci.random_forest_error(
    forest=rf,
    X_train_shape=X_train.shape,
    X_test=X_train,
    inbag=None,
    calibrate=True,
    memory_constrained=False,
    memory_limit=None,
    y_output=None,
)
when factor > 10000 I always get:
forest-confidence-interval\forestci\calibration.py:86: RuntimeWarning: overflow encountered in exp
g_eta_raw = np.exp(np.dot(XX, eta)) * mask
forest-confidence-interval\forestci\calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
g_eta_main = g_eta_raw / sum(g_eta_raw)
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan])
This makes it clear that large absolute values in the samples lead to overflow. StandardScaler mitigates the problem by construction, but we should look for a more numerically stable solution that removes this ceiling on large values.
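One candidate, sketched here under the assumption that nothing else consumes g_eta_raw before it is normalized: since g_eta_main is a ratio, the exponent can be shifted by its maximum before calling np.exp (the usual log-sum-exp / softmax stabilization). The shift cancels in the normalization, so the result is mathematically identical, but the exponential can no longer overflow. The function name is mine, not forestci's:

import numpy as np

def stable_g_eta(XX, eta, mask):
    z = np.dot(XX, eta)
    # z - z.max() is <= 0 everywhere, so np.exp cannot overflow; the
    # constant factor exp(-z.max()) cancels between numerator and
    # denominator, leaving the normalized result unchanged.
    g_eta_raw = np.exp(z - z.max()) * mask
    return g_eta_raw / g_eta_raw.sum()

(scipy.special.softmax applies the same trick internally, but the mask multiply here makes a hand-rolled version simpler.)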