forest-confidence-interval
Zero confidence intervals
Hi,
I am trying to generate confidence intervals for the below sample data. I am using RandomForestRegressor with bootstrapping enabled.
X_train shape (270, 7) [[ 12. 20. 1. ... 300. 1. 0.] [ 12. 20. 1. ... 300. 1. 0.] [ 12. 20. 1. ... 300. 1. 0.] ... [ 12. 30. 10. ... 300. 1. 0.] [ 12. 30. 10. ... 300. 1. 0.] [ 12. 30. 10. ... 300. 1. 0.]]
Test data shape (36,7) [[ 12. 10. 1. 1. 300. 1. 0.] [ 12. 10. 5. 1. 300. 1. 0.] [ 12. 10. 10. 1. 300. 1. 0.] ... [ 12. 10. 1. 4. 300. 1. 0.] [ 12. 20. 1. 4. 300. 1. 0.] [ 12. 30. 1. 4. 300. 1. 0.]]
I generate ci data as
ci_data = fci.random_forest_error(model, x_train, x_test, calibrate=True)
However, every entry in ci_data is effectively zero:
[1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34]
Do you have any pointers as to what could be going wrong here? Thanks.
I'm not sure. Might be related to the feature that has the same value for every observation?
Thanks for the tip. I had tried it, but it does not help. I trimmed my data set to now contain only features that change and the label. But I still get the zero CIs.
Without normalization (which is possibly not a must for random forests), I get the following error:
/usr/local/lib/python3.6/dist-packages/forestci/calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
g_eta_main = g_eta_raw / sum(g_eta_raw)
This happens because all entries in mask are zero:
def neg_loglik(eta):
    mask = np.ones_like(xvals)
    mask[np.where(xvals <= 0)[0]] = 0
I have attached the data csv if you are interested.
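A minimal sketch of how an all-zero mask leads to that warning, assuming the normalization divides by the sum of the masked values as in the line quoted above. The grid `xvals` and the value of `eta` are made up for illustration; they are not taken from forestci or from the attached data.

```python
import numpy as np

# Hypothetical grid with no strictly positive entries (illustration only).
xvals = np.linspace(-1.0, 0.0, 5)

# Same masking logic as the snippet quoted above.
mask = np.ones_like(xvals)
mask[np.where(xvals <= 0)[0]] = 0

# Stand-in for the unnormalized density; eta = 0.5 is an arbitrary value.
eta = 0.5
g_eta_raw = np.exp(xvals * eta) * mask

# Every term is zero, so the normalization is 0 / 0, which NumPy reports as
# "invalid value encountered" and evaluates to NaN.
with np.errstate(invalid="ignore"):
    g_eta_main = g_eta_raw / np.sum(g_eta_raw)

print(mask)
print(g_eta_main)
```

If this is what happens in practice, checking whether the grid passed to the calibration contains any positive values would be a reasonable first diagnostic.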
Could you also send along the code you ran?
I have got it to work if I use a StandardScaler() or MinMaxScaler(). Otherwise, I get the following error:
/lib/python3.6/site-packages/forestci/calibration.py:86: RuntimeWarning: overflow encountered in exp
g_eta_raw = np.exp(np.dot(XX, eta)) * mask
/lib/python3.6/site-packages/forestci/calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
g_eta_main = g_eta_raw / sum(g_eta_raw)
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan]
This is the code; the data file is attached (data.csv.zip):
import csv
import sys
import numpy as np
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import sklearn.model_selection as xval
import forestci as fci
# Read data from the CSV file
mpg_X = []
mpg_Y = []
with open("./data.csv", "r", newline="") as f:
    reader = csv.reader(f)
    for line in reader:
        line = [float(x) for x in line]
        mpg_X.append(line[1:4])  # feature columns
        mpg_Y.append(line[-1])   # label is the last column
mpg_X = np.array(mpg_X)
mpg_Y = np.array(mpg_Y)
# xscaler = MinMaxScaler()
# yscaler = MinMaxScaler()
# n_mpg_x = xscaler.fit_transform(mpg_X)
# n_mpg_y = yscaler.fit_transform(mpg_Y.reshape(-1, 1))
n_mpg_x = mpg_X
n_mpg_y = mpg_Y
# split mpg data into training and test set
mpg_X_train, mpg_X_test, mpg_y_train, mpg_y_test = xval.train_test_split(
    n_mpg_x, n_mpg_y, test_size=0.1, random_state=42)
mpg_forest = RandomForestRegressor(n_estimators=200, random_state=42)
mpg_forest.fit(mpg_X_train, mpg_y_train.ravel())
mpg_y_hat = mpg_forest.predict(mpg_X_test)
# Calculate the variance:
# inbag = fci.calc_inbag(mpg_X_train.shape[0], mpg_forest)
mpg_V_IJ_unbiased = fci.random_forest_error(mpg_forest, mpg_X_train, mpg_X_test,  # inbag=inbag,
                                            calibrate=True)
print(mpg_V_IJ_unbiased)
plt.scatter(mpg_y_test, mpg_y_hat)
min_x = min(min(mpg_y_test), min(mpg_y_hat))
max_x = max(max(mpg_y_test), max(mpg_y_hat))
plt.plot([min_x, max_x], [min_x, max_x], 'k--')
plt.xlabel('Reported label')
plt.ylabel('Predicted label')
plt.show()
plt.errorbar(mpg_y_test, mpg_y_hat, yerr=np.sqrt(mpg_V_IJ_unbiased), fmt='o')
plt.plot([min_x, max_x], [min_x, max_x], 'k--')
plt.xlabel('Reported label')
plt.ylabel('Predicted label')
plt.show()
@arokem Did you have a chance to try out the code? Were you able to reproduce the error I faced?
Thanks for posting this. I was also getting nans, and was able to work backwards by starting with your example and filling in my own data.
@stl-christywilloughby You're welcome. Have you been able to fix the NaNs? If so, were you able to identify any patterns in the data or usage that cause this?
I also get the same error:
/lib/python3.6/site-packages/forestci/calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
g_eta_main = g_eta_raw / sum(g_eta_raw)
I've been running this code on my own dataset and this error appears both when I use StandardScaler and if I don't use StandardScaler. Is there any update on the potential cause for this?
I am having the same error using my own dataset. It appears that not all of the features in my dataset produce this error (i.e., when I subsample my dataset, the code works, depending upon the exact subsample chosen):
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/forestci/calibration.py:86: RuntimeWarning: overflow encountered in exp
g_eta_raw = np.exp(np.dot(XX, eta)) * mask
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/forestci/calibration.py:101: RuntimeWarning: overflow encountered in exp
g_eta_raw = np.exp(np.dot(XX, eta_hat)) * mask
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/forestci/calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
g_eta_main = g_eta_raw / sum(g_eta_raw)
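These warnings look like a chain: once np.exp overflows to inf, the normalization divides inf by inf and yields NaN. The sketch below reproduces that chain with made-up exponents (assumed values, not ones taken from forestci or any dataset in this thread). It also shows the usual max-subtraction trick for computing the same normalized weights without overflow, purely as an illustration; nothing here implies the library does or should do this.

```python
import numpy as np

# Hypothetical exponents, standing in for np.dot(XX, eta) on unscaled data.
exponents = np.array([750.0, 800.0, 900.0])

with np.errstate(over="ignore", invalid="ignore"):
    raw = np.exp(exponents)      # every entry overflows to inf
    naive = raw / np.sum(raw)    # inf / inf -> NaN

# Shifting by the maximum leaves the ratios unchanged but keeps exp() finite.
shifted = np.exp(exponents - np.max(exponents))
stable = shifted / np.sum(shifted)

print(naive)
print(stable)
```

This would also be consistent with the reports above that rescaling the data makes the warnings disappear, since smaller inputs keep the exponent in a range where exp() stays finite.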
@swarnendubiswas
I encountered the same issue. After trying your code with MinMaxScaler(), the problem went away, but the confidence intervals are now in the scale of the MinMaxScaler. How can we convert these confidence intervals back to the original units?
Thanks.
@smile4lee Sorry I do not get your question.
@swarnendubiswas Sorry for the confusion. I mean an issue similar to the one mentioned in #83. We need the variance on the same scale as the original data (without the scaler). How can we transform the variances so that they correspond to the original data?
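One way to think about the units, sketched below: a variance is in squared units of the target, so if the labels were transformed as y_scaled = s * y + b, a variance computed on the scaled labels converts back as var / s**2 (and a standard deviation as std / s). For a fitted MinMaxScaler the factor s corresponds to its `scale_` attribute; for StandardScaler the data is divided by `scale_`, so s would be 1 / scale_. The example uses NumPy only, with made-up labels, and `unscale_variance` is a hypothetical helper, not part of forestci.

```python
import numpy as np


def unscale_variance(var_scaled, s):
    """Convert variances of s * y + b back to variances of y."""
    return np.asarray(var_scaled, dtype=float) / (s ** 2)


# Made-up labels, min-max scaled to [0, 1] by hand.
y = np.array([10.0, 20.0, 35.0, 50.0])
s = 1.0 / (y.max() - y.min())  # multiplicative factor of the transform
b = -y.min() * s               # additive offset (does not affect variance)
y_scaled = s * y + b

var_scaled = np.var(y_scaled)
var_original = unscale_variance(var_scaled, s)
std_original = np.sqrt(var_original)

print(var_original, np.var(y))
```

The offset b drops out because shifting a variable does not change its variance; only the multiplicative factor matters.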
According to the documentation, forestci.random_forest_error performs calibration. Set calibrate=False and you will not get the NaNs. As for the calibration method itself, you will have to go through the code in detail.
I personally do the calibration after obtaining the standard deviation with forestci.random_forest_error.
Hope this helps.
Is there any update regarding this issue? Avoiding calibration doesn't seem to be a solution, since the estimated variance can be negative.
Thank you in advance!
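To see how often the uncalibrated estimate goes negative before deciding how to handle it, a check along these lines may help. The variance values are made up, and flooring at zero is only a blunt stopgap for plotting error bars (an assumption for illustration, not a replacement for calibration).

```python
import numpy as np

# Made-up uncalibrated variance estimates; two of them are negative.
v_ij = np.array([0.8, -0.05, 0.3, -0.2, 1.1])

negative = v_ij < 0
print(f"{negative.sum()} of {v_ij.size} estimates are negative")

# np.sqrt of a negative number is NaN, so floor at zero before taking the
# square root for error bars.
v_floor = np.clip(v_ij, 0.0, None)
std_err = np.sqrt(v_floor)
print(std_err)
```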
This thread helped me realize that the issue was with calibration, so I turned it off. But as @Niccolo-Ajroldi said, this is not a solution but a workaround. I will do the calibration differently for now.