forest-confidence-interval Zero confidence intervals

Hi,

I am trying to generate confidence intervals for the below sample data. I am using RandomForestRegressor with bootstrapping enabled.

X_train shape (270, 7) [[ 12. 20. 1. ... 300. 1. 0.] [ 12. 20. 1. ... 300. 1. 0.] [ 12. 20. 1. ... 300. 1. 0.] ... [ 12. 30. 10. ... 300. 1. 0.] [ 12. 30. 10. ... 300. 1. 0.] [ 12. 30. 10. ... 300. 1. 0.]]

Test data shape (36,7) [[ 12. 10. 1. 1. 300. 1. 0.] [ 12. 10. 5. 1. 300. 1. 0.] [ 12. 10. 10. 1. 300. 1. 0.] ... [ 12. 10. 1. 4. 300. 1. 0.] [ 12. 20. 1. 4. 300. 1. 0.] [ 12. 30. 1. 4. 300. 1. 0.]]

I generate the CI data as follows:

ci_data = fci.random_forest_error(model, x_train, x_test,  calibrate=True)

However, every value in ci_data is effectively zero:

[1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34 1.89472718e-34]

Do you have any pointers as to what could be going wrong here? Thanks.

Mar 01 '18 01:03 swarnendubiswas

I'm not sure. Might be related to the feature that has the same value for every observation?

Mar 01 '18 20:03 arokem

Thanks for the tip. I tried it, but it does not help. I trimmed my data set so that it contains only the features that vary, plus the label, but I still get zero CIs.

Without normalization (which is possibly not a must for random forests), I get the following warning:

/usr/local/lib/python3.6/dist-packages/forestci/calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
  g_eta_main = g_eta_raw / sum(g_eta_raw)

This happens because all entries in mask are zero:

    def neg_loglik(eta):
        mask = np.ones_like(xvals)
        mask[np.where(xvals <= 0)[0]] = 0
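A minimal sketch of how this produces NaNs, assuming g_eta_raw is an exponential term multiplied by mask and then divided by its own sum, as in the warning above. The function name normalized_density and the toy inputs are hypothetical and not part of forestci:

```python
import numpy as np


def normalized_density(linear_term, mask):
    """Hypothetical stand-in for the masked normalization step."""
    # Suppress the RuntimeWarnings so the NaNs can be inspected directly.
    with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
        g_eta_raw = np.exp(linear_term) * mask
        return g_eta_raw / np.sum(g_eta_raw)


linear_term = np.array([0.1, 0.2, 0.3])

# Some grid points survive the mask: the result sums to 1.
ok = normalized_density(linear_term, np.array([0.0, 1.0, 1.0]))

# Every entry masked out: the sum is 0, so each entry becomes 0 / 0 = nan.
all_masked = normalized_density(linear_term, np.zeros(3))

# Overflow in exp: the overflowing entry becomes inf / inf = nan.
overflowed = normalized_density(np.array([1000.0, 1.0]), np.ones(2))
```

Either path (an all-zero mask or an overflowing exponent) would be consistent with the "invalid value encountered in true_divide" warning.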

I have attached the data csv if you are interested.

data.zip

Mar 02 '18 06:03 swarnendubiswas

Could you also send along the code you ran?

Mar 05 '18 04:03 arokem

I got it to work by using a StandardScaler() or MinMaxScaler(). Otherwise, I get the following warnings and NaN output:

/lib/python3.6/site-packages/forestci/calibration.py:86: RuntimeWarning: overflow encountered in exp
  g_eta_raw = np.exp(np.dot(XX, eta)) * mask
/lib/python3.6/site-packages/forestci/calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
  g_eta_main = g_eta_raw / sum(g_eta_raw)
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan]

This is the code, and I have attached the data file (data.csv.zip):

import csv
import sys
import numpy as np
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import sklearn.model_selection as xval
import forestci as fci

# Read data from the CSV file: columns 1-3 are the features, the last column is the label
mpg_X = []
mpg_Y = []
with open("./data.csv", "r") as file:
    reader = csv.reader(file)
    for line in reader:
        line = [float(x) for x in line]
        mpg_X.append(line[1:4])
        mpg_Y.append(line[-1])

mpg_X = np.array(mpg_X)
mpg_Y = np.array(mpg_Y)

# xscaler = MinMaxScaler()
# yscaler = MinMaxScaler()

# n_mpg_x = xscaler.fit_transform(mpg_X)
# n_mpg_y = yscaler.fit_transform(mpg_Y.reshape(-1, 1))
n_mpg_x = mpg_X
n_mpg_y = mpg_Y

# split mpg data into training and test set
mpg_X_train, mpg_X_test, mpg_y_train, mpg_y_test = xval.train_test_split(n_mpg_x, n_mpg_y, test_size=0.1,
                                                                         random_state=42)

mpg_forest = RandomForestRegressor(n_estimators=200, random_state=42)
mpg_forest.fit(mpg_X_train, mpg_y_train.ravel())
mpg_y_hat = mpg_forest.predict(mpg_X_test)

# Calculate the variance:
# inbag = fci.calc_inbag(mpg_X_train.shape[0], mpg_forest)
mpg_V_IJ_unbiased = fci.random_forest_error(mpg_forest, mpg_X_train, mpg_X_test,   # inbag=inbag,
                                            calibrate=True)
print(mpg_V_IJ_unbiased)

plt.scatter(mpg_y_test, mpg_y_hat)
min_x = min(min(mpg_y_test), min(mpg_y_hat))
max_x = max(max(mpg_y_test), max(mpg_y_hat))

plt.plot([min_x, max_x], [min_x, max_x], 'k--')
plt.xlabel('Reported label')
plt.ylabel('Predicted label')
plt.show()

plt.errorbar(mpg_y_test, mpg_y_hat, yerr=np.sqrt(mpg_V_IJ_unbiased), fmt='o')
plt.plot([min_x, max_x], [min_x, max_x], 'k--')
plt.xlabel('Reported label')
plt.ylabel('Predicted label')
plt.show()

Mar 05 '18 18:03 swarnendubiswas

@arokem Did you have a chance to try out the code? Were you able to reproduce the error I faced?

May 31 '18 16:05 swarnendubiswas

Thanks for posting this. I was also getting nans, and was able to work backwards by starting with your example and filling in my own data.

Apr 17 '19 23:04 stl-christywilloughby

@stl-christywilloughby Welcome. Have you been able to fix the NaNs? If so, were you able to identify any patterns in the data or usage that cause this?

Apr 18 '19 02:04 swarnendubiswas

I also get the same error:

/lib/python3.6/site-packages/forestci/calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
  g_eta_main = g_eta_raw / sum(g_eta_raw)

I've been running this code on my own dataset, and this error appears both with and without StandardScaler. Is there any update on the potential cause?

Jun 24 '19 20:06 charlesxjyang

I am having the same error using my own dataset. It appears that not all of the features in my dataset produce this error (i.e., when I subsample my dataset, the code works, depending upon the exact subsample chosen):

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/forestci/calibration.py:86: RuntimeWarning: overflow encountered in exp
  g_eta_raw = np.exp(np.dot(XX, eta)) * mask
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/forestci/calibration.py:101: RuntimeWarning: overflow encountered in exp
  g_eta_raw = np.exp(np.dot(XX, eta_hat)) * mask
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/forestci/calibration.py:102: RuntimeWarning: invalid value encountered in true_divide
  g_eta_main = g_eta_raw / sum(g_eta_raw)

Feb 02 '20 13:02 jayarehart

@swarnendubiswas

I encountered the same issue. After I tried your code with MinMaxScaler(), the problem was solved, but the confidence intervals are on the MinMaxScaler scale. How can we convert these confidence intervals back to the original units?

Thanks.

Dec 25 '20 04:12 haijunli0629

@smile4lee Sorry I do not get your question.

Dec 29 '20 10:12 swarnendubiswas

@smile4lee Sorry I do not get your question.

@swarnendubiswas Sorry for the confusion. I mean an issue similar to the one mentioned in #83. We need the variance on the same scale as the original data (without the scaler). How can we transform the variances so that they correspond to the original data?
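One way to express the conversion, as a sketch that assumes only the target y was scaled with scikit-learn's MinMaxScaler and that the variances were computed in that scaled space. Because the scaler applies y_scaled = y * scale_ + min_, a variance in scaled units is divided by scale_**2 to return to the original units. The arrays below are made-up illustrative values, not output from forestci:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up target values in original units (illustrative only).
y = np.array([10.0, 20.0, 35.0, 50.0])

yscaler = MinMaxScaler()
y_scaled = yscaler.fit_transform(y.reshape(-1, 1)).ravel()

# Hypothetical variances in scaled units, standing in for the array
# returned by fci.random_forest_error on a forest fitted to y_scaled.
var_scaled = np.array([0.0004, 0.0025, 0.01])

# MinMaxScaler computes y_scaled = y * scale_ + min_,
# so variances are multiplied by scale_**2 when scaling; undo that here.
scale = yscaler.scale_[0]
var_original = var_scaled / scale**2
std_original = np.sqrt(var_original)
```

With StandardScaler the relation is y_scaled = (y - mean_) / scale_, so the variance would instead be multiplied by scale_**2. Scaling only the features X leaves the units of the variance unchanged.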

Dec 31 '20 08:12 haijunli0629

According to the documentation, forestci.random_forest_error performs calibration by default. Set calibrate=False and you will not get the NaNs. As for the calibration method itself, you will have to go over the code in detail.

I personally do the calibration after I have obtained the standard deviation with forestci.random_forest_error.

Hope this helps.

Jun 30 '21 21:06 DariusRoman

Is there any update regarding this issue? Avoiding calibration doesn't seem to be a solution, since the estimated variance can be negative.
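A small sketch for inspecting this, assuming V_IJ holds uncalibrated variance estimates (the values below are made up for illustration). Clipping negative estimates to zero before taking the square root is only a crude stopgap, not a substitute for calibration:

```python
import numpy as np

# Made-up uncalibrated variance estimates; some are negative.
V_IJ = np.array([0.8, -0.05, 1.3, -0.2, 0.4])

negative = V_IJ < 0
print(f"{negative.sum()} of {V_IJ.size} estimates are negative")

# np.sqrt of a negative value yields nan, so clip at zero first.
# This hides the problem rather than fixing it.
std_clipped = np.sqrt(np.clip(V_IJ, 0.0, None))
```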

Thank you in advance!

Mar 14 '22 16:03 Niccolo-Ajroldi

This thread helped me realize that the issue was with calibration, so I turned it off. But as @Niccolo-Ajroldi said, this is not a solution but a workaround. I will do the calibration differently for now.

Jul 26 '22 22:07 itsamejoshab
