forest-confidence-interval
Array dimensions incorrect for confidence intervals
Hi,
I'm trying to create error estimates using RandomForestRegressor with bootstrapping enabled. My data has these dimensions:
x_train: (90, 13), y_train: (90, 2), x_test: (10, 13), y_test: (10, 2)
I then generate errors using:
y_error = fci.random_forest_error(self.model, self.x_train, self.x_test)
However I get the error:
Generating point estimates...
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 33 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 0.0s finished
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_2626600/1096083143.py in <module>
----> 1 point_estimates = model.point_estimate(save_estimates=True, make_plots=False)
2 print(point_estimates)
/scratch/wiay/lara/galpro/galpro/model.py in point_estimate(self, save_estimates, make_plots)
158 # Use the model to make predictions on new objects
159 y_pred = self.model.predict(self.x_test)
--> 160 y_error = fci.random_forest_error(self.model, self.x_train, self.x_test)
161
162 # Update class variables
~/.local/lib/python3.7/site-packages/forestci/forestci.py in random_forest_error(forest, X_train, X_test, inbag, calibrate, memory_constrained, memory_limit)
279 n_trees = forest.n_estimators
280 V_IJ = _core_computation(
--> 281 X_train, X_test, inbag, pred_centered, n_trees, memory_constrained, memory_limit
282 )
283 V_IJ_unbiased = _bias_correction(V_IJ, inbag, pred_centered, n_trees)
~/.local/lib/python3.7/site-packages/forestci/forestci.py in _core_computation(X_train, X_test, inbag, pred_centered, n_trees, memory_constrained, memory_limit, test_mode)
135 """
136 if not memory_constrained:
--> 137 return np.sum((np.dot(inbag - 1, pred_centered.T) / n_trees) ** 2, 0)
138
139 if not memory_limit:
<__array_function__ internals> in dot(*args, **kwargs)
ValueError: shapes (90,100) and (100,10,2) not aligned: 100 (dim 1) != 10 (dim 1)
Does anyone have any idea what is going wrong here? Thanks!
Hi, it looks like you have a multi-output regressor with 2 targets. My PR https://github.com/scikit-learn-contrib/forest-confidence-interval/pull/113 addressing this problem was just merged.
Now you can obtain the errors with:
y0_error = fci.random_forest_error(self.model, self.x_train, self.x_test, y_output=0)
y1_error = fci.random_forest_error(self.model, self.x_train, self.x_test, y_output=1)
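For context, the shape mismatch in the traceback can be reproduced with plain numpy: _core_computation contracts inbag of shape (n_train, n_trees) against pred_centered.T, which only lines up when the predictions are 2-D, i.e. for a single output. Here is a minimal sketch using the shapes from the log (the exact internal layout of pred_centered is an assumption for illustration):

```python
import numpy as np

n_train, n_trees, n_test, n_outputs = 90, 100, 10, 2

inbag = np.ones((n_train, n_trees))             # bootstrap counts, (90, 100)
multi = np.zeros((n_outputs, n_test, n_trees))  # multi-output predictions; .T is (100, 10, 2)

try:
    np.dot(inbag - 1, multi.T)                  # same ValueError as in the traceback
except ValueError as e:
    err = e
    print(err)

# Selecting a single output restores the 2-D case the formula expects:
single = multi[0]                               # (n_test, n_trees)
V_IJ = np.sum((np.dot(inbag - 1, single.T) / n_trees) ** 2, 0)
print(V_IJ.shape)                               # (10,) -- one variance per test point
```

This is exactly what the y_output parameter does: it slices out one output so the dot product is back to 2-D.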
Hi, I'm trying to estimate confidence intervals for a 7-output regression, and I'm not sure what I'm doing wrong here. Thank you for any recommendations.
Shape of data:
print(X_train.shape)  # (1400, 11)
print(X_test.shape)   # (600, 11)
print(y_train.shape)  # (1400, 7)
print(y_test.shape)   # (600, 7)
The model:
model = RandomForestRegressor(n_estimators=4, random_state=42)
Calling fci:
y0_error = fci.random_forest_error(model, X_train, X_test, y_output=0)
y1_error = fci.random_forest_error(model, X_train, X_test, y_output=1)
TypeError: random_forest_error() got an unexpected keyword argument 'y_output'
Hi @tpflana,
I guess you installed version 0.6 (i.e., with the default pip install forestci) and not the latest master version: there has been no release since I implemented my shortcut to handle multiple outputs (https://github.com/scikit-learn-contrib/forest-confidence-interval/pull/113).
@arokem what about bumping to 0.7?
In the meantime you can instead use:
pip install git+https://github.com/scikit-learn-contrib/forest-confidence-interval.git@master
to install the latest master version which has my fix.
Thank you very much @danieleongari This solved the parameter issue, but now there are new issues. Looking forward to the new release.
File ~/opt/anaconda3/lib/python3.11/site-packages/forestci/forestci.py:70, in calc_inbag(n_samples, forest)
     67     raise ValueError(e_s)
     69 n_trees = forest.n_estimators
---> 70 inbag = np.zeros((n_samples, n_trees))
     71 sample_idx = []
     72 if isinstance(forest, BaseForest):
TypeError: only integer scalar arrays can be converted to a scalar index
I wonder how it is possible that n_samples and n_trees are not both integers.
Maybe you ran into a numerical error by copy-pasting settings from a test. Since you declared:
RandomForestRegressor(n_estimators=4)
you cannot use just 4 trees for your dataset. Try the default n_estimators=100, or a number equal to your number of training samples: you need even more trees than usual to compute the infinitesimal jackknife confidence intervals.
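For intuition on why so many trees are needed: the infinitesimal jackknife variance computed by _core_computation is a Monte-Carlo estimate over trees, so its noise only shrinks as the number of trees grows. A toy numpy sketch, where each "tree" simply predicts the mean of its bootstrap sample (an illustrative stand-in for a real tree, not the library's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_trees = 200, 2000

y = rng.normal(size=n_train)

# Bootstrap each "tree" and record in-bag counts per training sample
idx = rng.integers(0, n_train, size=(n_trees, n_train))
inbag = np.stack([np.bincount(row, minlength=n_train) for row in idx],
                 axis=1)                     # (n_train, n_trees)

# Toy tree prediction for one test point: the mean of the bootstrap sample
pred = y[idx].mean(axis=1)                   # (n_trees,)
pred_centered = pred - pred.mean()

# Infinitesimal jackknife variance, same formula shape as _core_computation
V_IJ = np.sum((np.dot(inbag - 1, pred_centered) / n_trees) ** 2)
print(V_IJ)  # roughly var(y)/n_train with many trees; noisy and biased with few
```

With only a handful of trees the estimate is dominated by Monte-Carlo noise, which is what the bias correction and calibration then struggle with.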
Oh, maybe I see what happened:
y0_error = fci.random_forest_error(model, X_train, X_test, y_output=0)
should be instead
y0_error = fci.random_forest_error(model, X_train_shape, X_test, y_output=0)
because of https://github.com/scikit-learn-contrib/forest-confidence-interval/pull/111.
@arokem this repository is no longer well maintained: it has conflicting merges, outdated documentation and outdated releases (https://github.com/scikit-learn-contrib/forest-confidence-interval/issues/112). If you no longer have the time to keep an eye on it, isn't there someone else who can support you with it? I can help, let's get in touch.
No luck... I made the two changes: n_estimators = 100, and since X_train_shape is not recognized, I tried passing the training shape:
y0_error = fci.random_forest_error(model, X_train.shape, X_test, y_output=0)
forestci/calibration.py:86: RuntimeWarning: overflow encountered in exp
  g_eta_raw = np.exp(np.dot(XX, eta)) * mask
forestci/calibration.py:101: RuntimeWarning: overflow encountered in exp
  g_eta_raw = np.exp(np.dot(XX, eta_hat)) * mask
forestci/calibration.py:102: RuntimeWarning: invalid value encountered in divide
  g_eta_main = g_eta_raw / sum(g_eta_raw)
(the same three warnings are emitted twice)
OK, you solved the main problems, but you are left with a numerical error because you are using calibrate=True with too few trees.
My final suggestions:
- use a larger n_estimators while keeping calibrate=True: this will still be fragile, but if it works, it is fine
- use a very large n_estimators (about 10x your number of training samples) so that you can skip calibration (calibrate=False); this is computationally very expensive
You need many trees to be able to estimate the confidence interval.
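The RuntimeWarnings above come from the calibration step exponentiating large values: np.exp overflows float64 to inf, and normalizing an inf entry by an inf sum produces nan. A minimal numpy illustration of that failure chain (toy inputs, not the library's actual values):

```python
import numpy as np

with np.errstate(over="ignore", invalid="ignore"):
    g_eta_raw = np.exp(np.array([800.0, 1000.0]))  # exp overflows float64 above ~709
    g_eta = g_eta_raw / g_eta_raw.sum()            # inf / inf -> nan

print(g_eta_raw)  # [inf inf]
print(g_eta)      # [nan nan]
```

More trees give a better-conditioned variance estimate, which keeps the calibration's exponents in a representable range.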
Will give it a try. Thank you very much for your help.
Using a very large number of trees and no calibration worked! Thank you again for your help!
@danieleongari : you are correct that I don't have the bandwidth to keep maintaining this software very well. I just gave you admin rights on the repo. If you tell me your pypi user-name I can also give you permissions on the pypi package to manage releases.
@arokem thanks for the trust, I appreciate it. I'm not sure I can devote a lot of time, but at least I can keep an eye on the issues and PRs. My PyPI account is the same as my GitHub username: danieleongari