forest-confidence-interval
Array dimensions incorrect for confidence intervals
Hi,
I'm trying to create error estimates using RandomForestRegressor with bootstrapping enabled. My data has these dimensions:
x_train: (90, 13), y_train: (90, 2), x_test: (10, 13), y_test: (10, 2)
I then generate errors using:
y_error = fci.random_forest_error(self.model, self.x_train, self.x_test)
However I get the error:
Generating point estimates...
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 33 tasks | elapsed: 0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 0.0s finished
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_2626600/1096083143.py in <module>
----> 1 point_estimates = model.point_estimate(save_estimates=True, make_plots=False)
2 print(point_estimates)
/scratch/wiay/lara/galpro/galpro/model.py in point_estimate(self, save_estimates, make_plots)
158 # Use the model to make predictions on new objects
159 y_pred = self.model.predict(self.x_test)
--> 160 y_error = fci.random_forest_error(self.model, self.x_train, self.x_test)
161
162 # Update class variables
~/.local/lib/python3.7/site-packages/forestci/forestci.py in random_forest_error(forest, X_train, X_test, inbag, calibrate, memory_constrained, memory_limit)
279 n_trees = forest.n_estimators
280 V_IJ = _core_computation(
--> 281 X_train, X_test, inbag, pred_centered, n_trees, memory_constrained, memory_limit
282 )
283 V_IJ_unbiased = _bias_correction(V_IJ, inbag, pred_centered, n_trees)
~/.local/lib/python3.7/site-packages/forestci/forestci.py in _core_computation(X_train, X_test, inbag, pred_centered, n_trees, memory_constrained, memory_limit, test_mode)
135 """
136 if not memory_constrained:
--> 137 return np.sum((np.dot(inbag - 1, pred_centered.T) / n_trees) ** 2, 0)
138
139 if not memory_limit:
<__array_function__ internals> in dot(*args, **kwargs)
ValueError: shapes (90,100) and (100,10,2) not aligned: 100 (dim 1) != 10 (dim 1)
Does anyone have any idea what is going wrong here? Thanks!
Hi, it looks like you have a multi-output regressor with 2 targets. My PR https://github.com/scikit-learn-contrib/forest-confidence-interval/pull/113 addressing this problem was just merged.
Now you can obtain the errors with:
y0_error = fci.random_forest_error(self.model, self.x_train, self.x_test, y_output=0)
y1_error = fci.random_forest_error(self.model, self.x_train, self.x_test, y_output=1)
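For context, the shape mismatch in the traceback can be reproduced with plain numpy: _core_computation contracts inbag of shape (n_train, n_trees) against pred_centered.T, which only lines up when the predictions are 2-D, i.e. for a single output. Here is a minimal sketch using the shapes from the log (the exact internal layout of pred_centered is an assumption for illustration):

```python
import numpy as np

n_train, n_trees, n_test, n_outputs = 90, 100, 10, 2

inbag = np.ones((n_train, n_trees))             # bootstrap counts, (90, 100)
multi = np.zeros((n_outputs, n_test, n_trees))  # multi-output predictions; .T is (100, 10, 2)

try:
    np.dot(inbag - 1, multi.T)                  # same ValueError as in the traceback
except ValueError as e:
    err = e
    print(err)

# Selecting a single output restores the 2-D case the formula expects:
single = multi[0]                               # (n_test, n_trees)
V_IJ = np.sum((np.dot(inbag - 1, single.T) / n_trees) ** 2, 0)
print(V_IJ.shape)                               # (10,) -- one variance per test point
```

This is exactly what the y_output parameter does: it slices out one output so the dot product is back to 2-D.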
Hi, I'm trying to estimate confidence intervals for a 7-output regression, and I'm not sure what I'm doing wrong here. Thank you for any recommendations.
Shape of data:
print(X_train.shape)  # (1400, 11)
print(X_test.shape)   # (600, 11)
print(y_train.shape)  # (1400, 7)
print(y_test.shape)   # (600, 7)
The model:
model = RandomForestRegressor(n_estimators=4, random_state=42)
Calling fci:
y0_error = fci.random_forest_error(model, X_train, X_test, y_output=0)
y1_error = fci.random_forest_error(model, X_train, X_test, y_output=1)
TypeError: random_forest_error() got an unexpected keyword argument 'y_output'
Hi @tpflana,
I guess you installed version 0.6 (i.e., with the default pip install forestci) and not the latest master version: there has been no release since I implemented my shortcut to handle multiple outputs (https://github.com/scikit-learn-contrib/forest-confidence-interval/pull/113).
@arokem what about bumping to 0.7?
In the meantime you can instead use:
pip install git+https://github.com/scikit-learn-contrib/forest-confidence-interval.git@master
to install the latest master version which has my fix.
Thank you very much @danieleongari This solved the parameter issue, but now there are new issues. Looking forward to the new release.
File ~/opt/anaconda3/lib/python3.11/site-packages/forestci/forestci.py:70, in calc_inbag(n_samples, forest)
     67     raise ValueError(e_s)
     69 n_trees = forest.n_estimators
---> 70 inbag = np.zeros((n_samples, n_trees))
     71 sample_idx = []
     72 if isinstance(forest, BaseForest):
TypeError: only integer scalar arrays can be converted to a scalar index
I wonder how it is possible that n_samples and n_trees are not both integers.
Maybe you ran into a numerical error by copy-pasting settings from a test. Since you declared:
RandomForestRegressor(n_estimators=4)
you cannot use just 4 trees for your dataset. Try the default n_estimators=100, or a number equal to your number of training samples: you need even more trees than usual to compute the infinitesimal jackknife confidence intervals.
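For intuition on why so many trees are needed: the infinitesimal jackknife variance computed by _core_computation is a Monte-Carlo estimate over trees, so its noise only shrinks as the number of trees grows. A toy numpy sketch, where each "tree" simply predicts the mean of its bootstrap sample (an illustrative stand-in for a real tree, not the library's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_trees = 200, 2000

y = rng.normal(size=n_train)

# Bootstrap each "tree" and record in-bag counts per training sample
idx = rng.integers(0, n_train, size=(n_trees, n_train))
inbag = np.stack([np.bincount(row, minlength=n_train) for row in idx],
                 axis=1)                     # (n_train, n_trees)

# Toy tree prediction for one test point: the mean of the bootstrap sample
pred = y[idx].mean(axis=1)                   # (n_trees,)
pred_centered = pred - pred.mean()

# Infinitesimal jackknife variance, same formula shape as _core_computation
V_IJ = np.sum((np.dot(inbag - 1, pred_centered) / n_trees) ** 2)
print(V_IJ)  # roughly var(y)/n_train with many trees; noisy and biased with few
```

With only a handful of trees the estimate is dominated by Monte-Carlo noise, which is what the bias correction and calibration then struggle with.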
Oh, maybe I see what happened:
y0_error = fci.random_forest_error(model, X_train, X_test, y_output=0)
should be instead
y0_error = fci.random_forest_error(model, X_train_shape, X_test, y_output=0)
because of https://github.com/scikit-learn-contrib/forest-confidence-interval/pull/111.
@arokem this repository is no longer well maintained: it has conflicting merges, outdated documentation and outdated releases (https://github.com/scikit-learn-contrib/forest-confidence-interval/issues/112). If you no longer have the time to keep an eye on it, isn't there someone else who can support you with it? I can help, let's get in touch.
No luck... I made the two changes: n_estimators = 100, and since X_train_shape is not recognized, I tried passing the training shape:
y0_error = fci.random_forest_error(model, X_train.shape, X_test, y_output=0)
forestci/calibration.py:86: RuntimeWarning: overflow encountered in exp
  g_eta_raw = np.exp(np.dot(XX, eta)) * mask
forestci/calibration.py:101: RuntimeWarning: overflow encountered in exp
  g_eta_raw = np.exp(np.dot(XX, eta_hat)) * mask
forestci/calibration.py:102: RuntimeWarning: invalid value encountered in divide
  g_eta_main = g_eta_raw / sum(g_eta_raw)
(the same three warnings are emitted twice)
OK, you solved the main problems, but you are left with a numerical error because you are using calibrate=True with too few trees.
My final suggestions:
- use a larger n_estimators while keeping calibrate=True: this will still be fragile, but if it works, it is fine
- use a very large n_estimators (about 10x your number of training samples) so that you can skip calibration (calibrate=False); this is computationally very expensive
You need many trees to be able to estimate the confidence interval.
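The RuntimeWarnings above come from the calibration step exponentiating large values: np.exp overflows float64 to inf, and normalizing an inf entry by an inf sum produces nan. A minimal numpy illustration of that failure chain (toy inputs, not the library's actual values):

```python
import numpy as np

with np.errstate(over="ignore", invalid="ignore"):
    g_eta_raw = np.exp(np.array([800.0, 1000.0]))  # exp overflows float64 above ~709
    g_eta = g_eta_raw / g_eta_raw.sum()            # inf / inf -> nan

print(g_eta_raw)  # [inf inf]
print(g_eta)      # [nan nan]
```

More trees give a better-conditioned variance estimate, which keeps the calibration's exponents in a representable range.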
Will give it a try. Thank you very much for your help.
Using a very large number of trees and no calibration worked! Thank you again for your help!
@danieleongari : you are correct that I don't have the bandwidth to keep maintaining this software very well. I just gave you admin rights on the repo. If you tell me your pypi user-name I can also give you permissions on the pypi package to manage releases.
@arokem thanks for the trust, I appreciate it. I'm not sure I can devote a lot of time, but at least I can keep an eye on the issues and PRs. My PyPI account is the same as my GitHub username: danieleongari