Questions about TreeExplainer runtime speed
I have tried running shap on a LightGBM model with a training set of 40k rows and 400 features, and a test set of 90k rows. I have not yet been able to get output from shap because it seems prohibitively slow: I have let it run for hours and it still did not complete. I have made great efforts to reduce both the training set size and the number of features in order to get shap to run in a reasonable amount of time. I'm wondering if:
- it is possible to provide an estimated runtime remaining, as it might make waiting easier, so I know it's not going to take years to finish
- there is a way to speed up the computation. Is the "independent" feature_dependence method any quicker, or is it slower? I know "approximate" should be faster, but it doesn't seem much better than number-of-splits importance, as both are "inconsistent". Is shap runtime more dependent on the number of features or the sample size, and is it the training set or the test set that runtime depends on more?
Thanks, Adam
I'm also running into the same question 👉 https://github.com/slundberg/shap/issues/829
@aamster @manoshape I'll try to answer both questions right here for TreeExplainer.
- TreeExplainer with `feature_dependence="tree_path_dependent"` does not depend on the training set size, only on the number and depth of the trees in the model. The strongest dependence is on the depth of the trees, where the runtime is quadratic in the actual depth of the trees.
- When you use `feature_dependence="independent"`, TreeExplainer's runtime scales linearly with the number of background samples you give it: 1000 background samples will be roughly 1000x slower than 1. I would never use more than 1000 background samples, and would just subsample if I had more.
- Of course, both depend on the number of samples you are explaining. Explaining 1000 samples is about 100 times slower than explaining 10 samples. This gives you an easy way to estimate runtime for many samples: just try it on a few samples and extrapolate.
Are there specific cases where you find the runtime not reasonable given these options?
> TreeExplainer with feature_dependence="tree_path_dependent" does not depend on the training set size, only on the number and depth of the trees in the model.
@slundberg I have run into issues with longer run times when using the TreeExplainer with models trained on larger training data sets, but with the same number of trees and the same depth for each tree. Below is an example.
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
X,y = shap.datasets.nhanesi()
X = X.dropna()
y = y[X.index]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
m1 = RandomForestRegressor(n_estimators=20, random_state=0, max_depth=10)
m1.fit(X_train[:1000], y_train[:1000])
m2 = RandomForestRegressor(n_estimators=20, random_state=0, max_depth=10)
m2.fit(X_train[:100], y_train[:100])
# Confirm that there are no differences in tree depths
all([m1_est.get_depth() == m2_est.get_depth() for m1_est, m2_est in zip(m1.estimators_, m2.estimators_)])
## True
Timing reveals that the m1 model trained on the 10x larger training set takes ~4x the time to calculate the shap values with the default feature_dependence="tree_path_dependent" (using %%timeit in jupyterlab):
%%timeit
explainer = shap.TreeExplainer(m1)
shap_values = explainer.shap_values(X_test)
## 1.56 s ± 4.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
explainer = shap.TreeExplainer(m2)
shap_values = explainer.shap_values(X_test)
## 371 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I work with somewhat large training datasets, so this time scaling becomes rather cumbersome to deal with. I am considering switching to the approximate calculation, but since you mentioned that the training data set size should not affect the calculation, I thought I would post here in case there is something that can be fixed in the TreeExplainer implementation so that the runtime would not increase with the size of the training data set.
@joelostblom the runtime with feature_perturbation="tree_path_dependent" (the new name for the feature_dependence option) only depends on the number of samples being explained and the size of the model. My guess is that while the max_depth of your models is 10, the random forest does not build complete depth-10 trees until it has a lot of training data. So what you are really seeing is the size of the model (specifically the size of the trees) growing as you have more data.
> My guess is that while the max_depth of your models is 10, the random forest does not build complete depth-10 trees until it has a lot of training data.
@slundberg I checked that in my code above (also pasted below), and all the estimators have the same actual depth, not just the same max depth.
# Confirm that there are no differences in tree depths
all([m1_est.get_depth() == m2_est.get_depth() for m1_est, m2_est in zip(m1.estimators_, m2.estimators_)])
## True
So all trees with the small training data set reached depth 10, which is the current max_depth. If I increase the allowed max depth, then the larger training data set creates deeper trees than the smaller one, but they both have enough data to reach depth 10.
But do they fill out the entire tree to that depth? In other words, do they have the full 2**10 leaves in each tree? If they do, then something might be wrong, but I would expect that there would be more nodes in the trees from the model trained on more data.
Ah ok, @slundberg you're right, the trees do have different numbers of leaves in them.
[m1_est.get_n_leaves() for m1_est in m1.estimators_]
## [189, 222, 182, 229, 185, 180, 260, 192, 180, 204, 208, 182, 213, 199, 207, 132, 196, 190, 191, 168]
[m2_est.get_n_leaves() for m2_est in m2.estimators_]
## [62, 52, 52, 56, 55, 58, 50, 63, 60, 45, 53, 47, 66, 44, 51, 51, 56, 63, 50, 59]
If I set max_leaf_nodes=40 for both m1 and m2, they take about as long to run with TreeExplainer.
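The max_leaf_nodes observation can be demonstrated without shap at all: capping the number of leaves keeps tree size (and hence TreeExplainer cost) constant regardless of training set size. This is a minimal sketch with synthetic data; the dataset and the cap of 40 are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=2000, n_features=20, random_state=0)

# Same forest settings, 10x different training set sizes, leaf count capped
m_small = RandomForestRegressor(n_estimators=10, max_leaf_nodes=40,
                                random_state=0).fit(X[:200], y[:200])
m_large = RandomForestRegressor(n_estimators=10, max_leaf_nodes=40,
                                random_state=0).fit(X, y)

# Both models end up with the same bounded tree size
print([e.get_n_leaves() for e in m_small.estimators_])
print([e.get_n_leaves() for e in m_large.estimators_])
```

With the cap in place, more training data changes which splits are chosen but not how many, so the explanation runtime stays flat.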
So just to check that I got this down right: TreeExplainer depends on
- The number of trees
- Their depth
- The number of leaves in each tree
The last two of those three will increase with more training data, so unless the model is limited to a set depth and number of leaves, more training data will lead to longer run times because more complex models are constructed?
That's right! Nice analysis.
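A quick way to sanity-check the three factors summarized above is to use the total node count across the forest as a rough proxy for TreeExplainer cost. This sketch (synthetic data, illustrative sizes) shows that with the same n_estimators and max_depth, the model trained on more data builds larger trees:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=5000, n_features=20, random_state=0)

def total_nodes(model):
    # Sum of internal + leaf nodes over all trees in the forest
    return sum(e.tree_.node_count for e in model.estimators_)

# Identical hyperparameters, 10x different training set sizes
m_small = RandomForestRegressor(n_estimators=20, max_depth=10,
                                random_state=0).fit(X[:500], y[:500])
m_large = RandomForestRegressor(n_estimators=20, max_depth=10,
                                random_state=0).fit(X, y)

print("small:", total_nodes(m_small), "large:", total_nodes(m_large))
```

If the node counts differ substantially, expect the explanation runtimes to differ in roughly the same proportion, even though max_depth is the same.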
Just a quick (and maybe obvious) question: should there be heavy calculation at initialization of the object, or just when predicting SHAP values? I'm seeing pretty long runtimes at shap.TreeExplainer(randomForestRegressor(.)) before I get to .shap_values(X). Is this expected? I haven't been able to look under the hood yet, so this might be a silly question.
Likewise, is there any possibility to add a timer? I've been hurt too many times by Python, and I've come to love tqdm. If something like that could be added, I think it might be much appreciated.
I verified that it was indeed shap_values(x) that was taking up the runtime; it seems the javascript might have suppressed the output of the notebook. Still, it would be great to get a timer: my model (500 trees trained on ~40k samples) took almost 17 hours on my computer, and it would be wonderful to know the runtime in advance if at all possible!
When running TreeExplainer with feature_perturbation="interventional" and a background dataset of 1000 randomly sampled rows, my notebook crashes. When using the TreeExplainer default settings, it runs but takes a very long time. How can I shorten the runtime without crashing the notebook?
> Just a quick (and maybe obvious) question: should there be heavy calculation at initialization of the object or just on predicting SHAP values? I'm seeing pretty long runtimes at shap.TreeExplainer(randomForestRegressor(.)) before I get to .shap_values(X), is this expected, I haven't been able to look under the hood yet though so this might be a silly question. Likewise, is there any possibility to add a timer? I've been hurt too many times by Python, and I've come to love tqdm. If something like that could be added, I think it might be much appreciated.
Hello @M-Harrington, I'm experiencing the same thing when creating a TreeExplainer from a pyspark random forest model. It's a pretty huge model, so the initialization of the explainer takes around 2.5 h. Of course, obtaining the shap_values then takes a long time too, but I'm just wondering if there are any corners I might be able to cut :)