save explainer?
is there a way to save an estimated explainer?
I tried saving it using joblib but got an error.
best, Mads
Which kind of explainer are you trying to save? TreeExplainer used on a LightGBM or XGBoost model keeps a reference to the original model object, which is probably not python serializable.
Just a general explainer from a logistic regression model
So a KernelExplainer object? I would have thought that it would pickle fine but I have not tried.
Hi @MadsJensen just curious, what kind of error do you get? The example below works (scikit-learn 0.19.2 / shap 0.25.1):
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.externals import joblib
import numpy as np
import shap

# fit a simple linear model on the diabetes dataset
X, y = shap.datasets.diabetes()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
lin_regr = linear_model.LinearRegression()
lin_regr.fit(X_train, y_train)

# build the explainer and compute SHAP values for one sample
ex = shap.KernelExplainer(lin_regr.predict, X_train)
shap_values = ex.shap_values(X_test.iloc[0, :])

# round-trip the explainer through joblib and check the values match
ex_filename = 'explainer.bz2'
joblib.dump(ex, filename=ex_filename, compress=('bz2', 9))
ex2 = joblib.load(filename=ex_filename)
shap_values2 = ex2.shap_values(X_test.iloc[0, :])
assert np.array_equal(shap_values, shap_values2)
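(A note for anyone running this in a newer environment: sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23, so the import above will fail there; the standalone joblib package behaves the same way:)
import joblib  # pip install joblib; replaces sklearn.externals.joblib
joblib.dump(ex, filename=ex_filename, compress=('bz2', 9))
ex2 = joblib.load(filename=ex_filename)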
Hi, sorry I can see my reply got lost.
I do not get an error with the above example, and I have tried to recreate my own error but was unsuccessful. I apologise for this.
Which kind of explainer are you trying to save? TreeExplainer used on a LightGBM or XGBoost model keeps a reference to the original model object, which is probably not python serializable.
Understood. But having to run the SHAP explainer over a large amount of data every time I restart the kernel is a bit of a pain. Do you think there is any other way out?
@DSXiangLi I have a bit of an obsession with saving all the work every step of the way, because (as you said) running the explainer at every kernel restart is a bit of a pain, especially with large data. I use a combination of jsonpickled sklearn models and pandas dataframes, all combined into an h5 file. I also do a silly thing where I package up some key metadata into pd dataframes, just so it's all together (e.g. the SHAP expected value, the classification threshold I used, the name of what I was trying to predict).
jsonpickle is a nice library for this, especially if you are working with the sklearn versions of, e.g., LightGBM, which do not have JSON output built in and cannot be stored in h5 without some kind of compatible serialization.
Here's an example for saving the SHAP values (doesn't require jsonpickle). I save more than this, but the basic syntax is the same:
First, generate the explainer:
explainer = shap.TreeExplainer(gbm_model)
features_shap = features.sample(n=20000, random_state=seed, replace=False)
shap_values = explainer.shap_values(features_shap)
shap_expected = explainer.expected_value
def shap_save_to_disk(self):
    print("Saving SHAP to .h5 file...")
    h5_file = self.h5_file
    # This step is unnecessary (the np arrays could be saved directly), but the
    # DataFrames have some advantages and I don't mind reconverting upon loading.
    shap_val_df = pd.DataFrame(self.shap_values)
    shap_feat_df = pd.DataFrame(self.features_shap)
    # define what goes in the first row with "d"
    d = [[self.target, self.name_for_figs, self.shap_expected, self.class_thresh]]
    exp_df = pd.DataFrame(
        d, columns=("target", "name_for_figs", "shap_exp_val", "class_thresh")
    )
    shap_val_df.to_hdf(h5_file, key="shap_values", format="table")
    shap_feat_df.to_hdf(h5_file, key="features_shap", format="table")
    exp_df.to_hdf(h5_file, key="shap_expected_value", format="table")
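Loading is symmetric. A minimal sketch of the counterpart (a hypothetical shap_load_from_disk, assuming the same h5 file and keys as above):
def shap_load_from_disk(self):
    # Hypothetical counterpart to shap_save_to_disk: read the DataFrames
    # back out of the h5 file using the same keys.
    self.shap_values = pd.read_hdf(self.h5_file, key="shap_values").to_numpy()
    self.features_shap = pd.read_hdf(self.h5_file, key="features_shap")
    exp_df = pd.read_hdf(self.h5_file, key="shap_expected_value")
    self.shap_expected = exp_df["shap_exp_val"].iloc[0]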
Here's an example for saving the model (requires jsonpickle), also overwrites any old models in the same file:
def lgbm_save_model_to_h5(self):
    print("JSONpickling the model...")
    frozen = jsonpickle.encode(self.gbm_model)
    print("Saving model to .h5 file...")
    h5_file = self.h5_file
    with h5py.File(h5_file, 'a') as f:
        try:
            f.create_dataset('self.gbm_model', data=frozen)
        except Exception as exc:
            print(traceback.format_exc())
            print(exc)
            try:
                # a dataset with this name already exists: delete it, then save the new model
                del f['self.gbm_model']
                f.create_dataset('self.gbm_model', data=frozen)
                print("Successfully deleted old model and saved new one!")
            except Exception:
                print("Old model persists...")
                print(h5_file)
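And a minimal sketch of the loading side (a hypothetical lgbm_load_model_from_h5, assuming the same dataset name as above):
def lgbm_load_model_from_h5(self):
    # Hypothetical counterpart to lgbm_save_model_to_h5: read the jsonpickled
    # string back out of the h5 file and decode it into a model object.
    with h5py.File(self.h5_file, "r") as f:
        frozen = f["self.gbm_model"][()]
    if isinstance(frozen, bytes):  # h5py may return string data as bytes
        frozen = frozen.decode("utf-8")
    self.gbm_model = jsonpickle.decode(frozen)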
Hi, I have created a GradientExplainer. What is the best way to save this explainer? Thank you.
Hi @DarioBernardo ,
I haven't used the GradientExplainer much, but from the documentation it looks like it returns either tensors, lists of tensors, or a pair of a list of tensors and a matrix.
Returns
-------
For a model with a single output this returns a tensor of SHAP values with the same shape
as X. For a model with multiple outputs this returns a list of SHAP value tensors, each of
which are the same shape as X. If ranked_outputs is None then this list of tensors matches
the number of model outputs. If ranked_outputs is a positive integer a pair is returned
(shap_values, indexes), where shap_values is a list of tensors with a length of
ranked_outputs, and indexes is a matrix that tells for each sample which output indexes
were chosen as "top".
You could try h5 files, and I think these are standard fare in several DL libraries, but ymmv.
Hi @cbeauhilton , thank you so much for your answer. Where did you find the documentation? I looked here, and I couldn't find anything about GradientExplainer, serialisation, or saving models. Regarding your answer, I am not sure I understand. I want to save the trained GradientExplainer object. I have the following piece of code:
to_explain = test_data[224:226]

def map2layer(x, layer):
    feed_dict = dict(zip([model.layers[0].input], [x]))
    return keras.backend.get_session().run(model.layers[layer].input, feed_dict)

layer_number = 11
e = shap.GradientExplainer(
    (model.layers[layer_number].input, model.layers[-1].output),
    map2layer(test_data[0:200], layer_number),
    local_smoothing=0  # std dev of smoothing noise
)
shap_values, indexes = e.shap_values(map2layer(to_explain, layer_number), ranked_outputs=1)
Building the GradientExplainer (especially on large data) takes a long time. I don't want to build it every time; I want to build it once, save it, and call shap_values many times. I am not sure where you got the documentation, but my object e does not return what is in your snippet; it looks more like what is returned by shap_values. Is there a way to save my e object?
Hi @DarioBernardo , glad to help!
I found the documentation here, starting around line 102: https://github.com/slundberg/shap/blob/master/shap/explainers/gradient.py
I get what you mean about wanting to call shap_values many times and compute only once. If the return from e is similar to what is returned by shap_values in other cases, i.e. a numpy array, you can write these any number of ways. I like the h5 format because I can shove a ton of stuff in it, and you can write to h5 files in ways that work in many environments. If you don't care about portability beyond your Python workspace, pickles work for almost everything without much fuss, as they can take arbitrary Python objects (almost) all of the time. See what happens if you do something like:
try:
    import cPickle as pickle
except BaseException:
    import pickle

...

file_path0 = "path/to/wherever/you/want/your/val_file.pkl"
file_path1 = "path/to/wherever/you/want/your/index_file.pkl"

with open(file_path0, "wb") as f:
    pickle.dump(shap_values, f)
with open(file_path1, "wb") as f:
    pickle.dump(indexes, f)
and then load with something like:
with open(file_path0, 'rb') as f:
    shap_values = pickle.load(f)
etc.
EDIT: Just poked around a little more; pickle might not work well with tensors. If the above doesn't work, try the h5 option, using something like what is found here (it's for numpy arrays, but the basic syntax is the same): https://stackoverflow.com/questions/20928136/input-and-output-numpy-arrays-to-h5py. TensorFlow and others use h5 files for saving tensors, so hopefully it will play nicely.
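For reference, the basic h5py pattern from that Stack Overflow answer looks roughly like this (a minimal sketch; the file and dataset names are illustrative, and it assumes shap_values is array-like):
import h5py
import numpy as np

arr = np.asarray(shap_values)

# write the array to an h5 file
with h5py.File("shap_values.h5", "w") as f:
    f.create_dataset("shap_values", data=arr)

# read it back
with h5py.File("shap_values.h5", "r") as f:
    loaded = f["shap_values"][:]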
Also, if the pickle option doesn't work, let me know and I'll clean up this comment so it doesn't confuse anyone in the future.
Hi @cbeauhilton , thanks for helping out. Unfortunately, neither pickle nor h5 works. Here is my code:
layer_number = 11
e = shap.GradientExplainer(
    (model.layers[layer_number].input, model.layers[-1].output),
    map2layer(test_data[1:20], layer_number),
    local_smoothing=0  # std dev of smoothing noise
)

with open('explainer.pkl', "wb") as f:
    pickle.dump(e, f)
When I try this code I get TypeError: can't pickle _thread.RLock objects.
I have also tried h5py, but it doesn't work either. I may be missing something here, as I couldn't find the right way to serialise a generic object (rather than a dataset). Anyway, here is my code:
layer_number = 11
e = shap.GradientExplainer(
    (model.layers[layer_number].input, model.layers[-1].output),
    map2layer(test_data[1:20], layer_number),
    local_smoothing=0  # std dev of smoothing noise
)

h5f = h5py.File('explainer.h5', 'w')
h5f.create_dataset('explainer', data=e)
h5f.close()
the error I get is
dataset.py, line 118, in make_new_dset
tid = h5t.py_create(dtype, logical=1)
File "h5py/h5t.pyx", line 1630, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1652, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1707, in h5py.h5t.py_create
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
Thanks
Bummer! I'll clean up my old comment when I'm off mobile. Maybe I'll get a chance to play with this soon, but I haven't had a project with a need for the GradientExplainer yet.
What about pickling/h5-ing the desired output of e, rather than e itself? I.e. the shap values and indexes.
I am trying to avoid recreating the GradientExplainer every time I need to get SHAP values and indexes. I am using this within an application, not a Jupyter notebook. Keras offers model.save(), which internally serialises the state to h5, and keras.models.load_model() for the opposite operation; maybe I could open a ticket to ask for this feature? @slundberg
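(For reference, the Keras pattern being described is roughly this; the file name is illustrative:)
model.save("model.h5")  # serialises architecture + weights to h5
model = keras.models.load_model("model.h5")  # restores the model later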
@DarioBernardo Do you need the GradientExplainer itself? Or just the output? I can't think of any situations in which I'd be more interested in the GradientExplainer (e) than its output (shap_values, indexes). Did you try saving shap_values and indexes, rather than e?
I don't use Jupyter notebooks very often either, just vanilla .py files, and have had success saving the output of my explainers (but not the explainers themselves) to disk and polling this output whenever needed. I discard the explainer after getting the output.
Hi @cbeauhilton , yes I do need the GradientExplainer itself. I have a script that loads a Keras model, runs classification, and returns the result. I want to run the GradientExplainer alongside this model and return both the classification and the explanation: when I have a new picture, I run the script, the script loads both the model and the GradientExplainer, and it returns both the classification and the explanation. I don't want to train the explainer every time I need a new explanation; it would take too long. I am aware I can save the explanation afterwards, but I want to reduce the time it takes to produce the explanation. How can I do this? In the following code:
e = shap.GradientExplainer(
    (model.layers[layer_number].input, model.layers[-1].output),
    map2layer(test_data, layer_number),
    local_smoothing=0  # std dev of smoothing noise
)
if test_data is very large, creating the GradientExplainer will take too long.
How important is it to have a large number of background examples when building the GradientExplainer? Does more data mean a more accurate explanation? Does this impact the amount of memory used? I am also having problems with memory usage.
Hi @DarioBernardo a few thoughts:
- Are you using PyTorch or TensorFlow?
- What version of shap are you using?
- There should not be much time spent constructing the GradientExplainer, because it does not train anything during construction. At least for TF all the computation graphs are built lazily I think.
- Having more background samples than nsamples is not needed. But for GPU memory you might have better luck reducing the batch_size parameter.
Hope that helps!
Hi @slundberg , thank you for your answer. Here are my answers to your points:
1 - TensorFlow; 2 - shap==0.29.1
4 - If I understand correctly, I can reduce the batch_size param when I init the GradientExplainer; great, I will try that. Stupid question: what do you mean by nsamples? Are you referring to the number of images I am trying to explain? I am trying to explain just one image; does this mean that the batch_size and the test_data (from the code in previous comments) can be just 1? In your code snippet example on the project home page, the GradientExplainer is called on the whole X, containing the whole imagenet50. I am a bit confused. Thanks a lot.
Ah, nsamples is just the number of gradient evaluations GradientExplainer runs, and it is an argument to the shap_values function of GradientExplainer. GradientExplainer uses a sampling approach that usually needs a few hundred samples (200 by default). Running all 200 samples at the same time could run you out of memory, so samples are run in batches. The batch size is 50 by default, but you could set it to 10 to reduce your GPU memory needs by a factor of 5.
@DarioBernardo Did you find a way to save the GradientExplainer?
I am using the latest version of SHAP (0.35.0) and I get the same error: TypeError: can't pickle _thread.RLock objects.
Hi @slundberg, thanks for the useful method! I was wondering, how do I save a DeepExplainer?
Hi @DarioBernardo , did you find a way to save your GradientExplainer? I'm in the same situation with DeepExplainer: the same error on pickle and h5.
I see the same error when trying to pickle a tf.keras Sequential model, so I suppose the errors are related. I see here (https://github.com/tensorflow/tensorflow/issues/34697) they're talking about workarounds for pickling the models themselves. The workarounds work for me for the models, but not the explainers. I don't understand the internals well enough, unfortunately, but perhaps someone here can figure out if there's a way to apply those fixes to the explainers? :D
cheers
Hi @philmassie , no, unfortunately I didn't find a solution to this problem. I agree: as suggested in the linked discussion you can pickle the model, but that's different from saving this specific type of explainer.
Sorry to re-bump - was wondering & wanted to clarify, is there interest from the maintainers in having explainer persistence APIs (saving/loading) added to SHAP? We're interested in adding autologging of SHAP-generated plots and explainers to MLflow (https://github.com/mlflow/mlflow), where it seems useful to persist explainers to allow computing explanations on new data in the future.
It sounds like it may be fundamentally difficult, as in many cases SHAP explainers need to maintain a reference to the original model, which may not be picklable. Is that right?
Thanks!
@smurching great question. Short answer, yes, serialization would be great and is tricky sometimes (as this issue highlights).
Explainers do indeed need to maintain a reference to the original model since they need to execute it during the explanation computation (the exception being some tree and linear explainer settings which load the entire model into SHAP and discard the original object).
Support in MLFlow would be great, and should probably align with the API that is in 0.36.0 (the API is still getting a last bit of tire-kicking before we update the docs, hence the expectation people are still using the previous API). The new API makes every explainer a subclass of shap.Explainer, and introduces a new explanation object shap.Explanation that allows nice parallel slices (see https://github.com/slundberg/shap/blob/master/notebooks/plots/bar.ipynb for example). We worked to make the baseline version of this new API jointly work with the InterpretML project, so that other packages can also support the same API.
So to make serialization happen we will want to be able to serialize both explainers and explanations.
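For anyone reading later: if your shap release includes the new API's serialization support, saving and restoring an explainer looks roughly like this (a minimal sketch, assuming a recent shap version where the Explainer base class has save/load methods and that the wrapped model type is supported; the file name is illustrative):
# save (assumes shap.Explainer.save exists in your release)
with open("explainer.bin", "wb") as f:
    explainer.save(f)

# load
with open("explainer.bin", "rb") as f:
    explainer = shap.Explainer.load(f)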
Hi, I saved a LightGBM model, and when someone else loads the saved model and uses it for TreeExplainer, the SHAP values end up being different for a given test dataset. Is there something that can be done so that whoever uses the saved model gets the same SHAP values for the same test data?
lgbm_model.params['objective'] = 'binary'
explainerLGBM = shap.TreeExplainer(lgbm_model)
shap_values_LGBM = explainerLGBM.shap_values(preds_df)
@slundberg So we still cannot save a trained DeepSHAP or GradientSHAP explainer? I think this creates consistency issues for online inference, since each time we use different background data to build the explainer before doing inference.
Hi, I saved the SHAP explainer and the Shapley values using pickle, a module that serializes/deserializes Python object structures. It's really fast to save and load the SHAP objects.
from sklearn.ensemble import IsolationForest
import pickle
import shap

iforest = IsolationForest(n_estimators=100, max_samples='auto', contamination='auto',
                          max_features=13, bootstrap=False, n_jobs=-1, random_state=42)
iforest.fit(X)

explainer = shap.Explainer(iforest.predict, X)
shap_values = explainer(X)

filename_expl = 'explainer.sav'
pickle.dump(explainer, open(filename_expl, 'wb'))
load_explainer = pickle.load(open(filename_expl, 'rb'))
print(load_explainer)

filename = 'shapvalues.sav'
pickle.dump(shap_values, open(filename, 'wb'))
load_shap_values = pickle.load(open(filename, 'rb'))
print(load_shap_values)
Thanks @eugeniaring! That solution was the best for saving the SHAP values!