save explainer?

Open MadsJensen opened this issue 6 years ago • 42 comments

is there a way to save an estimated explainer?

I tried saving it using joblib but got an error.

Best, Mads

MadsJensen avatar Oct 23 '18 10:10 MadsJensen

Which kind of explainer are you trying to save? TreeExplainer used on a LightGBM or XGBoost model keeps a reference to the original model object, which is probably not python serializable.

slundberg avatar Oct 24 '18 21:10 slundberg

Just a general explainer from a logistic regression model

MadsJensen avatar Oct 31 '18 16:10 MadsJensen

So a KernelExplainer object? I would have thought that it would pickle fine but I have not tried.

slundberg avatar Nov 02 '18 23:11 slundberg

Hi @MadsJensen, just curious: what kind of error do you get? The example below works (scikit-learn 0.19.2 / shap 0.25.1).

from sklearn.model_selection import train_test_split
from sklearn import linear_model
import joblib  # standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
import numpy as np
import shap

X, y = shap.datasets.diabetes()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

lin_regr = linear_model.LinearRegression()
lin_regr.fit(X_train, y_train)

ex = shap.KernelExplainer(lin_regr.predict, X_train)
shap_values = ex.shap_values(X_test.iloc[0, :])

ex_filename = 'explainer.bz2'
joblib.dump(ex, filename=ex_filename, compress=('bz2', 9))
ex2 = joblib.load(filename=ex_filename)
shap_values2 = ex2.shap_values(X_test.iloc[0, :])

assert np.array_equal(shap_values, shap_values2) 

dejori avatar Nov 13 '18 02:11 dejori

Hi, sorry I can see my reply got lost.

I do not get an error with the above example, and I have tried to recreate my own error without success. I apologise for this.

MadsJensen avatar Dec 13 '18 07:12 MadsJensen

> Which kind of explainer are you trying to save? TreeExplainer used on a LightGBM or XGBoost model keeps a reference to the original model object, which is probably not python serializable.

Understood. But having to run the SHAP explainer over a large amount of data every time I restart the kernel is a bit of a pain. Do you think there is any other way out?

DSXiangLi avatar Mar 17 '19 02:03 DSXiangLi

@DSXiangLi I have a bit of an obsession with saving all the work every step of the way, because (as you said) running the explainer at every kernel restart is a bit of a pain, especially with large data. I use a combination of jsonpickled sklearn models and pandas dataframes, all combined into an h5 file. I also do a silly thing where I package up some key metadata into pd dataframes, just so it's all together (e.g. the SHAP expected value, the classification threshold I used, the name of what I was trying to predict).

jsonpickle is a nice library for this, especially if you are working with the sklearn versions of, e.g., LightGBM, which do not have JSON output built in and cannot be stored in h5 without some kind of compatible serialization.

Here's an example for saving the SHAP values (doesn't require jsonpickle). I save more than this, but the basic syntax is the same:

First, generate the explainer:
        explainer = shap.TreeExplainer(gbm_model)
        features_shap = features.sample(n=20000, random_state=seed, replace=False)
        shap_values = explainer.shap_values(features_shap)
        shap_expected = explainer.expected_value

    def shap_save_to_disk(self):
        print("Saving SHAP to .h5 file...")
        h5_file = self.h5_file
        shap_val_df = pd.DataFrame(self.shap_values) #this step is unnecessary, could just save np arrays directly, but the df have some advantages and I don't mind reconverting upon loading
        shap_feat_df = pd.DataFrame(self.features_shap)
        # define what goes in the first row with "d"
        d = [[self.target, self.name_for_figs, self.shap_expected, self.class_thresh]]
        exp_df = pd.DataFrame(
            d, columns=("target", "name_for_figs", "shap_exp_val", "class_thresh")
        )
        shap_val_df.to_hdf(h5_file, key="shap_values", format="table")
        shap_feat_df.to_hdf(h5_file, key="features_shap", format="table")
        exp_df.to_hdf(h5_file, key="shap_expected_value", format="table")

Here's an example for saving the model (requires jsonpickle), also overwrites any old models in the same file:

    def lgbm_save_model_to_h5(self):
        # Requires: import h5py, jsonpickle
        print("JSONpickling the model...")
        frozen = jsonpickle.encode(self.gbm_model)
        print("Saving model to .h5 file...")
        h5_file = self.h5_file
        with h5py.File(h5_file, 'a') as f:
            # Use the same key for deleting and creating, so an old model
            # is actually replaced rather than left behind.
            if "gbm_model" in f:
                del f["gbm_model"]
                print("Deleted old model.")
            f.create_dataset("gbm_model", data=frozen)
            print("Saved new model.")
        print(h5_file)

cbeauhilton avatar May 22 '19 11:05 cbeauhilton

Hi, I have created a GradientExplainer. What is the best way to save this explainer? thank you.

DarioBernardo avatar May 25 '19 18:05 DarioBernardo

Hi @DarioBernardo ,

I haven't used the GradientExplainer much, but from the documentation it looks like it returns either tensors, lists of tensors, or a pair of a list of tensors and a matrix.

        Returns
        -------
        For a models with a single output this returns a tensor of SHAP values with the same shape
        as X. For a model with multiple outputs this returns a list of SHAP value tensors, each of
        which are the same shape as X. If ranked_outputs is None then this list of tensors matches
        the number of model outputs. If ranked_outputs is a positive integer a pair is returned
        (shap_values, indexes), where shap_values is a list of tensors with a length of
        ranked_outputs, and indexes is a matrix that tells for each sample which output indexes
        were chosen as "top".

You could try h5 files, and I think these are standard fare in several DL libraries, but ymmv.

cbeauhilton avatar May 27 '19 12:05 cbeauhilton

Hi @cbeauhilton , thank you so much for your answer. Where did you find the documentation? I looked here, and I couldn't find anything about GradientExplainer, serialisation or saving models. Regarding your answer, I am not sure I understand. I want to save the trained GradientExplainer object. I have the following piece of code

to_explain = test_data[224:226]

def map2layer(x, layer):
    feed_dict = dict(zip([model.layers[0].input], [x]))
    return keras.backend.get_session().run(model.layers[layer].input, feed_dict)


layer_number = 11
e = shap.GradientExplainer(
    (model.layers[layer_number].input, model.layers[-1].output),
    map2layer(test_data[0:200], layer_number),
    local_smoothing=0 # std dev of smoothing noise
)
shap_values, indexes = e.shap_values(map2layer(to_explain, layer_number), ranked_outputs=1)

Building the GradientExplainer (especially on large data) takes a long time. I don't want to build it every time; I want to build it once, save it, and call shap_values many times. I am not sure where you got the documentation, but my object e does not return what is in your snippet; that looks more like what is returned by shap_values. Is there a way to save my e object?

DarioBernardo avatar May 28 '19 09:05 DarioBernardo

Hi @DarioBernardo , glad to help!

I found the documentation here, starting around line 102: https://github.com/slundberg/shap/blob/master/shap/explainers/gradient.py

I get what you mean about wanting to call shap_values many times and compute only once. If the return from e is similar to what is returned by shap_values in other cases, i.e. a numpy array, you can write it any number of ways. I like the h5 format because I can shove a ton of stuff in it and you can write to it in ways that work in many environments. If you don't care about portability beyond your Python workspace, pickles work for almost everything without much fuss, as they can take arbitrary Python objects (almost) all of the time. See what happens if you do something like:

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle

...

file_path0 = "path/to/wherever/you/want/your/val_file.pkl"
file_path1 = "path/to/wherever/you/want/your/index_file.pkl"

with open(file_path0, "wb") as f:
    pickle.dump(shap_values, f)

with open(file_path1, "wb") as f:
    pickle.dump(indexes, f)

and then load with something like:

with open(file_path0, 'rb') as f:
    shap_values = pickle.load(f)

etc.

EDIT: Just poked around a little more, pickle might not work well with tensors. If the above doesn't work, try the h5 option, using something like what is found here (it's for numpy arrays, but the basic syntax is the same): https://stackoverflow.com/questions/20928136/input-and-output-numpy-arrays-to-h5py. TensorFlow and others use h5 files for saving tensors, so hopefully it will play nicely.

Also, if the pickle option doesn't work, let me know and I'll clean up this comment so it doesn't confuse anyone in the future.

cbeauhilton avatar May 30 '19 12:05 cbeauhilton

Hi @cbeauhilton, thanks for helping out. Unfortunately neither pickle nor h5 works. Here is my code:

layer_number = 11
e = shap.GradientExplainer(
        (model.layers[layer_number].input, model.layers[-1].output),
        map2layer(test_data[1:20], layer_number),
        local_smoothing=0 # std dev of smoothing noise
    )

with open('explainer.pkl', "wb") as f:
    pickle.dump(e, f)

When I try this code I get TypeError: can't pickle _thread.RLock objects. I have also tried h5py, which doesn't work either, though I may be missing something here, as I couldn't find the right way to serialise a generic object (rather than a dataset). Anyway, here is my code:

layer_number = 11
e = shap.GradientExplainer(
        (model.layers[layer_number].input, model.layers[-1].output),
        map2layer(test_data[1:20], layer_number),
        local_smoothing=0 # std dev of smoothing noise
    )

h5f = h5py.File('explainer.h5', 'w')
h5f.create_dataset('explainer', data=e)
h5f.close()

the error I get is

dataset.py, line 118, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1630, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1652, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1707, in h5py.h5t.py_create
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
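
The pickle error reported above can be reproduced without TensorFlow or shap, which may help show where it comes from. This is only a minimal sketch: the `model` and `explainer` objects below are hypothetical stand-ins (plain namespaces), on the assumption that the real model holds a thread lock somewhere inside it and that the explainer keeps a reference to that model.

```python
import pickle
import threading
from types import SimpleNamespace

# Hypothetical stand-ins: a "model" that owns a lock, and an "explainer"
# that keeps a reference to it plus some plain background data.
model = SimpleNamespace(_lock=threading.RLock())
explainer = SimpleNamespace(model=model, background=[[0.0, 1.0], [1.0, 0.0]])

try:
    pickle.dumps(explainer)
    pickle_error = None
except TypeError as exc:
    # pickle walks explainer -> model -> _lock and fails on the lock
    pickle_error = exc

print(type(pickle_error).__name__, pickle_error)

# The plain data held by the explainer pickles without trouble.
restored_background = pickle.loads(pickle.dumps(explainer.background))
```

If this is what is going on, the object that fails is the lock nested inside the model, not the explainer's own data.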

Thanks

DarioBernardo avatar May 31 '19 09:05 DarioBernardo

Bummer! I'll clean up my old comment when I'm off mobile. Maybe I'll get a chance to play with this soon, but I haven't had a project with a need for the GradientExplainer yet.

What about pickling/h5-ing the desired output of e, rather than e itself? I.e. the shap values and indexes.

cbeauhilton avatar May 31 '19 19:05 cbeauhilton

I am trying to avoid recreating the GradientExplainer every time I need to get shap values and indexes. I am using this within an application, not a Jupyter notebook. Keras offers model.save(), which internally translates the state into h5, and load_model() for the opposite operation. Maybe I could open a ticket to ask for this feature? @slundberg

DarioBernardo avatar Jun 01 '19 08:06 DarioBernardo

@DarioBernardo Do you need the GradientExplainer itself? Or just the output? I can't think of any situations in which I'd be more interested in the GradientExplainer (e) than its output (shap_values, indexes). Did you try saving shap_values and indexes, rather than e?

I don't use jupyter notebooks very often either, just vanilla .py files, and have had success saving the output of my explainers (but not the explainers themselves) to disk and polling this output whenever needed. I discard the explainer after getting the output.
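
A minimal sketch of that workflow, using only the standard library: compute the output once, write it to disk, and read it back on later runs. The helper `load_or_compute` and the placeholder `expensive_explanation` are hypothetical names for illustration; the placeholder stands in for a call such as `e.shap_values(...)` and returns made-up numbers.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the cached result at `path`, or call `compute()` and cache it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive_explanation():
    # Placeholder for the real explainer call; the values are made up.
    calls.append(1)
    return {"shap_values": [[0.1, -0.2, 0.3]], "indexes": [[2]]}


cache_path = os.path.join(tempfile.mkdtemp(), "shap_output.pkl")
first = load_or_compute(cache_path, expensive_explanation)   # computes and saves
second = load_or_compute(cache_path, expensive_explanation)  # loads from disk
```

This only helps when the same inputs are explained again; it does not avoid rebuilding the explainer for new data.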

cbeauhilton avatar Jun 04 '19 15:06 cbeauhilton

Hi @cbeauhilton, yes I do need the GradientExplainer itself. I have a script that loads a Keras model, runs classification and returns the result. I want to run the GradientExplainer along with this model and return both the classification and the explanation. When I have a new picture, I run the script; the script loads both the model and the GradientExplainer and returns both the classification and the explanation. I don't want to train the explainer every time I need a new explanation. It would take too long. I am aware I can save the explanation afterwards, but I want to reduce the time it takes to produce the explanation. How can I do this? From the following code

e = shap.GradientExplainer(
        (model.layers[layer_number].input, model.layers[-1].output),
        map2layer(test_data, layer_number),
        local_smoothing=0 # std dev of smoothing noise
    )

if test_data is very large, creating the GradientExplainer will take too long.

DarioBernardo avatar Jun 04 '19 16:06 DarioBernardo

How important is it to have a large number of training examples when building the GradientExplainer? Does more data give a more accurate explanation? Does this impact the amount of memory used? I am also having problems with memory usage.

DarioBernardo avatar Jun 10 '19 09:06 DarioBernardo

Hi @DarioBernardo a few thoughts:

  1. Are you using PyTorch or TensorFlow?
  2. What version of shap are you using?
  3. There should not be much time spent constructing the GradientExplainer, because it does not train anything during construction. At least for TF all the computation graphs are built lazily I think.
  4. Having more background samples than nsamples is not needed. But for GPU memory you might have better luck reducing the batch_size parameter.

Hope that helps!

slundberg avatar Jun 19 '19 22:06 slundberg

Hi @slundberg, thank you for your answer. Here are my replies to your points.

1 - TensorFlow
2 - shap==0.29.1

4 - If I understand correctly, I can reduce the batch_size param when I init the gradient explainer. Great, I will try that. Stupid question: what do you mean by nsamples? Are you referring to the number of images I am trying to explain? I am trying to explain just one image; does this mean that the batch_size and the test_data (from the code in previous comments) can be just 1? In your code snippet example on the project home page, the GradientExplainer is called on the whole X, containing the whole imagenet50. I am a bit confused. Thanks a lot.

DarioBernardo avatar Jun 20 '19 08:06 DarioBernardo

Ah, nsamples is just the number of gradient evaluations GradientExplainer runs and is an argument to the shap_values function of GradientExplainer. GradientExplainer uses a sampling approach that needs a few hundred samples usually (by default 200). Running all 200 samples at the same time could run you out of memory, so samples are run in batches. Batch size is 50 by default but you could set it to 10 to reduce your GPU memory needs by a factor of 5.
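
To make the arithmetic above concrete, here is a minimal sketch of how 200 gradient evaluations split into batches. The helper `batch_sizes` is hypothetical and only mirrors the numbers in this comment; it is not shap's internal loop.

```python
def batch_sizes(nsamples=200, batch_size=50):
    """Split `nsamples` evaluations into chunks of at most `batch_size`."""
    return [min(batch_size, nsamples - start)
            for start in range(0, nsamples, batch_size)]


default_batches = batch_sizes(200, 50)
small_batches = batch_sizes(200, 10)

print(len(default_batches), max(default_batches))  # 4 50
print(len(small_batches), max(small_batches))      # 20 10
```

The total number of evaluations stays at 200 in both cases; only the number held in memory at once changes, from 50 down to 10, which is where the factor of 5 comes from.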

slundberg avatar Jun 22 '19 22:06 slundberg

@DarioBernardo Did you find a way to save the GradientExplainer? I am using the latest version of SHAP (0.35.0) and I get the same error TypeError: can't pickle _thread.RLock objects

manjiler avatar Mar 15 '20 13:03 manjiler

Hi @slundberg, Thanks for the useful method! I was wondering how to save deepExplainer?

sepidehhosseinzadeh avatar Apr 08 '20 15:04 sepidehhosseinzadeh

Hi @DarioBernardo Did you find a way to save your gradientexplainer? I'm in the same situation with deepexplainer, same error on pickle and h5.

I see the same error when trying to pickle a tf.keras Sequential model, so I suppose the errors are related. I see here (https://github.com/tensorflow/tensorflow/issues/34697) they're talking about workarounds for pickling the models themselves. The workarounds work for me for the models, but not the explainers. I don't understand the internals well enough unfortunately, but perhaps someone here can figure out if there's a way to apply those fixes to the explainers? :D

cheers

philmassie avatar Jul 10 '20 12:07 philmassie

Hi @philmassie, no, unfortunately I didn't find a solution to this problem. I agree: as suggested in the linked discussion you can pickle the model, but that's different from saving this specific type of explainer.

DarioBernardo avatar Jul 10 '20 14:07 DarioBernardo

Sorry to re-bump - was wondering & wanted to clarify, is there interest from the maintainers in having explainer persistence APIs (saving/loading) added to SHAP? We're interested in adding autologging of SHAP-generated plots and explainers to MLflow (https://github.com/mlflow/mlflow), where it seems useful to persist explainers to allow computing explanations on new data in the future.

It sounds like it may be fundamentally difficult, as in many cases SHAP explainers need to maintain a reference to the original model, which may not be picklable. Is that right?

Thanks!

smurching avatar Sep 09 '20 18:09 smurching

@smurching great question. Short answer, yes, serialization would be great and is tricky sometimes (as this issue highlights).

Explainers do indeed need to maintain a reference to the original model since they need to execute it during the explanation computation (the exception being some tree and linear explainer settings which load the entire model into SHAP and discard the original object).

Support in MLFlow would be great, and should probably align with the API that is in 0.36.0 (the API is still getting a last bit of tire-kicking before we update the docs, hence the expectation people are still using the previous API). The new API makes every explainer a subclass of shap.Explainer, and introduces a new explanation object shap.Explanation that allows nice parallel slices (see https://github.com/slundberg/shap/blob/master/notebooks/plots/bar.ipynb for example). We worked to make the baseline version of this new API jointly work with the InterpretML project, so that other packages can also support the same API.

So to make serialization happen we will want to serialize both explainers and explanations.
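
One way to picture the constraint described above (an explainer holding a reference to a model that may not pickle) is to store the explainer's plain state separately and re-attach the model after loading. This is a sketch under stated assumptions, not shap's API: `explainer_state` and `rebuild_explainer` are hypothetical helpers, and the `model` and `explainer` objects are plain namespaces standing in for the real ones.

```python
import pickle
import threading
from types import SimpleNamespace

# Hypothetical stand-ins: the lock makes `model` unpicklable.
model = SimpleNamespace(weight=2.0, _lock=threading.RLock())
explainer = SimpleNamespace(model=model, background=[0.0, 1.0, 2.0], nsamples=200)


def explainer_state(explainer):
    """Collect everything except the model reference."""
    return {k: v for k, v in vars(explainer).items() if k != "model"}


def rebuild_explainer(state, model):
    """Recreate the explainer-like object around a separately loaded model."""
    return SimpleNamespace(model=model, **state)


blob = pickle.dumps(explainer_state(explainer))  # succeeds: no lock inside
restored = rebuild_explainer(pickle.loads(blob), model)
```

In practice the model itself would be saved and loaded with its own framework's tools, and only then handed back to the rebuilt explainer.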

slundberg avatar Sep 11 '20 20:09 slundberg

Hi, I saved an LGBM model. When someone else loads the saved model and uses it for TreeExplainer, the shap values end up being different for the same test data. Is there something that can be done so that whoever uses the saved model gets the same shap values for the same test data?

lgbm_model.params['objective'] = 'binary'
explainerLGBM=shap.TreeExplainer(lgbm_model)
shap_values_LGBM= explainerLGBM.shap_values(preds_df)

sosyete27 avatar Sep 18 '20 16:09 sosyete27

@slundberg So we still cannot save a trained DeepSHAP or GradientSHAP explainer? I think this would create inconsistency issues for online inference, since each time we use different background data to train the explainer and then do the inference.
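
If the inconsistency comes from drawing a different background sample on each run, one partial mitigation is to make that draw reproducible. The sketch below is an assumption-laden illustration: `pick_background` is a hypothetical helper, the seed value is arbitrary, and it only fixes which rows are chosen; it does not control any randomness inside the explainer itself.

```python
import random


def pick_background(n_rows, k, seed=0):
    """Choose k row indices reproducibly; `seed` is an arbitrary fixed value."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), k))


run_1 = pick_background(n_rows=10_000, k=100)
run_2 = pick_background(n_rows=10_000, k=100)
```

Saving the chosen indices (or the sampled rows) alongside the model would let every process rebuild the explainer from the same background data.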

shirley020 avatar Apr 27 '21 19:04 shirley020

Hi, I saved the SHAP explainer and the Shapley values using pickle, a module that serializes and deserializes Python object structures. It's really fast to save and load the SHAP objects.

import pickle

import shap
from sklearn.ensemble import IsolationForest

# X is the feature matrix (defined elsewhere)
iforest = IsolationForest(n_estimators=100, max_samples='auto', contamination='auto',
                          max_features=13, bootstrap=False, n_jobs=-1, random_state=42)
iforest.fit(X)

explainer = shap.Explainer(iforest.predict, X)
shap_values = explainer(X)

# Save and reload the explainer
filename_expl = 'explainer.sav'
with open(filename_expl, 'wb') as f:
    pickle.dump(explainer, f)
with open(filename_expl, 'rb') as f:
    load_explainer = pickle.load(f)
print(load_explainer)

# Save and reload the SHAP values
filename = 'shapvalues.sav'
with open(filename, 'wb') as f:
    pickle.dump(shap_values, f)
with open(filename, 'rb') as f:
    load_shap_values = pickle.load(f)
print(load_shap_values)




eugeniaring avatar Jun 03 '21 12:06 eugeniaring

Thanks @eugeniaring! That solution was the best for saving the SHAP values!

NicolasCHG avatar Jul 12 '21 00:07 NicolasCHG