TreeExplainer on LightGBMClassifier returns 2D array of shap values in binary classification case

Open imatiach-msft opened this issue 6 years ago • 16 comments

In the binary classification case, TreeExplainer with scikit-learn models returns the shap values as a 3D array, where the 1st dimension is the class, the 2nd is rows and the 3rd is columns. However, with LightGBMClassifier in the binary classification case a 2D array is returned (just rows/columns, no negative/positive classes). In the multiclass case LightGBMClassifier does return a 3D array, with a 2D array of shap values for each class.

This is fine from a correctness perspective, since we just need to take the negative of the 2D array for the negative class, e.g. return:

shap_values = [-binary_shap_values, binary_shap_values]

But it is inconsistent with what the other binary classification learners return, e.g. scikit-learn. It looks like the issue may need to be fixed in lightgbm native code and not shap. Was there a specific reason that the API is inconsistent here, and what would be the preferred fix? Should this be fixed in lightgbm's codebase or shap's codebase?
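A minimal sketch of the workaround described above, assuming the 2D array holds contributions toward the positive class in the model's margin (log-odds) space. The helper name expand_binary_shap_values and the toy numbers are hypothetical, not part of shap or lightgbm:

```python
import numpy as np


def expand_binary_shap_values(binary_shap_values):
    """Return per-class SHAP values as [negative_class, positive_class].

    Assumes a 2D (rows, features) array of contributions toward the
    positive class; the negative class gets the negated contributions.
    """
    values = np.asarray(binary_shap_values, dtype=float)
    if values.ndim != 2:
        raise ValueError("expected a 2D (rows, features) array")
    return [-values, values]


# Toy data: 3 rows, 2 features (made-up numbers, not real model output).
binary_shap_values = np.array([[0.2, -0.1], [0.0, 0.4], [-0.3, 0.1]])
shap_values = expand_binary_shap_values(binary_shap_values)
```

If the explainer's expected value is also reported per class, it would presumably need the same sign flip for the negative class.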

imatiach-msft avatar Apr 02 '19 19:04 imatiach-msft

I was thinking of adding this fix directly to https://github.com/Microsoft/LightGBM, but I wanted to make sure the issue is actually in the native code there and not shap.

imatiach-msft avatar Apr 02 '19 21:04 imatiach-msft

Thanks for pointing this out. This is the same with XGBoost and was originally driven by the format there. I think the "right" convention is to match the output of the model. So if the model object outputs a 2D array when applied to a dataset then the shap_values should be 3D (a list of 2D arrays). But if the model object outputs a 1D array then the shap_values should be 2D.

The tricky part with LightGBM (and XGBoost) is that they act differently when you use the sklearn API vs. the native API. So this would mean we need to remember what kind of object we were given and then correctly expand to the 3D case when we are given an sklearn API object. This could be done inside LightGBM or inside SHAP.
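One way to picture the "match the output of the model" convention is the sketch below. It is only an illustration under stated assumptions: match_shap_to_model_output is a hypothetical helper, the inputs are toy arrays rather than real model output, and only the 1D and two-column cases are covered:

```python
import numpy as np


def match_shap_to_model_output(model_output, shap_2d):
    """Shape SHAP values to follow the shape of the model's own output.

    1D output (native API style)     -> keep the 2D SHAP array.
    2D two-column output (sklearn)   -> list of two 2D arrays, one per class.
    """
    model_output = np.asarray(model_output)
    shap_2d = np.asarray(shap_2d, dtype=float)
    if model_output.ndim == 1:
        return shap_2d
    if model_output.ndim == 2 and model_output.shape[1] == 2:
        return [-shap_2d, shap_2d]
    raise ValueError("this sketch only covers 1D and two-column outputs")


# Toy stand-ins for predictions on 2 rows (made-up numbers).
native_output = np.array([0.7, 0.2])                                 # shape (2,)
sklearn_output = np.column_stack([1 - native_output, native_output])  # shape (2, 2)
shap_2d = np.array([[0.3, 0.1], [-0.5, 0.2]])

native_result = match_shap_to_model_output(native_output, shap_2d)
sklearn_result = match_shap_to_model_output(sklearn_output, shap_2d)
```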

Thoughts?

slundberg avatar Apr 02 '19 22:04 slundberg

@slundberg sorry, to be sure I understand: by "outputs a 1D or 2D array" you mean the predicted value (regression) or the predicted probabilities per class (classification)?

"But if the model object outputs a 1D array then the shap_values should be 2D."

I was actually thinking more in terms of consistency with the scikit-learn model feature importances, which are always 3D for binary/multiclass classification and 2D for regression. I can see why you would look at consistency with the predicted values, though.

This seems like a bug to me:

"LightGBM (and XGBoost) is that they act differently when you use the sklearn API vs. the native API"

It's bizarre that they output values differently. If the LightGBM/XGBoost APIs for returning feature importances have to return a 2D array in the binary classification case (as for regression) while returning a 3D array for multiclass, for some consistency reason, then at the very least we should reformat the output in shap's TreeExplainer to be consistent with the output for scikit-learn based models. However, I don't think that is set in stone, in which case the fix should be in LightGBM/XGBoost.

If you think we shouldn't change the original LightGBM/XGBoost repositories, it would be a really easy fix to recognize that the original learner is a binary classifier and reformat the shap values in TreeExplainer as simply [-LightGBM_shap_values, LightGBM_shap_values]. Please let me know if it's better to make the fix in shap or in the original LightGBM/XGBoost repositories.

imatiach-msft avatar Apr 03 '19 16:04 imatiach-msft

@slundberg sorry to be persistent about this issue, but I would really like to resolve this in either shap or lightgbm/xgboost.

This inconsistency makes for a bad experience: it is confusing to get the regression format of shap values for binary classification from just a subset of tree-based models. Unless there is a really good reason for it, which so far it sounds like there isn't, the ideal change would be to modify the lightgbm/xgboost feature importances for the binary classification case.

It sounds like we are leaning towards making the changes in lightgbm and xgboost repositories then?

imatiach-msft avatar Apr 11 '19 16:04 imatiach-msft

@imatiach-msft please be persistent :)

The issue is that when using the sklearn API we get an output that is 2D:

import shap
import xgboost

# Adult census dataset bundled with shap
X, y = shap.datasets.adult()

# scikit-learn style wrapper
model1 = xgboost.XGBClassifier()
model1.fit(X, y)
model1.predict_proba(X).shape

outputs: (32561, 2)

but with the standard API we get a 1D array:

# native training API, a single boosting round
model2 = xgboost.train({"objective": "binary:logistic"}, xgboost.DMatrix(X, y), 1)
model2.predict(xgboost.DMatrix(X)).shape

outputs: (32561,)

However we don't match this with shap right now since both

shap.TreeExplainer(model1).shap_values(X).shape

and

shap.TreeExplainer(model2).shap_values(X).shape

are (32561, 12).

I agree we need to fix shap.TreeExplainer(model1).shap_values(X).shape to be 3D. Are you arguing that shap.TreeExplainer(model2).shap_values(X).shape should also be 3D?

slundberg avatar Apr 11 '19 17:04 slundberg

@slundberg I think it would make sense for all shap values from TreeExplainer, no matter what the shape of the predicted probabilities is, to be consistent based on whether we are doing binary classification, multiclass classification or regression. In an ideal world all models would have the (32561, 2) shape of output. We can't fix that, but TreeExplainer should output shap values per class even in the binary classification case when using xgboost.train. Otherwise, those who depend on the shap TreeExplainer will need to special-case their code depending on the given input model.

However, if you feel strongly that models which output only a 1D array in the binary classification case should only have 2D shap values, maybe that is fine. The case that definitely does need to be fixed is then the model1 issue:

shap.TreeExplainer(model1).shap_values(X).shape

So I think there are 2 levels of bugs. The model1 case definitely needs to be fixed, and the model2 case is up for debate. I think they are both bugs but at least for model2 there is some reasoning behind the shap values output, so I'm not as persistent about fixing it.
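To illustrate the kind of special-casing that callers end up writing, here is a rough sketch of a normalizer that accepts either format. The name to_per_class_array is hypothetical, and the helper assumes the caller already knows the model is a classifier, since a 2D array on its own cannot be told apart from regression output:

```python
import numpy as np


def to_per_class_array(shap_values):
    """Normalize explainer output to shape (classes, rows, features)."""
    if isinstance(shap_values, list):
        # Already one 2D array per class.
        return np.stack([np.asarray(v, dtype=float) for v in shap_values])
    values = np.asarray(shap_values, dtype=float)
    if values.ndim == 2:
        # Assumed: binary classification, values for the positive class only.
        return np.stack([-values, values])
    if values.ndim == 3:
        return values
    raise ValueError("expected a list of 2D arrays, a 2D array or a 3D array")


# Toy inputs in both formats (made-up numbers, not real model output).
positive_only = np.array([[0.2, -0.1], [0.0, 0.4], [-0.3, 0.1]])
as_list = [-positive_only, positive_only]

from_2d = to_per_class_array(positive_only)
from_list = to_per_class_array(as_list)
```

With this, downstream code can index the class axis the same way regardless of which learner produced the values.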

imatiach-msft avatar Apr 11 '19 17:04 imatiach-msft

Makes sense. I won't be able to fix it until mid next week (Wed), but I agree that the first issue is clearly a bug. As for the second, let me get back to you.

slundberg avatar Apr 12 '19 05:04 slundberg

@slundberg I was actually hoping to fix this issue myself and just wanted some guidance as to what the ideal fix should be. I will look into the first issue to see if I can fix it easily, and would appreciate any help. Do you think the fix for the first issue should be in xgboost/lightgbm code or in shap TreeExplainer?

imatiach-msft avatar Apr 12 '19 15:04 imatiach-msft

Great, thanks! The first fix should certainly start with just TreeExplainer, since it just has to do with what type of object we were passed. Perhaps the second issue might involve the external packages.

slundberg avatar Apr 12 '19 16:04 slundberg

Is this issue fixed, i.e. is shap.TreeExplainer(model1).shap_values(X).shape now 3D?

Sreemanto avatar Aug 31 '20 08:08 Sreemanto

Yes, this issue is fixed for lightgbm in the newer versions of shap. However, it still exists for XGBoost. I have a PR to fix that too:

https://github.com/slundberg/shap/pull/1046

imatiach-msft avatar Aug 31 '20 14:08 imatiach-msft

Hi all. Are you sure the issue is fixed? I am still getting this as of today with 0.39.0 on LightGBM.

lctdulac avatar Aug 31 '21 11:08 lctdulac

Hi guys, I suppose I am getting a similar issue to yours when generating shap values with a LightGBM binary classifier:

LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray

Package versions I am using:

shap.__version__: 0.39.0
lgb.__version__ : 3.2.1

ibuda avatar Sep 07 '21 15:09 ibuda

I too came upon this warning now. It's rather confusing, since the whole warning formatting is removed here in the shap code.

Other packages that rely on shap print this warning with no clue to its origin, nor does it explain whether one must/should do something about it.

Should I be doing something about it?
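Going by the message text quoted in this thread, the output for a LightGBM binary classifier is now a list with one array per class, so code that expected a single 2D array would index the positive class explicitly. A hedged sketch of silencing only this message once that is handled, assuming shap raises it through the standard warnings module (fake_shap_values is a hypothetical stand-in, not a shap function):

```python
import warnings

# Message text as quoted in this thread.
LIGHTGBM_SHAP_MESSAGE = (
    "LightGBM binary classifier with TreeExplainer shap values output "
    "has changed to a list of ndarray"
)


def fake_shap_values():
    """Hypothetical stand-in that warns the way a shap call is assumed to."""
    warnings.warn(LIGHTGBM_SHAP_MESSAGE)
    return [[-0.1, 0.2], [0.1, -0.2]]  # [negative_class, positive_class]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # `message` is a regex matched against the start of the warning text.
    warnings.filterwarnings(
        "ignore", message="LightGBM binary classifier with TreeExplainer"
    )
    shap_values = fake_shap_values()
    warnings.warn("an unrelated warning still gets through")

# Values for the positive class under the list-of-arrays format.
positive_class_values = shap_values[1]
```

Filtering on the message text keeps other warnings visible; it only makes sense after the calling code has been updated for the list format.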

thomasaarholt avatar Jul 19 '22 21:07 thomasaarholt

I'm getting this issue too: "LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray". But XGBoost doesn't give this error.

AbdulAlim8660 avatar Aug 26 '22 14:08 AbdulAlim8660