
ENH: increase transparency of background dataset sub-sampling

Open jyliuu opened this issue 2 years ago • 7 comments

Issue Description

Given $x$, the sample that we wish to explain, we can compute the Shapley values of that sample using a background sample $x^b$. When the Explainer class is provided with background data, it should compute the Shapley values of $x$ against each sample in the background data and then take the average, which approximates the interventional SHAP values.

This averaging procedure implies that if I split my background data into two equal halves, A and B, and call the explainer on each half to obtain the averaged SHAP values a and b, then (a + b)/2 should equal the SHAP values computed on the entire background dataset.
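The identity being relied on here is just linearity of the mean over equal-sized halves, which a quick numpy check confirms for plain averages:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(100)  # stand-in for per-background-sample SHAP values
half_a, half_b = values[:50], values[50:]

# Averaging each half and then averaging the two results equals
# averaging the full array, because the halves have equal size.
assert np.isclose(values.mean(), (half_a.mean() + half_b.mean()) / 2)
```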

From my experimentation, this holds for background datasets of up to 100 samples, but becomes inconsistent beyond that: (a + b)/2 no longer equals the interventional approximation computed on the entire background dataset.

Minimal Reproducible Example

import numpy as np
import pandas as pd
import xgboost

import shap

rng = np.random.default_rng(42)
N = 1000
M = 2

X = rng.standard_normal(size=(N, M))
X[:, 0] = 0.2*X[:, 1] + X[:, 0]
y = -2*X[:, 0] + X[:, 1] + 0.5*X[:, 0]*X[:, 1]

X = pd.DataFrame(X, columns=["X1", "X2"])


model = xgboost.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=3)
model.fit(X, y)


def get_shap_values(model, X, sample):
    explainer = shap.TreeExplainer(
        model,
        X,
        feature_perturbation="interventional",
    )
    explanation = explainer(sample)

    expected_value = explanation.base_values[0]
    shap_values = explanation.values[0]
    return shap_values, expected_value

# Consistent when the background data has 100 or fewer samples


for i in range(50, 53):  # i is the number of samples in each half
    midpoint = i
    double_mid = midpoint * 2
    # shap on two halves
    shap_values1, expected_value1 = get_shap_values(model, X.loc[1:midpoint, :], X.loc[[0], :])
    shap_values2, expected_value2 = get_shap_values(model, X.loc[(midpoint+1):double_mid, :], X.loc[[0], :])
    # Shap on full background data
    shap_values, expected_value = get_shap_values(model, X.loc[1:double_mid, :], X.loc[[0], :])

    print(len(X.loc[1:midpoint, :]), len(X.loc[(midpoint+1):double_mid, :]), len(X.loc[1:double_mid, :]))
    print(shap_values, (shap_values1 + shap_values2) / 2) # inconsistent here when i > 50

Traceback

No response

Expected Behavior

In the for loop, `shap_values` should equal `(shap_values1 + shap_values2) / 2` for every `i`.

Bug report checklist

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest release of shap.
  • [X] I have confirmed this bug exists on the master branch of shap.
  • [ ] I'd be interested in making a PR to fix this bug

Installed Versions

0.44.0

jyliuu avatar Jan 18 '24 10:01 jyliuu

Thanks for the report and for your effort in investigating this. Your description is accurate, and the cause is the default `max_samples=100` in the tabular masker, which subsamples background datasets larger than 100 rows.

Here is an issue where this problem was already discussed including workaround: https://github.com/shap/shap/issues/3174.

We should probably emit at least a warning when max_samples < len(X). What do you think @connortann? This issue seems to come up repeatedly and confuses users.

CloseChoice avatar Jan 20 '24 10:01 CloseChoice

I agree with your analysis, this seems to be a consequence of sampling. I'll remove the bug label as I think this is intended behaviour.

We probably should throw at least a warning if max_samples < len(X)

I'm not sure I agree. To me, warnings generally indicate undesirable situations where the user should update their code to resolve the warning. In this case, I think subsampling is expected and desirable behaviour for the majority of users; many parts of shap are sampling-based and only offer approximate results.

Would log.info() be more appropriate?
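For concreteness, such an info-level message might look like the sketch below. The helper name and its arguments are hypothetical, not shap's actual internals:

```python
import logging

logger = logging.getLogger("shap")

def maybe_log_subsampling(n_background: int, max_samples: int) -> None:
    """Hypothetical helper: emit an info-level message when the
    background data would be subsampled by the masker."""
    if n_background > max_samples:
        logger.info(
            "Background dataset has %d rows but max_samples=%d; it will be "
            "subsampled. Pass a masker with a larger max_samples to keep all rows.",
            n_background,
            max_samples,
        )
```

Since `logging.info` messages are hidden unless the application configures logging, users who want the hint can opt in with `logging.basicConfig(level=logging.INFO)`, and everyone else sees nothing.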

connortann avatar Jan 23 '24 19:01 connortann

logging.info is fine for me as well. I would even be fine with a print, just to make sure users don't have to spend a couple of hours investigating to find the reason for the inconsistency between the values and the theory.

CloseChoice avatar Jan 23 '24 20:01 CloseChoice

I would much prefer logging over print statements, as prints are much harder to configure and disable. I think adding a print would risk annoying a large majority of shap users.

I've renamed the title accordingly to reflect the plan.

connortann avatar Jan 24 '24 13:01 connortann

I am also confused about the background dataset and would like to ask a follow-up question, if I may.

Suppose I use shap.TreeExplainer to explain predictions from my LightGBM model for a classification task. I am interested in model_output="probability", so according to the documentation I need to set feature_perturbation="interventional" and specify a background dataset. Given that I have training, validation, and test data, which of them should I draw the background dataset from? The documentation says that "anywhere from 100 to 1000 random background samples are good sizes to use", but how should I pick those samples? Should I fix the random samples so that the background dataset stays the same regardless of which dataset (train, validation, test) I explain?

jcoding2022 avatar May 11 '24 02:05 jcoding2022

This is not strictly on topic, so if you have follow-up questions to my answer, please open a discussion or search for one of the existing topics where this is already discussed.

First, I do not believe there is a single correct answer to your question; there is no real way to backtest shap values. One just has to weigh various considerations:

  • Do you want deterministic shap values? If so, fixing the background dataset makes sense.
  • The background dataset is only used to compute the baseline, so any size at which this average has converged is sufficiently large. You can test this by keeping the dataset to explain fixed, varying the background dataset, and checking how much the shap values (or, even simpler, just the expected value) change. For i.i.d. sampling and a sufficiently diverse background dataset, 100 to 1000 samples should suffice, and I would not expect much difference between train, validation, and test. If you do see large differences, I would check whether your split was done correctly.
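The convergence check in the second bullet can be sketched without shap at all, since the interventional base value is just the mean model prediction over the background data. The `model_predict` function below is a stand-in for your model's `predict`:

```python
import numpy as np

rng = np.random.default_rng(0)

def model_predict(X):
    # Stand-in for model.predict; any deterministic function works here.
    return -2 * X[:, 0] + X[:, 1]

background = rng.standard_normal(size=(5000, 2))

# The interventional base value is the mean prediction over the
# background data, so watching this mean stabilise as the sample
# grows is a cheap proxy for "is my background large enough?".
means = {n: model_predict(background[:n]).mean() for n in (100, 500, 1000, 5000)}
for n, m in means.items():
    print(n, round(m, 4))
```

Once the means stop moving as the sample grows, adding more background rows buys little extra accuracy.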

CloseChoice avatar May 11 '24 07:05 CloseChoice