Remove or rebuild ShapInspector

Open Matgrb opened this issue 3 years ago • 2 comments

Issue Description

Currently, the interpret module contains shap_inspector.py, old code that clusters the SHAP values of the data and then presents an analysis for each subgroup.

We have two options:

  • Remove the feature and unit tests completely
  • Rebuild it as SHAPClustersInterpreter or SHAPDatasetInterpreter, which would do essentially the same thing but with the consistent probatus API. We could also improve the information returned for the report and plots, e.g. add better descriptive statistics for each cluster, report the AUC of the model for each group of the data, and allow plotting SHAP summary or feature importance plots for specific groups of data only.

Personally, I think analysing SHAP clusters is a great idea, and I haven't seen it done anywhere else. This could be one of the features on our roadmap. The main issue would be the effort needed to implement it properly, in a usable way, with informative plots and analysis.

@timvink @sbjelogr @anilkumarpanda What do you think about this idea? What other types of analysis would you add to this feature?

Matgrb, Mar 26 '21 11:03

You're referring to this class, correct?

https://github.com/ing-bank/probatus/blob/d4830112cb880565b1f474c901a8c5ea39623e96/probatus/interpret/inspector.py#L139

I agree it's an interesting approach, but I'm not yet convinced it's interesting from a user's perspective (a data scientist building a model). When do you use this? What kind of things can you find? How do you act on them?

Without these questions clearly answered in the user-facing documentation, I doubt this feature fits the probatus principle that 'any tool that we build should be useful for a broad range of users'.

If you choose to remove the code, perhaps @sbjelogr or @Matgrb can move the code to a repo in their own space. If the approach matures, we can put it in probatus later.

timvink, Mar 26 '21 13:03

I suppose if you perform such an analysis you can find:

  • Groups in your data that affect the model in a certain way, e.g. samples that mainly lead to a prediction of class 0 and for which AUC or other metrics are higher.
  • Groups in the data for which the model is uncertain and AUC is much lower.
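To make the per-group AUC idea concrete, here is a minimal sketch using only the standard library. The helper names `auc` and `auc_per_cluster` and the toy data are hypothetical and not part of probatus. It also shows one practical caveat: AUC is undefined for a cluster that contains a single class, so the sketch returns `None` there.

```python
from collections import defaultdict


def auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        return None  # undefined when the group contains only one class
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_per_cluster(y_true, y_score, cluster_ids):
    """Group labels and scores by cluster id and compute AUC within each group."""
    groups = defaultdict(lambda: ([], []))
    for y, s, c in zip(y_true, y_score, cluster_ids):
        groups[c][0].append(y)
        groups[c][1].append(s)
    return {c: auc(ys, ss) for c, (ys, ss) in sorted(groups.items())}


# Toy data (made up for illustration): labels, model scores, cluster assignments.
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.8, 0.9, 0.6, 0.4, 0.5, 0.55, 0.7, 0.95]
clusters = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]

result = auc_per_cluster(y_true, y_score, clusters)
print(result)  # {0: 1.0, 1: 0.25, 2: None}
```

In this toy example cluster 0 is ranked perfectly, cluster 1 is a group where the model does worse than random, and cluster 2 contains only positives, so no AUC can be reported for it.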

Clustering SHAP values instead of the raw feature values has the benefit that you cluster the data by the effect it has on the fitted model, and therefore capture the correlations with the target. Different clusters could represent a stronger relation to a given class, but also different features that work together when predicting a given class.

You can investigate these groups further using e.g. a summary plot, and understand which relations in the data actually drive the prediction and performance.

One drawback is the difficulty of the analysis. I think it is really difficult to draw meaningful conclusions from it; you have to spend time investigating the samples from a given cluster. That is why it will indeed be useful to a very narrow range of users, unless we put a lot of effort into understanding the tool well ourselves, presenting it to users (as with e.g. the resemblance model), and documenting it well.

A related problem is that this tool is not "simple", which is another principle that we try to stick to in probatus. First you need to train a model, then evaluate it, then get SHAP values, then cluster them, and then generate outputs.
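For reference, the five steps could look roughly like the sketch below. This is not the probatus implementation. It assumes a scikit-learn logistic regression on synthetic data, and instead of calling the shap library it uses the closed-form SHAP values of a linear model under the assumption of independent features, `coef * (x - mean(x))`. The number of clusters (3) is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 1. Train a model (synthetic data, assumed for illustration).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 2. Evaluate it.
overall_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# 3. Get SHAP values. Assumption: for a linear model with independent features,
#    the SHAP value of feature i is coef_i * (x_i - mean_i), in log-odds units.
background_mean = X_train.mean(axis=0)
shap_values = model.coef_[0] * (X_test - background_mean)
base_value = model.intercept_[0] + model.coef_[0] @ background_mean

# 4. Cluster the SHAP values (not the raw features).
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(shap_values)

# 5. Generate outputs: a small per-cluster summary.
print(f"overall AUC: {overall_auc:.3f}")
for c in np.unique(cluster_ids):
    mask = cluster_ids == c
    top_feature = np.abs(shap_values[mask]).mean(axis=0).argmax()
    print(
        f"cluster {c}: n={mask.sum()}, "
        f"positive rate={y_test[mask].mean():.2f}, "
        f"top feature index={top_feature}"
    )
```

Even in this reduced form the user has to make several choices (model, explainer, clustering algorithm, number of clusters, summary statistics), which illustrates why the tool is hard to keep simple.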

Let's see what @sbjelogr thinks.

Matgrb, Mar 26 '21 14:03