
Add Feature Importance Tracking

hcho1111 opened this issue 3 years ago

If models are run in parallel, we should try to include additional visualizations for evaluating feature importances. Local and global feature importances can be calculated using SHAP values. From my limited understanding, interpretability/feature importances is one area that sklearn does not go into in significant depth. Perhaps we could change that and implement some additional comparison metrics.

We could also incorporate visualizations specific to LASSO/ridge regression models (e.g., coefficient magnitudes). These are just some thoughts that popped into my head; I will comment below once I have fleshed out my ideas.
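To make the global vs. local distinction concrete, here is a rough sketch of how both could be computed with the `shap` package (a sketch only; the dataset and model below are just placeholders):

```python
# Sketch: global and local feature importances from SHAP values.
# Assumes the `shap` package is installed; dataset/model are placeholders.
import numpy as np
import pandas as pd
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes(as_frame=True)
X, y = data.data, data.target
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer assumes a tree-based model; other explainers exist for other models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one attribution per sample and feature

# Global importance: mean absolute SHAP value per feature
global_importance = pd.Series(
    np.abs(shap_values).mean(axis=0), index=X.columns
).sort_values(ascending=False)
print(global_importance)

# Local importance: attributions for a single observation
local_importance = pd.Series(shap_values[0], index=X.columns)
print(local_importance.sort_values(key=np.abs, ascending=False))
```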

hcho1111 avatar Oct 13 '22 20:10 hcho1111

That's a pretty good idea, @edublancas thoughts?

idomic avatar Oct 14 '22 11:10 idomic

sure. this sounds good. do you wanna work on it? feel free to do some research on which feature importance methods you think we should incorporate, we can discuss them and then you can work on the implementation

edublancas avatar Oct 14 '22 19:10 edublancas

Would love to work on this. If we go the way of using SHAP values, there is a disadvantage that arises from the number of dependencies that would have to be installed.

Just from a quick check comparing ploomber's and sklearn-evaluation's existing dependencies, I came up with an upper bound of ~22 MB on the additional packages that would need to be installed. In order of size:

  • numba (19 MB)
  • shap (2.5 MB, the main package; the others listed are its dependencies)
  • cloudpickle (124 KB)
  • slicer (112 KB)

The largest benefit is that installing the shap package would save us from implementing many complex calculations ourselves. It would also give us access to the extensive visualization suite the package offers.

That said, I know we should shy away from adding dependencies when possible. Still, in terms of model interpretability and feature importance metrics, I'd argue this would be a good addition to our current tools: SHAP values are widely used for modeling and (to the best of my knowledge) are the gold standard in industry. Would love to hear your thoughts.
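For reference, this is the kind of plot the shap package already ships with (a minimal, self-contained sketch; assumes shap and a matplotlib backend, and the dataset/model are placeholders):

```python
# Sketch: built-in shap summary plots (assumes `shap` and matplotlib are installed).
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes(as_frame=True)
X, y = data.data, data.target
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Beeswarm-style summary: distribution of SHAP values for every feature
shap.summary_plot(shap_values, X)

# Bar summary: mean absolute SHAP value per feature
shap.summary_plot(shap_values, X, plot_type="bar")
```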

hcho1111 avatar Oct 24 '22 18:10 hcho1111

I'm okay with it if they're optional dependencies.

I'm unsure about the implementation. What are you thinking of building? I've used the shap package before and I remember it has some plotting capabilities; is this something that would replace shap or enhance it?

edublancas avatar Oct 24 '22 19:10 edublancas

I was thinking we could start by incorporating it into the make_report function or the general Report object. It would be interesting to include reports on 1) global feature importances and 2) a sample of local feature importances. The second part is somewhat arbitrary because it depends on which individual points we analyze, but I'm sure we could come up with a way to 'randomly' pick points to plot (or perhaps just let the user list the points they want).

That was a long way around answering your question: I think we could start with a base implementation as part of the package and then build from there.
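To be concrete, here is a purely hypothetical sketch of the kind of helper I'm picturing; none of these names exist in sklearn-evaluation, they're just placeholders:

```python
# Hypothetical sketch only: `feature_importance_summary` and its parameters are
# placeholders, not existing sklearn-evaluation API.
import numpy as np
import pandas as pd
import shap


def feature_importance_summary(model, X, n_local_examples=3, random_state=0):
    """Global importances plus a few randomly sampled local explanations."""
    # TreeExplainer assumes a tree-based model
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    # 1) Global: mean absolute SHAP value per feature
    global_importance = pd.Series(
        np.abs(shap_values).mean(axis=0), index=X.columns
    ).sort_values(ascending=False)

    # 2) Local: attributions for a small random sample of observations
    rng = np.random.default_rng(random_state)
    idx = rng.choice(len(X), size=n_local_examples, replace=False)
    local_examples = {int(i): pd.Series(shap_values[i], index=X.columns) for i in idx}

    return global_importance, local_examples
```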

hcho1111 avatar Oct 24 '22 20:10 hcho1111

I'm assuming these plots you're thinking of are not already available in the SHAP package?

let's start by creating some new plots in the sklearn_evaluation.plot module
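e.g. something roughly like this (just a sketch; the function name and signature are made up, following the ax=None pattern the other plot functions use):

```python
# Sketch of a possible new plot function for sklearn_evaluation.plot
# (hypothetical name/signature; not existing API).
import matplotlib.pyplot as plt
import numpy as np


def shap_feature_importances(shap_values, feature_names, top_n=15, ax=None):
    """Horizontal bar plot of mean absolute SHAP value per feature."""
    if ax is None:
        _, ax = plt.subplots()

    importance = np.abs(np.asarray(shap_values)).mean(axis=0)
    order = np.argsort(importance)[::-1][:top_n]

    # plot the most important feature at the top
    ax.barh(np.array(feature_names)[order][::-1], importance[order][::-1])
    ax.set_xlabel("mean |SHAP value|")
    ax.set_title("Feature importances (SHAP)")
    return ax
```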

edublancas avatar Oct 24 '22 20:10 edublancas

Closing due to inactivity

idomic avatar Jan 12 '23 19:01 idomic