
Feature importance / model inspection

Open baggepinnen opened this issue 4 years ago • 12 comments

It would be nice to have some integrated tools for model inspection and feature importance (FI). Below are some links to resources and a summary of what's available in scikit-learn.

scikit-learn exposes a number of tools for understanding the relative importance of features in a dataset. These tools are general in the sense that they can be made to work with many different kinds of models. They are organized in a module called "Inspection", which I find fitting, since they all allow the user to understand or inspect the result of fitting a model in ways other than simply measuring the error/accuracy. Some of them are linked below.

baggepinnen avatar Dec 20 '19 00:12 baggepinnen

Thanks for this.

For bagging ensembles it's reasonably straightforward. Some models we interface with also have it (e.g. XGBoost), so it can just be part of the interface (via the report); I'll look into this for XGBoost.

Support for permutation/drop-column FI seems reasonably easy; the main question is where the implementation would go, perhaps a dedicated "model inspection" module or package in MLJ, or something of the sort (a sketch of the permutation approach follows this comment).

The rest of your suggestions are a bit trickier. LIME is nice but is basically an entire package in itself, as is SHAP, which is cited in the article at your last point.

tlienart avatar Dec 20 '19 13:12 tlienart
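
For concreteness, here is a minimal sketch of the permutation approach mentioned above, written against the MLJ machine API. The helper name, its keyword defaults, and the choice of `rms` are illustrative, not an existing MLJ function; it assumes a fitted machine for a deterministic model (for a probabilistic model, substitute `predict_mode` or `predict_mean` for `predict`):

```julia
using MLJ
using Random
import Tables

# Permutation importance: shuffle one feature column at a time and record
# how much a chosen loss degrades relative to the unshuffled baseline.
# `mach` is assumed to be a fitted MLJ machine for a deterministic model.
function permutation_importance(mach, X, y; measure=rms, rng=Random.default_rng())
    baseline = measure(predict(mach, X), y)
    cols = Tables.columntable(X)
    importances = Dict{Symbol,Float64}()
    for name in keys(cols)
        shuffled = NamedTuple{(name,)}((shuffle(rng, collect(cols[name])),))
        Xperm = merge(cols, shuffled)  # table with one column permuted
        # positive values mean the model relied on this feature:
        importances[name] = measure(predict(mach, Xperm), y) - baseline
    end
    return importances
end
```

Drop-column importance is analogous, except the model is refitted without each feature in turn, which is more expensive but accounts for the effect of retraining.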

Feature importance is an interesting one, because most of the measures out there are rather ad hoc and model-dependent. That is, the very definition of feature importance depends on the model (e.g., the absolute value of a coefficient in a linear model makes no sense for a decision tree). And for certain models, e.g. trees and random forests, there are several inequivalent methods in common use. The SHAP paper cited above describes an approach that is genuinely model-independent; unless someone is aware of another such approach, I suggest any generic MLJ tool follow that one. There is already an implementation of SHAP in Python, if I remember correctly.

ablaom avatar Dec 30 '19 20:12 ablaom

The recently created https://github.com/nredell/ShapML.jl may also be a very nice addition (already compatible with MLJ, as far as I can see; a usage sketch follows this comment). cc @nredell

tlienart avatar Feb 02 '20 15:02 tlienart
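
For reference, usage with an MLJ model looks roughly like the following sketch, based on the ShapML README; `mach` (a fitted MLJ machine) and `X` (a `DataFrame` of features) are assumed, and the keyword names reflect the README at the time:

```julia
using DataFrames
using MLJ
import ShapML

# ShapML is model-agnostic: it only needs a predict function of the form
# f(model, data) returning a single-column DataFrame of predictions.
function predict_function(model, data)
    return DataFrame(y_pred = predict(model, data))
end

explain   = X[1:100, :]  # instances to explain
reference = X            # background data used for the baseline

data_shap = ShapML.shap(explain = explain,
                        reference = reference,
                        model = mach,                  # fitted MLJ machine
                        predict_function = predict_function,
                        sample_size = 60,              # Monte Carlo samples
                        seed = 1)
```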

My plans for ShapML can be found on the Discourse site (https://discourse.julialang.org/t/ml-feature-importance-in-julia/17196/12), but I'm posting here for posterity's sake.

Just sitting down for the first round of refactoring/feature additions today. I'll code with these guidelines in mind (https://github.com/invenia/BlueStyle) as well as take a trip through the MLJ code base. And if a general feature-importance package pops up in the future, I wouldn't be opposed to helping fold ShapML in, if it's up to par and hasn't expanded too much by then.

nredell avatar Feb 17 '20 06:02 nredell

cc @sjvollmer (for summer FAIRness student, if not already aware)

ablaom avatar Apr 29 '20 02:04 ablaom

My current inclination is to see whether this can be satisfactorily addressed with third-party packages, such as the Shapley one. A proof of concept would make a great MLJTutorial.

If something more integrated makes sense, though, I'm interested to hear about it.

ablaom avatar Apr 29 '20 02:04 ablaom

Any update on feature importance integration?

vishalhedgevantage avatar Jul 02 '21 11:07 vishalhedgevantage

@vishalhedgevantage There are some GSoC students working on better integration of interpretable machine learning (LIME/Shapley). And there is this issue, which I opened to support recursive feature elimination. However, the volunteer who had expressed an interest in the latter must have gotten busy with other things...

ablaom avatar Jul 04 '21 21:07 ablaom

> For bagging ensembles it's reasonably straightforward. Some models we interface with also have it (e.g. XGBoost), so it can just be part of the interface (via the report); I'll look into this for XGBoost.

Any movement on this?

Moelf avatar Aug 09 '21 14:08 Moelf

@Moelf Feel free to open a request at XGBoost.jl to expose feature importances there.

To be honest, current priorities for MLJ favour pure-Julia solutions. I'm pretty sure EvoTrees.jl (which has an MLJ interface) exposes feature importances in the report (see the sketch below). Perhaps you want to check out that well-maintained tree-boosting package.

ablaom avatar Aug 12 '21 01:08 ablaom
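
A minimal sketch of pulling importances out of EvoTrees through MLJ; `make_regression` stands in for real data, and exactly where the importances surface depends on the EvoTrees/MLJ versions in use:

```julia
using MLJ

# Load the pure-Julia gradient-boosted-tree model (assumes EvoTrees.jl
# is installed in the active environment).
EvoTreeRegressor = @load EvoTreeRegressor pkg=EvoTrees

X, y = make_regression(200, 5)  # synthetic regression data from MLJ
mach = machine(EvoTreeRegressor(), X, y)
fit!(mach)

# Gain-based importances are surfaced by the interface; depending on the
# version they appear in `report(mach)` or via `feature_importances(mach)`.
report(mach)
```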

I think XGBoost is SOTA for many things (especially in my line of work, and it turns out, amazingly enough, that XGBoost was born in this field). Of course a native-Julia XGBoost would be ideal and very cool, but I don't think it's on anyone's priority list.

Moelf avatar Aug 12 '21 20:08 Moelf

EvoTrees.jl is a pure-Julia gradient tree boosting package which already has much of the functionality found in XGBoost and, as far as I can tell, implements essentially the same algorithm. It doesn't have all the bells and whistles, but it is being actively developed.

ablaom avatar Aug 13 '21 01:08 ablaom