[python-package] Add `feature_names_in_` attribute for scikit-learn estimators (fixes #6279)
Fixes: #6279
Related: https://github.com/scikit-learn/scikit-learn/issues/28337
In https://github.com/scikit-learn/scikit-learn/issues/28337#issuecomment-1921071816, I noticed someone said that this feature comes for free if you inherit from `BaseEstimator`.

lightgbm's scikit-learn estimators do inherit from `BaseEstimator`:
https://github.com/microsoft/LightGBM/blob/cc733f8595267f313886f92ed5d1285010ba8f3f/python-package/lightgbm/sklearn.py#L1072
https://github.com/microsoft/LightGBM/blob/cc733f8595267f313886f92ed5d1285010ba8f3f/python-package/lightgbm/sklearn.py#L430
https://github.com/microsoft/LightGBM/blob/cc733f8595267f313886f92ed5d1285010ba8f3f/python-package/lightgbm/sklearn.py#L15-L17
https://github.com/microsoft/LightGBM/blob/cc733f8595267f313886f92ed5d1285010ba8f3f/python-package/lightgbm/compat.py#L106
https://github.com/microsoft/LightGBM/blob/cc733f8595267f313886f92ed5d1285010ba8f3f/python-package/lightgbm/compat.py#L83
If you get into this and find that lightgbm is actually getting that attribute via inheriting from BaseEstimator, don't give up on the PR! Those tests I mentioned would still be very valuable to catch changes to that support in the future and to be sure that lightgbm's integration with it has the expected behavior.
@jameslamb Thank you for the great feedback! I'm working on adding these suggestions in.
Is there a way you'd recommend recreating the development environment? I couldn't find info on this in CONTRIBUTING.md, so I started to mimic the logic in .ci/test.sh, but having to specify the different global variables as they appear in the script prevents this from being a quick way to set up the environment. I just want to make sure I'm not missing a quicker way.
Thanks in advance!
Thanks! There isn't a well-documented way to set up a local development environment for the Python package today, it's something I'd like to add soon.
Here's how I develop on LightGBM:
- create a `conda` environment (I use miniforge, to prefer the conda-forge channel)
```shell
conda create \
  --name lgb-dev \
  cloudpickle \
  dask \
  distributed \
  joblib \
  matplotlib \
  numpy \
  python-graphviz \
  pytest \
  pytest-cov \
  python=3.11 \
  scikit-learn \
  scipy
```
- build the C++ library one time (assuming you're making Python-only changes)
```shell
rm -rf ./build
mkdir ./build
cd ./build
cmake ..
make -j4 _lightgbm
```
- make changes to the Python code
- install the Python package in the conda environment
```shell
source activate lgb-dev
sh build-python.sh install --precompile
```
- run the tests
```shell
pytest tests/python_package_test
```
- repeat steps 3-5 until you're confident in your changes
- run the auto-formatting and some of the linting stuff (this is a work in progress, see #6304)
```shell
pre-commit run --all-files
```
> If you get into this and find that `lightgbm` is actually getting that attribute via inheriting from `BaseEstimator`, don't give up on the PR!
It turns out scikit-learn only adds the `feature_names_in_` attribute if the input data has feature names, while LightGBM will add column names of the format `"Column_{i}"` if the input data doesn't have column names. I've added a comment to a test to highlight this difference from scikit-learn.
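The naming difference described above can be illustrated with a small pure-Python sketch. This is illustrative only: the helper name `resolve_feature_names` is hypothetical, not LightGBM's actual implementation.

```python
def resolve_feature_names(columns, n_features):
    """Hypothetical helper sketching the behavior discussed above.

    scikit-learn only records feature names when the input has them;
    LightGBM instead falls back to generated names "Column_0", "Column_1", ...
    when the input (e.g. a bare numpy array) has no column names.
    """
    if columns is not None:
        # DataFrame-like input: keep the provided names
        return list(columns)
    # LightGBM-style fallback: generate "Column_{i}" names
    return [f"Column_{i}" for i in range(n_features)]

# Input with names keeps them:
print(resolve_feature_names(["age", "income"], 2))  # ['age', 'income']
# A bare array gets generated names:
print(resolve_feature_names(None, 3))  # ['Column_0', 'Column_1', 'Column_2']
```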
@microsoft-github-policy-service agree
@jameslamb given that `_validate_data` needs to be called in order to get these attributes for free from `BaseEstimator`, would it make sense to call this method within the LGBM estimators' `fit` methods (like many other scikit-learn estimators; one example: https://github.com/scikit-learn/scikit-learn/issues/27907#issuecomment-1848837382)?
One behavioral difference between LGBM and scikit-learn is that LGBM assigns artificial names to features when they are unnamed, whereas scikit-learn doesn't create artificial names and also doesn't create the `feature_names_in_` attribute. So for numpy arrays, even calling `_validate_data` within `fit` wouldn't make this attribute accessible.
I wanted to confirm that we want to add `_validate_data`, but also keep the behavior of setting names when they're not present.
> `feature_names_in_` should return a 1D numpy array, not a list
Sounds good, will fix.
> `get_feature_names_out()` function should be implemented (right? or is that only for estimators that define `.transform()`?)
I have less of an opinion on this one, but based on the SLEP, it does look like it should be specifically for estimators with the `transform` method:

> **Scope**
>
> The API for input and output feature names includes a `feature_names_in_` attribute for all estimators, and a `get_feature_names_out` method for any estimator with a `transform` method, i.e. they expose the generated feature names via the `get_feature_names_out` method.
> I wanted to confirm that we want to add `_validate_data`
That method being prefixed with a `_` suggests to me that it's an internal implementation detail of scikit-learn that could be changed in a future release of that library.
Can you find me some authoritative source saying that projects implementing their own estimators are encouraged to call that method? The comment you linked above is a specific recommendation from a scikit-learn maintainer about what to do for 2 estimators within scikit-learn... I don't interpret that as encouragement that other projects should call it.
xgboost does not: https://github.com/search?q=repo%3Admlc%2Fxgboost%20%22_validate_data%22&type=code
but catboost does: https://github.com/catboost/catboost/blob/19b60a20b2b1733c528b40c6c9ebe2f3d1f5dbde/contrib/python/scikit-learn/py3/sklearn/base.py#L537
Let's please pause on this work until some scikit-learn maintainer gives an authoritative answer on https://github.com/scikit-learn/scikit-learn/issues/28337.
Alright @nicklamiller I've read through the response at https://github.com/scikit-learn/scikit-learn/issues/28337#issuecomment-2068815197 carefully... I think we have enough information to proceed.
I think we should do the following here in LightGBM:
- add `feature_names_in_`
- add `get_feature_names_out()`
- do not directly call `_validate_data()` or any other `scikit-learn` methods prefixed with `_`
Let's try to as closely as possible follow the way that xgboost has handled this in their Python package.
Are you interested in continuing this?
@jameslamb I tried to closely follow xgboost w.r.t. `feature_names_in_`, which is similar in its implementation in that feature names are obtained from an attribute on a `_Booster` object that is only available once the model is fitted. If feature names can't be accessed because the model is not fitted, a `NotFittedError` is raised (similar to the `LGBMNotFittedError` that we raise here).
xgboost doesn't implement the `get_feature_names_out` method, so I followed scikit-learn's approach for this. Most objects that have this method are subclasses of `TransformerMixin`; some examples:
The few classes that inherit from an estimator-like mixin (`ClassifierMixin` or `RegressorMixin`) appear to be meta-estimators that specify features for each individual estimator the meta-estimator is comprised of:
So it looks like `get_feature_names_out` is in fact defined almost exclusively for transformers. I've still added it here, but since `LGBMModel` and its subclasses don't apply transformations to features and aren't meta-estimators, I simply return `feature_names_in_` within the method.
Please let me know how this looks!
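As a rough sketch of the pattern described in this comment, the shape of the change might look something like the following. This is a simplified stand-in, not LightGBM's or xgboost's actual code: `EstimatorSketch` and its internals are hypothetical, and only the behaviors discussed above (not-fitted error, 1D numpy array return, `get_feature_names_out` delegating to `feature_names_in_`) are modeled.

```python
import numpy as np


class LGBMNotFittedError(ValueError):
    """Simplified stand-in for lightgbm's not-fitted error."""


class EstimatorSketch:
    """Hypothetical estimator modeling the behavior discussed above."""

    def __init__(self):
        # In the real estimator, names would come from the fitted Booster.
        self._booster_feature_names = None

    def fit(self, feature_names):
        self._booster_feature_names = list(feature_names)
        return self

    @property
    def feature_names_in_(self):
        # Raise if not fitted, mirroring the NotFittedError behavior described.
        if self._booster_feature_names is None:
            raise LGBMNotFittedError("Need to call fit() before accessing feature_names_in_.")
        # Return a 1D numpy array, not a list, per the review feedback above.
        return np.asarray(self._booster_feature_names, dtype=object)

    def get_feature_names_out(self):
        # LGBMModel applies no transformations to features, so this simply
        # returns feature_names_in_ (note: this method was later removed
        # from the PR; see the follow-up comments).
        return self.feature_names_in_
```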
> So it looks like `get_feature_names_out` is in fact defined almost exclusively for transformers. I've still added it here but since `LGBMModel` and its subclasses don't apply transformations to features and aren't meta-estimators, I simply return `feature_names_in_` within the method.

I asked for it to be added based on this statement from @adrinjalali in scikit-learn/scikit-learn#28337 (comment): "scikit-learn estimators all add ... `get_feature_names_out()`".

I think the choice to just return `feature_names_in_` makes sense for `lightgbm`, thank you.
I think for lightgbm's classifiers and regressors, it makes sense to not implement `get_feature_names_out()` at all. They're indeed only available on transformers in scikit-learn.
Alright, thanks for that clarification.
@nicklamiller please remove that method then.
@jameslamb Thanks very much for the suggestions! I removed `get_feature_names_out()` and fixed up the tests as specified above; please let me know how this looks.
Thanks very much for keeping this up to date with master.
I'll review again soon, but it's too late for this to make it into the v4.4.0 release (#6439). Sorry about that, but we are rushing to get that release up before numpy 2.0 comes out 2 days from now.
@borchero could you re-run the failing Azure check and merge this if it passes? I'll be away from my laptop for the next day or 2
I just restarted that failing CI job, will check back in a bit.
Sorry @jameslamb, I saw your comment only now! Unfortunately, I also can't restart Azure jobs, my login continues to fail 🙄
> Unfortunately, I also can't restart Azure jobs, my login continues to fail 🙄
No problem! I know, I also sometimes have to log into Azure DevOps multiple times and in different ways to finally get in. I'll send you a private message with some tips to try.
Thanks so much @nicklamiller ! I'm sorry this was such a long process for you. We hope you'll consider coming back and contributing more in the future.
If you're interested in helping with the Python package specifically, this would be a good next place to go: #6361
@jameslamb thank you for all the help and guidance, I'm looking forward to contributing more!