[python-package] Add `feature_names_in_` attribute for scikit-learn estimators (fixes #6279)
Fixes: #6279
Related: https://github.com/scikit-learn/scikit-learn/issues/28337
In https://github.com/scikit-learn/scikit-learn/issues/28337#issuecomment-1921071816, I noticed someone said that this feature comes for free if you inherit from `BaseEstimator`.

lightgbm's scikit-learn estimators do inherit from `BaseEstimator`:
https://github.com/microsoft/LightGBM/blob/cc733f8595267f313886f92ed5d1285010ba8f3f/python-package/lightgbm/sklearn.py#L1072
https://github.com/microsoft/LightGBM/blob/cc733f8595267f313886f92ed5d1285010ba8f3f/python-package/lightgbm/sklearn.py#L430
https://github.com/microsoft/LightGBM/blob/cc733f8595267f313886f92ed5d1285010ba8f3f/python-package/lightgbm/sklearn.py#L15-L17
https://github.com/microsoft/LightGBM/blob/cc733f8595267f313886f92ed5d1285010ba8f3f/python-package/lightgbm/compat.py#L106
https://github.com/microsoft/LightGBM/blob/cc733f8595267f313886f92ed5d1285010ba8f3f/python-package/lightgbm/compat.py#L83
If you get into this and find that lightgbm is actually getting that attribute via inheriting from BaseEstimator, don't give up on the PR! Those tests I mentioned would still be very valuable to catch changes to that support in the future and to be sure that lightgbm's integration with it has the expected behavior.
@jameslamb Thank you for the great feedback! I'm working on adding these suggestions in.
Is there a way you'd recommend recreating the development environment? I couldn't find info on this in CONTRIBUTING.md, so I started to mimic the logic in .ci/test.sh, but having to specify the different global variables as they appear in the script prevents this from being a quick way to set up the environment. I just want to make sure I'm not missing a quicker way.
Thanks in advance!
Thanks! There isn't a well-documented way to set up a local development environment for the Python package today, it's something I'd like to add soon.
Here's how I develop on LightGBM:
- create a `conda` environment (I use miniforge, to prefer the conda-forge channel)
```shell
conda create \
  --name lgb-dev \
  cloudpickle \
  dask \
  distributed \
  joblib \
  matplotlib \
  numpy \
  python-graphviz \
  pytest \
  pytest-cov \
  python=3.11 \
  scikit-learn \
  scipy
```
- build the C++ library one time (assuming you're making Python-only changes)
```shell
rm -rf ./build
mkdir ./build
cd ./build
cmake ..
make -j4 _lightgbm
```
- make changes to the Python code
- install the Python package in the conda environment
```shell
source activate lgb-dev
sh build-python.sh install --precompile
```
- run the tests
```shell
pytest tests/python_package_test
```
- repeat steps 3-5 until you're confident in your changes
- run the auto-formatting and some of the linting stuff (this is a work in progress, see #6304)
```shell
pre-commit run --all-files
```
> If you get into this and find that `lightgbm` is actually getting that attribute via inheriting from `BaseEstimator`, don't give up on the PR!
It turns out scikit-learn only adds the `feature_names_in_` attribute if the input data has feature names, while LightGBM will add column names of the format `"Column_{i}"` if the input data doesn't have column names. I've added a comment to a test to highlight this difference from scikit-learn.
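The naming difference described above can be illustrated with a small pure-Python sketch. This is illustrative only: the helper name `resolve_feature_names` is hypothetical, not LightGBM's actual implementation.

```python
def resolve_feature_names(columns, n_features):
    """Hypothetical helper sketching the behavior discussed above.

    scikit-learn only records feature names when the input has them;
    LightGBM instead falls back to generated names "Column_0", "Column_1", ...
    when the input (e.g. a bare numpy array) has no column names.
    """
    if columns is not None:
        # DataFrame-like input: keep the provided names
        return list(columns)
    # LightGBM-style fallback: generate "Column_{i}" names
    return [f"Column_{i}" for i in range(n_features)]

# Input with names keeps them:
print(resolve_feature_names(["age", "income"], 2))  # ['age', 'income']
# A bare array gets generated names:
print(resolve_feature_names(None, 3))  # ['Column_0', 'Column_1', 'Column_2']
```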
@microsoft-github-policy-service agree
@jameslamb given that `_validate_data` needs to be called in order to get these attributes for free from `BaseEstimator`, would it make sense to call this method within the LGBM estimators' `fit` methods (like many other scikit-learn estimators; one example: https://github.com/scikit-learn/scikit-learn/issues/27907#issuecomment-1848837382)?
One behavioral difference between LGBM and scikit-learn is that LGBM assigns artificial names to features when they are unnamed, whereas scikit-learn doesn't create artificial names and also doesn't create the `feature_names_in_` attribute. So for numpy arrays, even calling `_validate_data` within `fit` wouldn't make this attribute accessible.
I wanted to confirm that we want to add `_validate_data`, but also keep the behavior of setting names when they're not present.
> `feature_names_in_` should return a 1D numpy array, not a list
Sounds good, will fix.
> `get_feature_names_out()` function should be implemented (right? or is that only for estimators that define `.transform()`?)
I have less of an opinion on this one, but based on the SLEP, it does look like it should be specifically for estimators with the `transform` method:

> **Scope**
>
> The API for input and output feature names includes a `feature_names_in_` attribute for all estimators, and a `get_feature_names_out` method for any estimator with a `transform` method, i.e. they expose the generated feature names via the `get_feature_names_out` method.
> I wanted to confirm that we want to add `_validate_data`
That method being prefixed with a `_` suggests to me that it's an internal implementation detail of scikit-learn that could be changed in a future release of that library.
Can you find me some authoritative source saying that projects implementing their own estimators are encouraged to call that method? The comment you linked above is a specific recommendation from a scikit-learn maintainer about what to do for 2 estimators within scikit-learn... I don't interpret that as encouragement that other projects should call it.
xgboost does not: https://github.com/search?q=repo%3Admlc%2Fxgboost%20%22_validate_data%22&type=code
but catboost does: https://github.com/catboost/catboost/blob/19b60a20b2b1733c528b40c6c9ebe2f3d1f5dbde/contrib/python/scikit-learn/py3/sklearn/base.py#L537
Let's please pause on this work until some scikit-learn maintainer gives an authoritative answer on https://github.com/scikit-learn/scikit-learn/issues/28337.
Alright @nicklamiller I've read through the response at https://github.com/scikit-learn/scikit-learn/issues/28337#issuecomment-2068815197 carefully... I think we have enough information to proceed.
I think we should do the following here in LightGBM:
- add `feature_names_in_`
- add `get_feature_names_out()`
- do not directly call `_validate_data()` or any other `scikit-learn` methods prefixed with `_`
Let's try to as closely as possible follow the way that xgboost has handled this in their Python package.
Are you interested in continuing this?
@jameslamb I tried to closely follow xgboost w.r.t. `feature_names_in_`, which is similar in its implementation in that feature names are obtained from an attribute on a `_Booster` object that is only available once the model is fitted. If feature names can't be accessed because the model is not fitted, a `NotFittedError` is raised (similar to the `LGBMNotFittedError` that we raise here).
xgboost doesn't implement the `get_feature_names_out` method, so I followed scikit-learn's approach for this. Most objects that have this method are subclasses of `TransformerMixin`; some examples:
The few classes that inherit from an estimator-like mixin (`ClassifierMixin` or `RegressorMixin`) appear to be meta-estimators that specify features for each individual estimator the meta-estimator is comprised of:
So it looks like `get_feature_names_out` is in fact defined almost exclusively for transformers. I've still added it here, but since `LGBMModel` and its subclasses don't apply transformations to features and aren't meta-estimators, I simply return `feature_names_in_` within the method.
Please let me know how this looks!
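As a rough sketch of the pattern described in this comment, the shape of the change might look something like the following. This is a simplified stand-in, not LightGBM's or xgboost's actual code: `EstimatorSketch` and its internals are hypothetical, and only the behaviors discussed above (not-fitted error, 1D numpy array return, `get_feature_names_out` delegating to `feature_names_in_`) are modeled.

```python
import numpy as np


class LGBMNotFittedError(ValueError):
    """Simplified stand-in for lightgbm's not-fitted error."""


class EstimatorSketch:
    """Hypothetical estimator modeling the behavior discussed above."""

    def __init__(self):
        # In the real estimator, names would come from the fitted Booster.
        self._booster_feature_names = None

    def fit(self, feature_names):
        self._booster_feature_names = list(feature_names)
        return self

    @property
    def feature_names_in_(self):
        # Raise if not fitted, mirroring the NotFittedError behavior described.
        if self._booster_feature_names is None:
            raise LGBMNotFittedError("Need to call fit() before accessing feature_names_in_.")
        # Return a 1D numpy array, not a list, per the review feedback above.
        return np.asarray(self._booster_feature_names, dtype=object)

    def get_feature_names_out(self):
        # LGBMModel applies no transformations to features, so this simply
        # returns feature_names_in_ (note: this method was later removed
        # from the PR; see the follow-up comments).
        return self.feature_names_in_
```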
> So it looks like `get_feature_names_out` is in fact defined almost exclusively for transformers. I've still added it here but since `LGBMModel` and its subclasses don't apply transformations to features and aren't meta-estimators, I simply return `feature_names_in_` within the method.

I asked for it to be added based on this statement from @adrinjalali in scikit-learn/scikit-learn#28337 (comment): "scikit-learn estimators all add ... `get_feature_names_out()`".

I think the choice to just return `feature_names_in_` makes sense for `lightgbm`, thank you.
I think for lightgbm's classifiers and regressors, it makes sense to not implement `get_feature_names_out()` at all. They're indeed only available on transformers in scikit-learn.
Alright, thanks for that clarification.
@nicklamiller please remove that method then.
@jameslamb Thanks very much for the suggestions! I removed `get_feature_names_out()` and fixed up the tests as specified above; please let me know how this looks.
Thanks very much for keeping this up to date with master.
I'll review again soon, but it's too late for this to make it into the v4.4.0 release (#6439). Sorry about that, but we are rushing to get that release up before numpy 2.0 comes out 2 days from now.
@borchero could you re-run the failing Azure check and merge this if it passes? I'll be away from my laptop for the next day or 2
I just restarted that failing CI job, will check back in a bit.
Sorry @jameslamb, I saw your comment only now! Unfortunately, I also can't restart Azure jobs, my login continues to fail 🙄
> Unfortunately, I also can't restart Azure jobs, my login continues to fail 🙄
No problem! I know, I also sometimes have to log into Azure DevOps multiple times and in different ways to finally get in. I'll send you a private message with some tips to try.
Thanks so much @nicklamiller ! I'm sorry this was such a long process for you. We hope you'll consider coming back and contributing more in the future.
If you're interested in helping with the Python package specifically, this would be a good next place to go: #6361
@jameslamb thank you for all the help and guidance, I'm looking forward to contributing more!