
Ignore NaNs before using `OrdinalEncoder`

Open datacubeR opened this issue 3 years ago • 1 comments

Is your feature request related to a problem? `OrdinalEncoder` should accept nulls. Sometimes you don't want to impute directly and would rather rely on the built-in missing-value handling of XGBoost, LightGBM or CatBoost. Because the encoder requires imputation beforehand, this is currently not possible.

Describe the solution you'd like: I like that Feature-engine forces you to impute first, but I would add some kind of flag, e.g. `ignore_nan`, defaulting to `False`, for cases where we want to use another imputation afterwards.

Hope you find this helpful.

datacubeR avatar Aug 06 '22 02:08 datacubeR

Hi @datacubeR

Thanks for the suggestion.

You are certainly not the first one who'd like the encoders to support NaN. I think @david-cortes made a similar suggestion here #481, am I right?

It would be great if those interested in this functionality could upvote / like or leave a comment in either of the two issues, so we can better gauge the interest.

solegalli avatar Aug 06 '22 13:08 solegalli

Hi @glevv

We've had a few requests to let the feature-engine encoders not raise an error when a variable contains NaN.

At the moment, the encoders are designed to require imputation before encoding.

I think the idea is to let the encoders encode variables that contain NaN while leaving the NaN as NaN. The motivation is that some algorithms, like LightGBM (not sure which others?), can handle NaN out of the box.

What do you think about this? Would this be useful only for LightGBM, or for something else too? If it is just LightGBM, is it worth the effort?

And would you be happy to pick this up?

I think we should add a param in the init, `handle_missing`, defaulting to "raise" so as not to break backwards compatibility, which users could change to "ignore" to leave NaN as NaN.
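For illustration only, here is a minimal sketch of how such a parameter could behave. `SketchOrdinalEncoder` and its internals are hypothetical and are not feature-engine's implementation; only the `handle_missing` name and its "raise" / "ignore" values come from the proposal above.

```python
import numpy as np
import pandas as pd


class SketchOrdinalEncoder:
    """Hypothetical, simplified ordinal encoder (not feature-engine code)."""

    def __init__(self, handle_missing="raise"):
        if handle_missing not in ("raise", "ignore"):
            raise ValueError("handle_missing must be 'raise' or 'ignore'")
        self.handle_missing = handle_missing

    def _check_missing(self, s):
        # Default behaviour: refuse to work on a variable that contains NaN.
        if self.handle_missing == "raise" and s.isna().any():
            raise ValueError(
                "Variable contains NaN; impute first or set handle_missing='ignore'."
            )

    def fit(self, s):
        self._check_missing(s)
        # Learn the mapping from observed categories only; NaN is never a category.
        categories = sorted(s.dropna().unique())
        self.mapping_ = {cat: i for i, cat in enumerate(categories)}
        return self

    def transform(self, s):
        self._check_missing(s)
        # Series.map leaves NaN as NaN because NaN is not a key in the mapping.
        # Side effect of this sketch: unseen categories also become NaN.
        return s.map(self.mapping_)


s = pd.Series(["red", "blue", np.nan, "red"])
encoded = SketchOrdinalEncoder(handle_missing="ignore").fit(s).transform(s)
```

With the default `handle_missing="raise"`, the same call to `fit` raises a `ValueError`, so existing pipelines would keep their current behaviour.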

solegalli avatar Nov 15 '22 15:11 solegalli

Hello @solegalli! I think XGBoost, LightGBM, CatBoost (simple, though) and HistGradientBoosting support inputs with NaNs. I think there were also some clustering algorithms that support NaNs, but nothing more. It's more of a UX/convenience improvement, so if you have more important tasks on the roadmap, this could wait. But it's possible to extend the functionality of `handle_missing` to support ignoring NaNs, and +1 on not changing the default value. As for making a PR: I can try, but I'm not sure I will have a lot of free time, since, well, end-of-year crunch and all that.

P.S. I'll start working on it on the weekend

glevv avatar Nov 15 '22 16:11 glevv

`handle_missing` could be implemented in the base class, in the `transform` function, but `RareLabelEncoder`, `OneHotEncoder` and `DecisionTreeEncoder` redefine `transform`, so they will need to be updated manually. `StringSimilarityEncoder` already has this functionality and will just need its names harmonized.
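A rough sketch of why the overriding encoders need manual work, under the assumption that the shared NaN check lives in the base `transform`. All class and method names below are hypothetical stand-ins and do not match feature-engine's internals.

```python
import numpy as np
import pandas as pd


class BaseSketchEncoder:
    """Hypothetical base class holding the shared handle_missing logic."""

    def __init__(self, mapping, handle_missing="raise"):
        self.mapping = mapping
        self.handle_missing = handle_missing

    def _check_missing(self, s):
        if self.handle_missing == "raise" and s.isna().any():
            raise ValueError("NaN found; impute first or use handle_missing='ignore'.")

    def transform(self, s):
        # Encoders that inherit this method get the NaN handling for free.
        self._check_missing(s)
        return s.map(self.mapping)  # NaN stays NaN


class InheritingEncoder(BaseSketchEncoder):
    """Uses the base transform unchanged, so handle_missing works as is."""


class OverridingEncoder(BaseSketchEncoder):
    """Redefines transform, so it has to repeat the NaN handling itself."""

    def transform(self, s):
        # Without this call, handle_missing would be silently skipped here.
        self._check_missing(s)
        # Stand-in for custom logic: one indicator column per known category.
        out = pd.DataFrame(
            {f"is_{cat}": (s == cat).astype(float) for cat in self.mapping}
        )
        # Keep missing rows missing in every output column.
        out.loc[s.isna(), :] = np.nan
        return out


s = pd.Series(["a", np.nan, "b"])
mapping = {"a": 0, "b": 1}
ordinal = InheritingEncoder(mapping, handle_missing="ignore").transform(s)
indicators = OverridingEncoder(mapping, handle_missing="ignore").transform(s)
```

In this sketch the inheriting encoder needs no changes, while the overriding one has to call the check and decide for itself how NaN rows should appear in its output.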

glevv avatar Nov 17 '22 16:11 glevv

Maybe it is worth exploring creating a MixIn?

solegalli avatar Nov 17 '22 17:11 solegalli

I'm not sure how it would help. My current understanding is that we need to add lines to the fit and transform methods that ignore NaNs and then reassign NaNs according to the input, but not all encoders use the base class methods. On top of that, the logic of some encoders, like DecisionTreeEncoder, just doesn't allow NaNs.

glevv avatar Nov 18 '22 10:11 glevv