category_encoders
OrdinalEncoder unseen value spec
Hi, this is not a bug report but rather a feature request (not sure if this is the place or how).
It would be great to be able to specify the value an unseen value should take when using the `OrdinalEncoder`, rather than having it fixed at -1.
For instance, I would like to use this encoder as a preprocessing step before calling a LightGBM classifier (which expects all categorical feature values to be non-negative integers), within a PMMLPipeline (which currently supports `ce.OrdinalEncoder`).
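To make the request concrete, here is a minimal pure-Python sketch of the behaviour I have in mind. It is not how category_encoders is implemented; `fit_ordinal`, `transform_ordinal` and the `unknown_value` parameter are hypothetical names used only for illustration.

```python
from typing import Dict, Hashable, Iterable, List


def fit_ordinal(values: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Assign codes 1, 2, ... to categories in order of first appearance."""
    mapping: Dict[Hashable, int] = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def transform_ordinal(
    values: Iterable[Hashable],
    mapping: Dict[Hashable, int],
    unknown_value: int = -1,  # hypothetical parameter: code for unseen categories
) -> List[int]:
    """Encode values; categories not seen during fitting get `unknown_value`."""
    return [mapping.get(value, unknown_value) for value in values]


mapping = fit_ordinal(["red", "green", "red", "blue"])

# Fixed default: the unseen "purple" becomes -1.
default_codes = transform_ordinal(["green", "purple"], mapping)

# Requested behaviour: the caller picks a non-negative code instead.
custom_codes = transform_ordinal(["green", "purple"], mapping, unknown_value=0)
```

With `unknown_value=0` every output is a non-negative integer, which is what the LightGBM step needs.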
Currently, the workaround is to construct the encoding mapping beforehand and pass it as the `mapping` param, which is a bit of an overkill... Am I missing something? Is there another way?
Thanks!
I understand this is going to be introduced in scikit-learn's `OrdinalEncoder` in v0.24.
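For reference, a short sketch of what that looks like on the scikit-learn side, assuming scikit-learn 0.24 or later is installed. The toy data and the choice of 99 as the code for unseen categories are just examples.

```python
from sklearn.preprocessing import OrdinalEncoder

# unknown_value must not collide with a code used for the fitted categories
# (0, 1, 2 here), so 99 is an arbitrary safe choice for this toy example.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=99)
encoder.fit([["red"], ["green"], ["blue"]])

# Categories are sorted when fitting, so blue=0, green=1, red=2.
codes = encoder.transform([["red"], ["purple"]])
```

Note that scikit-learn's encoder numbers categories from 0, so the value chosen for unknowns has to sit outside the range of fitted codes.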
One of the big advantages of this library is a rather common interface to all the different encoders (e.g. for handling missing values or unknowns). It makes a lot of sense to keep this. So if we want to make this flexible, we'd need to introduce it for all encoders, and then it would be consistent to also make the missing value (currently -2) flexible as well.
The downside is that another two parameters would be introduced in the `__init__` function, which makes it somewhat big. The workaround is also rather easy, isn't it? You just replace the -1 with some other value using `df.replace`.
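A rough sketch of that replace step, assuming pandas is available. The hand-built frame stands in for encoder output, and `UNKNOWN_CODE` and the column names are made up for the example.

```python
import pandas as pd

# Stand-in for encoder output: -1 marks a category not seen during fitting.
encoded = pd.DataFrame({"color": [1, 2, -1], "size": [3, -1, 1]})

# Assumed replacement code; it must not collide with a code the encoder already uses.
UNKNOWN_CODE = 0

# Only touch the encoded categorical columns, so a genuine -1 in a
# numeric feature elsewhere in the frame would be left alone.
cat_cols = ["color", "size"]
fixed = encoded.copy()
fixed[cat_cols] = fixed[cat_cols].replace(-1, UNKNOWN_CODE)
```

This operates on the already-encoded frame, so it is an extra step after the encoder rather than something the encoder does itself.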
But I'm open to discussing this further; if a lot of people find it super convenient, it could be worth it.
EDIT: I just realised the replace workaround won't work in pipelines (as you noted), and supplying a mapping indeed feels like overkill.