category_encoders
Support for Enum-Encoding as in H2o
Hello,
the H2O ML framework supports an enum-encoding scheme. It would be nice to have this for sklearn as well. As far as I know, no contributions have been made to add this for sklearn models.
Does anyone have thoughts on this?
Do you know of a paper or reference for enum encoding? A quick read of the H2O docs has me thinking it may be the same as either one-hot encoding or baseN encoding with base 1.
Hi, I don't know of a suitable reference/paper for this, but LightGBM has implemented this as well, and there is a GitHub issue where they discuss finding the optimal split for this kind of categorical encoding. Link
I guess this feature is already implemented in ordinal encoding.
Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in; in that case we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order and integers are selected at random.
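For illustration, here is a minimal sketch (my own example, not from the library docs) of both behaviours with OrdinalEncoder; the column and values are made up, and the exact `mapping` format may differ between versions:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# No mapping: integer codes are assigned with no meaningful order.
auto = ce.OrdinalEncoder(cols=["size"]).fit_transform(df)

# Explicit mapping: encode the known order of the levels.
manual = ce.OrdinalEncoder(
    cols=["size"],
    mapping=[{"col": "size", "mapping": {"small": 1, "medium": 2, "large": 3}}],
).fit_transform(df)

print(auto)
print(manual)
```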
Below is from H2O:
Enum is for a categorical column whose values have no ordinal meaning. For example, you could have a column like eye color: brown, green, blue. You would use enum encoding here because brown eyes are not necessarily greater than green eyes, they are just different.
Label encoding assumes some sort of ordinal nature in your categorical columns. For example, you might have a column like credit: good, bad, terrible. Here there is an inherent order (i.e. good is better than bad), so you would want to encode this categorical with label encoding so that the encoding considers some sense of order.
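To make the unordered ("enum") case concrete, here is a minimal sketch (my own example, not from H2O) using category_encoders' OneHotEncoder on the eye-color column:

```python
import pandas as pd
import category_encoders as ce

# Unordered ("enum") categorical: brown/green/blue carry no ranking,
# so one-hot encoding creates one indicator column per level instead
# of imposing an arbitrary numeric order.
df = pd.DataFrame({"eye_color": ["brown", "green", "blue", "brown"]})
encoded = ce.OneHotEncoder(cols=["eye_color"], use_cat_names=True).fit_transform(df)
print(encoded)
```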
I think @Chandrak1907 is right. Nevertheless, H2O has at least two potentially nice optional parameters:
enum_limited or EnumLimited: Automatically reduce categorical levels to the most prevalent ones during Aggregator training and only keep the T most frequent levels.
sort_by_response or SortByResponse: Reorders the levels by the mean response (for example, the level with lowest response -> 0, the level with second-lowest response -> 1, etc.). This is useful in GBM/DRF, for example, when you have more levels than nbins_cats, and where the top level splits now have a chance at separating the data with a split. Note that this requires a specified response column.
Both of them could be added to OrdinalEncoder relatively easily; a rough sketch of the idea is below. If someone wants to write a PR, please go ahead.
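Purely as a hypothetical sketch (none of this is an existing category_encoders API, and the helper names are placeholders), the two options could be emulated as a preprocessing step before OrdinalEncoder:

```python
import pandas as pd
import category_encoders as ce

def limit_levels(s: pd.Series, top: int, other: str = "_other_") -> pd.Series:
    """enum_limited-style: keep only the `top` most frequent levels, lump the rest."""
    keep = s.value_counts().nlargest(top).index
    return s.where(s.isin(keep), other)

def order_by_response(s: pd.Series, y: pd.Series) -> dict:
    """sort_by_response-style: level with the lowest mean response -> 0, next -> 1, ..."""
    means = y.groupby(s).mean().sort_values()
    return {level: rank for rank, level in enumerate(means.index)}

df = pd.DataFrame({"city": ["a", "b", "b", "c", "c", "c", "d"]})
y = pd.Series([0, 1, 1, 0, 1, 1, 0])

df["city"] = limit_levels(df["city"], top=3)
# The mapping format below follows recent category_encoders versions and may differ in older ones.
mapping = [{"col": "city", "mapping": order_by_response(df["city"], y)}]
encoded = ce.OrdinalEncoder(cols=["city"], mapping=mapping).fit_transform(df)
```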