category_encoders
[ENH] Min_leaf_size for OrdinalEncoder (or arbitrary encoders)
Hi. I would like to propose an enhancement for OrdinalEncoder, although it may also be relevant for other encoders and could be implemented in a generic fashion. In particular, it could be really useful for OneHotEncoder, to reduce memory usage/improve training time.
Before encoding categories, find the size of each category. For categories smaller than min_leaf_size, encode as "other" or similar, i.e. assign them all to a single group. I can imagine a few edge cases, for example: what about categories not encountered in training? Do those get assigned to "other" or to a separate category? How do we "reserve" a category for "other" if it is not encountered in training? Etc.
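To make the proposal concrete, here is a rough sketch of the grouping step as a pandas-only preprocessing helper. The class name `RareCategoryGrouper` and the `other_label` parameter are made up for illustration and are not part of category_encoders. It also picks one possible answer to the edge cases above: unseen categories are sent to the same "other" group as rare ones, and "other" is always available at transform time even if no rare category showed up in training.

```python
import pandas as pd


class RareCategoryGrouper:
    """Hypothetical helper (not part of category_encoders)."""

    def __init__(self, min_leaf_size=10, other_label="other"):
        self.min_leaf_size = min_leaf_size
        self.other_label = other_label

    def fit(self, series):
        # Count rows per category (missing values are not counted).
        counts = series.value_counts()
        # Categories with at least min_leaf_size rows keep their own label.
        self.kept_ = set(counts[counts >= self.min_leaf_size].index)
        return self

    def transform(self, series):
        # Anything not kept (rare, unseen, or missing) becomes other_label.
        # This is one design choice among several discussed above.
        return series.where(series.isin(self.kept_), self.other_label)


train = pd.Series(["a"] * 5 + ["b"] * 4 + ["c"] + ["d"])
grouper = RareCategoryGrouper(min_leaf_size=3).fit(train)

grouped_train = grouper.transform(train)
# "z" was never seen in training, "c" was seen but is rare.
grouped_test = grouper.transform(pd.Series(["a", "c", "z"]))
```

The grouped column could then be passed to OrdinalEncoder or OneHotEncoder as usual. A built-in version would presumably do the counting inside the encoder's own fit instead.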
Anecdotally, I often find that this improves performance and reduces overfitting, and I would love to not have to copy-paste my code snippet between notebooks 😉. Happy to contribute an implementation if you'd like; it should be straightforward.