category_encoders

[ENH] Min_leaf_size for OrdinalEncoder (or arbitrary encoders)

Open No-Stream opened this issue 3 years ago • 0 comments

Hi. I would like to propose an enhancement for OrdinalEncoder, although it may also be relevant for other encoders and could be implemented in a generic fashion. In particular, it could be really useful for OneHotEncoder, where fewer categories mean fewer output columns, which reduces memory usage and training time.

Before encoding, count the number of rows in each category. Categories with fewer than min_leaf_size rows are encoded as "other" (or similar), i.e. assigned to a single group. I can imagine a few edge cases to settle: should categories not encountered in training be assigned to "other" or to a separate category? How do we "reserve" a category for "other" if it is not encountered in training? Etc.
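To make the idea concrete, here is a minimal sketch in plain pandas. It is not part of category_encoders: the class name `RareCategoryGrouper` and the `other_label` parameter are hypothetical, and it assumes one possible answer to the edge cases above, namely that "other" is always reserved and that unseen (and missing) values are also mapped to it.

```python
import pandas as pd


class RareCategoryGrouper:
    """Hypothetical helper (not a category_encoders class) that groups rare categories."""

    def __init__(self, min_leaf_size=2, other_label="other"):
        self.min_leaf_size = min_leaf_size
        self.other_label = other_label  # assumed name for the reserved bucket

    def fit(self, series):
        # Count rows per category; value_counts ignores missing values by default.
        counts = series.value_counts()
        # Categories with at least min_leaf_size rows keep their own label.
        self.kept_ = set(counts[counts >= self.min_leaf_size].index)
        return self

    def transform(self, series):
        # Assumption: rare, unseen, and missing values all go to the "other" bucket.
        return series.where(series.isin(self.kept_), self.other_label)


train = pd.Series(["a", "a", "a", "b", "b", "c"])
grouper = RareCategoryGrouper(min_leaf_size=2).fit(train)

grouped_train = grouper.transform(train)
grouped_test = grouper.transform(pd.Series(["a", "c", "z"]))
```

The grouped column could then be passed to OrdinalEncoder or OneHotEncoder as usual. Because "other" exists independently of the training data in this sketch, a downstream encoder would still need to see it at fit time to reserve a code for it.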

Anecdotally, I often find that this improves performance and reduces overfitting, and I would love to not have to copy-paste my code snippet between notebooks 😉. Happy to contribute an implementation if you'd like; it should be straightforward.

No-Stream, Mar 29 '21 16:03