[ENH] handling of categorical features and sensible defaults
The v4 hyperactive wrappers of GFO have a feature where they encode categorical features as consecutive integers - this kind of encoding is a desirable feature, potentially as a default.
Related issues:
- There is also a potentially undesirable secondary effect, namely the encoding of numerical values as integers as well, which may or may not be desired by the user depending on circumstance.
- as an alternative to consecutive encoding - note that pure categoricals in general do not have an order - one could think of one-hot encoding
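
To make the difference between the two encodings concrete, here is a minimal plain-Python sketch. The `kernels` dimension and both helper functions are hypothetical illustrations, not part of hyperactive or GFO.

```python
# Hypothetical categorical search-space dimension (illustration only).
kernels = ["linear", "rbf", "poly"]


def ordinal_encode(values):
    """Map each category to a consecutive integer, in list order."""
    return {value: index for index, value in enumerate(values)}


def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector containing a single 1."""
    n = len(values)
    return {
        value: [1 if position == index else 0 for position in range(n)]
        for index, value in enumerate(values)
    }


ordinal = ordinal_encode(kernels)
one_hot = one_hot_encode(kernels)

print(ordinal)         # {'linear': 0, 'rbf': 1, 'poly': 2}
print(one_hot["rbf"])  # [0, 1, 0]
```

The consecutive encoding keeps one dimension per parameter but imposes an artificial order ("linear" < "rbf" < "poly"), so "linear" looks closer to "rbf" than to "poly". One-hot encoding treats all categories as equidistant, at the cost of one extra dimension per category.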
Some designs I can think of:
1. the current hyperactive v4 design, which does the consecutive integer encoding by default for all categoricals and numericals
2. encoding only categoricals, leaving numericals as-is
3. having tags for estimators on whether they can handle categoricals, e.g., capability:categorical. Estimators that cannot handle categoricals - such as native GFO - return an error if categoricals are passed. They can be wrapped in meta-estimators such as CategoricalEncoder.
4. similar to 3, except that estimators without the capability encode automatically like hyperactive v4.
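
A rough sketch of how designs 2 to 4 could behave, in plain Python. All function names, the `auto_encode` flag, and the rule for detecting categoricals are assumptions made for illustration; only the tag name capability:categorical comes from the list above, and none of this is the actual hyperactive or GFO API.

```python
from numbers import Number


def is_categorical(values):
    """Assumed rule: a dimension is categorical if any value is non-numeric.

    Booleans are treated as categorical here, which is a design assumption.
    """
    return any(isinstance(v, bool) or not isinstance(v, Number) for v in values)


def encode_search_space(search_space):
    """Design 2: integer-encode categorical dimensions, leave numericals as-is.

    Returns the encoded space and a dict mapping each encoded dimension
    name to its original category list, for decoding later.
    """
    encoded, decoders = {}, {}
    for name, values in search_space.items():
        values = list(values)
        if is_categorical(values):
            encoded[name] = list(range(len(values)))
            decoders[name] = values
        else:
            encoded[name] = values
    return encoded, decoders


def decode_params(params, decoders):
    """Translate integer codes back to the original categories."""
    return {
        name: decoders[name][value] if name in decoders else value
        for name, value in params.items()
    }


def prepare_search_space(search_space, tags, auto_encode=False):
    """Designs 3 and 4: consult the capability tag before optimizing.

    auto_encode=False raises on categoricals (design 3);
    auto_encode=True encodes them automatically (design 4).
    """
    if tags.get("capability:categorical", False):
        return search_space, {}
    has_categorical = any(is_categorical(list(v)) for v in search_space.values())
    if has_categorical and not auto_encode:
        raise ValueError(
            "This optimizer cannot handle categorical dimensions; "
            "wrap it in an encoding meta-estimator."
        )
    return encode_search_space(search_space)


space = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
encoded, decoders = encode_search_space(space)
print(encoded)  # {'kernel': [0, 1], 'C': [0.1, 1, 10]}
print(decode_params({"kernel": 1, "C": 10}, decoders))  # {'kernel': 'rbf', 'C': 10}
```

In this sketch, design 1 would correspond to encoding every dimension regardless of type, and the meta-estimator route from design 3 would amount to calling `encode_search_space` before the wrapped optimizer runs and `decode_params` on its results.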
Hi @fkiraly 👋,
Thanks for the detailed context — this proposal makes sense and seems like an important improvement for usability and sensible defaults.
Before I start contributing, could you please clarify if there are any pending subtasks or a preferred direction among the listed design options?
I'd be happy to pick up part of the implementation or help with drafting a prototype if needed.
Thanks!
Hello @fkiraly and @pankajbaid567, I have been working on a PR that follows the second design approach. I am mentioning it here to avoid confusion or duplicate work.