evalml
Smarter values of top_n for One Hot Encoder in AutoML
Currently, `AutoMLSearch` will only fit `OneHotEncoder`s with `top_n` set to 10.
This can be problematic because a user can have data with more than 10 categories, e.g. 50 US states, and the 40 least frequent states in the data get lumped together.
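To make the effect concrete, here is a small standalone sketch in plain Python (it does not call evalml). The column contents and the helper name `top_n_coverage` are made up for illustration; it only counts how many rows fall outside the 10 most frequent categories.

```python
from collections import Counter


def top_n_coverage(values, top_n=10):
    """Return the kept categories and the fraction of rows outside the top_n."""
    counts = Counter(values)
    kept = [category for category, _ in counts.most_common(top_n)]
    kept_rows = sum(counts[category] for category in kept)
    return kept, 1 - kept_rows / len(values)


# Hypothetical column: 50 categories, each appearing 20 times.
states = [f"state_{i:02d}" for i in range(50) for _ in range(20)]

kept, lost = top_n_coverage(states, top_n=10)
print(len(kept), round(lost, 2))
```

With evenly distributed categories like these, most rows end up with no category of their own; real data is usually more skewed, so the loss would typically be smaller.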
We currently have two workarounds for this:
- Pass in a `pipeline_parameters` argument to AutoML where the value of `top_n` is set by the user. This may not be the best because this value of `top_n` would be applied for all categorical features.
- User can create a DAG with multiple OHE (one for each categorical feature) where `top_n` varies for each OHE.
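A rough sketch of the first workaround and its limitation. This assumes the encoder is addressed by the component name `"One Hot Encoder"` in the parameters dict; the column names and cardinalities are hypothetical, and the snippet only builds the dict rather than running a search.

```python
# Assumed shape of the parameters dict: component name -> parameter overrides.
pipeline_parameters = {"One Hot Encoder": {"top_n": 50}}

# Hypothetical categorical columns and their number of distinct values.
cardinalities = {"state": 50, "zip_code": 5000, "gender": 3}

# The single top_n applies to every categorical column alike, so a value
# chosen for "state" is also used for "zip_code".
top_n = pipeline_parameters["One Hot Encoder"]["top_n"]
encoded_categories = {col: min(k, top_n) for col, k in cardinalities.items()}
print(encoded_categories)

# The dict would then be handed to the search, e.g.
# AutoMLSearch(..., pipeline_parameters=pipeline_parameters)
```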
Ideally, `AutoMLSearch` would set a smart value of `top_n` for each categorical feature automatically that the user could then change depending on their domain knowledge.
This is closely related to #1728, but I feel it's a separate issue as it relates specifically to `top_n` in the OHE.
Maybe it makes sense to mark this as blocked by #1728.
@freddyaboulton great point. I agree that a) we should support setting `top_n` and other OHE params per-feature, and that b) having a "smart" value for `top_n` could be beneficial.
I will file a) separately. For b): do you have an idea for a method or heuristic? Let's discuss.
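As a starting point for that discussion, one possible heuristic (not something evalml implements, as far as this thread shows) is to pick, per feature, the smallest number of categories that covers a target share of rows, with a hard cap. The function name `suggest_top_n`, the 95% coverage target, and the cap of 100 are all assumptions chosen for illustration.

```python
from collections import Counter


def suggest_top_n(values, coverage=0.95, max_top_n=100):
    """Smallest number of most-frequent categories covering `coverage` of rows.

    The result is capped at `max_top_n` so that very high-cardinality columns
    do not explode the encoded feature count.
    """
    counts = Counter(values)
    total = sum(counts.values())
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= coverage or n >= max_top_n:
            return n
    return len(counts)


# Hypothetical columns with different shapes.
uniform_states = [f"state_{i:02d}" for i in range(50) for _ in range(20)]
skewed = ["a"] * 90 + ["b"] * 6 + ["c"] * 4
ids = list(range(1000))

suggestions = {
    "state": suggest_top_n(uniform_states),
    "skewed": suggest_top_n(skewed),
    "id": suggest_top_n(ids),
}
print(suggestions)
```

A coverage rule like this adapts to skew: a column dominated by a few values gets a small `top_n`, while an evenly spread column keeps nearly all of its categories. The user could still override the suggestion per feature.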