evalml
Smarter values of top_n for One Hot Encoder in AutoML
Currently, AutoMLSearch will only fit OneHotEncoders with top_n set to 10.
This can be problematic because a user can have data with more than 10 categories, e.g. 50 US states, and the 40 least frequent states in the data get lumped together.
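To make the effect concrete, here is a minimal pure-Python sketch (not evalml code) of what a fixed top_n of 10 does to a 50-category column. The helper `lump_rare_categories`, the `"<other>"` label, and the toy state codes are all hypothetical. Whether the real encoder emits an explicit "other" column or simply leaves those rows unencoded is not assumed here; the point is only that the 40 rarer categories become indistinguishable.

```python
from collections import Counter


def lump_rare_categories(values, top_n=10, other_label="<other>"):
    """Keep the top_n most frequent categories; map everything else to other_label."""
    counts = Counter(values)
    kept = {category for category, _ in counts.most_common(top_n)}
    return [v if v in kept else other_label for v in values]


# Toy column: 50 hypothetical state codes, where state i appears (50 - i) times,
# so every category has a distinct frequency and there are no ties.
states = [f"S{i:02d}" for i in range(50)]
column = [s for i, s in enumerate(states) for _ in range(50 - i)]

encoded = lump_rare_categories(column, top_n=10)
distinct_after = set(encoded)                # 10 kept categories plus the placeholder
lumped_states = set(column) - distinct_after  # categories that lost their identity
```

In this toy data, the 40 lumped categories account for 820 of the 1275 rows, so most of the column carries no state-level signal after encoding.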
We currently have two workarounds for this:
- Pass in a pipeline_parameters argument to AutoML where the value of top_n is set by the user. This may not be the best because this value of top_n would be applied for all categorical features.
- User can create a DAG with multiple OHE (one for each categorical feature) where top_n varies for each OHE.
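A rough sketch of the first workaround: pick a single top_n large enough for the widest categorical feature and put it in the parameters dictionary. The column data is made up, and the dictionary layout (component name mapped to its parameters, with the component named "One Hot Encoder") is an assumption about how pipeline_parameters is keyed; check it against the evalml version in use before relying on it.

```python
# Hypothetical categorical columns (name -> values); stand-ins for a real dataset.
categorical_columns = {
    "state": [f"S{i:02d}" for i in range(50)],
    "plan": ["free", "pro", "free", "team"],
}

# One global top_n, sized for the feature with the most distinct categories.
max_cardinality = max(len(set(values)) for values in categorical_columns.values())

# Assumed layout: component name -> parameter dict. This is the value that
# would be passed as the pipeline_parameters argument.
pipeline_parameters = {"One Hot Encoder": {"top_n": max_cardinality}}
```

This also shows the drawback noted above: "plan" has only 3 categories but gets the same top_n of 50 as "state".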
Ideally, AutoMLSearch would set a smart value of top_n for each categorical feature automatically that the user could then change depending on their domain knowledge.
This is closely related to #1728, but I feel it's a separate issue, as it relates specifically to top_n in the OHE.
Maybe it makes sense to mark this as blocked by #1728.
@freddyaboulton great point. I agree that a) we should support setting top_n and other OHE params per-feature, and that b) having a "smart" value for top_n could be beneficial.
I will file a) separately. For b): do you have an idea for a method or heuristic? Let's discuss.
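One candidate heuristic, offered only as a starting point for that discussion: for each categorical feature, choose the smallest top_n whose most frequent categories cover a target share of the rows, with an upper cap to bound the number of encoded columns. The function name `suggest_top_n`, the 95% coverage target, and the cap of 100 are all hypothetical choices, not anything evalml currently does.

```python
from collections import Counter


def suggest_top_n(values, coverage=0.95, max_top_n=100):
    """Smallest count of most-frequent categories covering `coverage` of rows, capped at max_top_n."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        # Stop once enough rows are covered, or once the cap is reached.
        if covered / total >= coverage or n >= max_top_n:
            return n
    return len(counts)


# Toy features: a uniform 50-category column and a heavily skewed 3-category one.
features = {
    "state": [f"S{i:02d}" for i in range(50)] * 2,
    "plan": ["free"] * 90 + ["pro"] * 8 + ["team"] * 2,
}
per_feature_top_n = {name: suggest_top_n(values) for name, values in features.items()}
```

On this toy data the uniform column keeps almost all of its categories (48 of 50), while the skewed column keeps only 2, which is the per-feature behavior a single global top_n cannot give.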