Smarter values of top_n for One Hot Encoder in AutoML

Open freddyaboulton opened this issue 4 years ago • 1 comments

Currently, AutoMLSearch will only fit OneHotEncoders with top_n set to 10.

This is a problem when a categorical feature has more than 10 categories. With 50 US states, for example, the 40 least frequent states in the data get lumped together.
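To make the effect concrete, here is a minimal pure-Python sketch of what a fixed top_n of 10 does to a 50-category column. It does not use evalml; `keep_top_n` and the synthetic `states` data are hypothetical names made up for illustration, and the real encoder may handle ties and the leftover categories differently.

```python
from collections import Counter


def keep_top_n(values, top_n=10):
    """Split a column's categories into the top_n most frequent and the rest."""
    counts = Counter(values)
    kept = [category for category, _ in counts.most_common(top_n)]
    lumped = [category for category in counts if category not in kept]
    return kept, lumped


# Hypothetical column: 50 distinct "states", where state_i appears i + 1 times.
states = [f"state_{i:02d}" for i in range(50) for _ in range(i + 1)]

kept, lumped = keep_top_n(states, top_n=10)
print(len(kept), len(lumped))  # 10 categories kept, 40 lumped together
```

With top_n fixed at 10, four fifths of the distinct categories lose their own column regardless of how much signal they carry.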

We currently have two workarounds for this:

  1. Pass a pipeline_parameters argument to AutoML with top_n set by the user. This is not ideal because that value of top_n is applied to all categorical features.
  2. Create a DAG with multiple OHEs (one per categorical feature), where top_n varies for each OHE.
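As a rough sketch of the first workaround, the parameters could look like the dictionary below. The `"One Hot Encoder"` key and the commented-out `AutoMLSearch` call shape are assumptions based on the wording of this issue, not something confirmed here, so check them against the evalml version in use.

```python
# Assumption: parameters are keyed by component name, and the encoder's
# component name is "One Hot Encoder". The value 50 is only an example.
pipeline_parameters = {"One Hot Encoder": {"top_n": 50}}

# Assumed call shape (not executed here; X and y stand for the user's data):
# automl = AutoMLSearch(X_train=X, y_train=y, problem_type="binary",
#                       pipeline_parameters=pipeline_parameters)

# The limitation described above: one value is shared by every categorical feature.
top_n_for_all_features = pipeline_parameters["One Hot Encoder"]["top_n"]
print(top_n_for_all_features)
```

There is no slot in this dictionary for a per-feature value, which is why the second workaround needs one OHE per categorical feature.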

Ideally, AutoMLSearch would set a smart value of top_n for each categorical feature automatically that the user could then change depending on their domain knowledge.
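One possible shape for such a heuristic, offered purely as a sketch and not as anything evalml implements: pick the smallest top_n whose categories cover a target share of the rows, with an upper cap. The function name `suggest_top_n`, the 95% coverage target, and the cap of 100 are all hypothetical choices.

```python
from collections import Counter


def suggest_top_n(values, coverage=0.95, max_top_n=100):
    """Smallest number of most-frequent categories covering `coverage` of rows.

    Hypothetical heuristic for illustration only; capped at max_top_n.
    """
    counts = Counter(values)
    total = sum(counts.values())
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= coverage * total or n >= max_top_n:
            return n
    return len(counts)  # only reached for an empty column, where this is 0


# Hypothetical columns with very different cardinalities.
columns = {
    "state": [f"state_{i:02d}" for i in range(50) for _ in range(i + 1)],
    "color": ["red"] * 90 + ["blue"] * 8 + ["green"] * 2,
}

suggested = {name: suggest_top_n(values) for name, values in columns.items()}
print(suggested)  # {'state': 40, 'color': 2}
```

A per-feature dictionary like `suggested` would still leave room for the user to override individual values with domain knowledge.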

This is closely related to #1728, but I feel it's a separate issue since it relates specifically to top_n in the OHE.

Maybe it makes sense to mark this as blocked by #1728.

freddyaboulton, Feb 25 '21 16:02

@freddyaboulton great point. I agree that a) we should support setting top_n and other OHE params per-feature, and that b) having a "smart" value for top_n could be beneficial.

I will file a) separately. For b): do you have an idea for a method or heuristic? Let's discuss.

dsherry, Feb 25 '21 18:02