
Handling categorical variables on new data

Open krolikowskib opened this issue 2 years ago • 4 comments

Hey, I have 2 questions regarding how FLAML handles categorical variables on new data, different from the initial training dataset (for example, during inference after model deployment).

  1. Does it handle new categories in categorical features (unseen during training)?
  2. SKLearn and XGBoost estimators use ordinal encodings of categorical features, but it seems the categorical codes are extracted at inference time (code). Doesn't that mean the encodings will differ when running on a different dataset, thus mixing up the categories passed to the model? If so, sklearn's OrdinalEncoder would be a better choice here (it persists the category-to-code mapping).
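To illustrate the concern in question 2, here is a minimal sketch (not FLAML code, just plain pandas) showing how `.cat.codes` depend on which categories happen to appear in the data at hand:

```python
import pandas as pd

# Training data: observed categories are ["a", "b", "c"]
train = pd.Series(["a", "b", "c"], dtype="category")
print(train.cat.codes.tolist())  # [0, 1, 2] -> "c" encodes as 2

# Inference data: "b" is absent, so the codes shift
test = pd.Series(["a", "c"], dtype="category")
print(test.cat.codes.tolist())   # [0, 1] -> "c" now encodes as 1
```

The same category value ("c") receives a different integer code at inference time, which is exactly the mixing of categories described above.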

krolikowskib avatar Jun 29 '23 08:06 krolikowskib

Hi, Thanks for your feedback.

  1. Does it handle new categories in categorical features (unseen during training)? No, it doesn't handle new categories. The XGBoost library assumes that category mappings are managed by the application in both the training and testing phases, and FLAML follows the same logic. If new categories appear at test time and no extra processing is done, the category-to-code mapping will differ from the one used during training.
  2. Doesn't it mean that the encodings will be different when running on a different dataset, thus mixing the categories passed to the model? Yes, when running on a different dataset with a different mix of categories, the encoding will change. OrdinalEncoder may be a better choice here. Thanks for your suggestion.

Reference: https://stackoverflow.com/questions/75698242/when-using-categorical-data-in-xgboost-how-do-i-maintain-the-implied-encoding
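A minimal sketch of the OrdinalEncoder approach suggested above (an assumption about how one might persist the mapping, not current FLAML behavior): the fitted encoder stores the category-to-code mapping, so the same categories encode identically on new data, and unseen categories can be mapped to a sentinel value.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Fit on the training data; the encoder remembers the mapping
# (categories are sorted, so blue=0, green=1, red=2).
train = pd.DataFrame({"color": ["red", "green", "blue"]})
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train)

# On new data, known categories keep their training-time codes,
# and the unseen category "purple" maps to the sentinel -1.
test = pd.DataFrame({"color": ["blue", "purple"]})
print(enc.transform(test))  # [[ 0.] [-1.]]
```

Persisting the fitted encoder alongside the model (e.g. with pickle) would keep the encoding stable across deployments.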

skzhang1 avatar Jul 02 '23 13:07 skzhang1

Thanks for your answer, @skzhang1.

I think this may be misleading for people who want to reuse the model on a different dataset, such as in a production setting. Even if the categories are the same, the current implementation doesn't guarantee they will be encoded the same way.

Did you consider making that more explicit in the documentation, or providing a way to easily reuse the best selected model without having to worry about categorical variables?

krolikowskib avatar Jul 02 '23 14:07 krolikowskib

Thanks for your suggestion! We will make it clear in the doc. @krolikowskib

skzhang1 avatar Jul 05 '23 02:07 skzhang1