evalml
Handle different categories in training vs holdout data for Ordinal Encoder
If there are categories present in holdout data that weren't present in the training data, the OrdinalEncoder will not work unless `handle_unknown` and `unknown_value` are set correctly. This is problematic for the initial integration of the OrdinalEncoder into AutoMLSearch, as the default value for `handle_unknown` is `error`.
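To make the failure mode concrete, here is a minimal sketch using scikit-learn's `OrdinalEncoder` directly. It assumes the evalml component behaves the same way for these two parameters; the toy data and the choice of `-1` as the unknown value are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy data (assumption): "very high" appears only in the holdout set.
train = np.array([["low"], ["medium"], ["high"]], dtype=object)
holdout = np.array([["medium"], ["very high"]], dtype=object)
order = [["low", "medium", "high"]]

# Default behaviour: handle_unknown="error" raises on the unseen category.
strict = OrdinalEncoder(categories=order)
strict.fit(train)
try:
    strict.transform(holdout)
    strict_failed = False
except ValueError:
    strict_failed = True

# Setting both parameters lets transform succeed; unknowns become -1.
lenient = OrdinalEncoder(
    categories=order,
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
lenient.fit(train)
encoded = lenient.transform(holdout)
```

With these settings, "medium" keeps its position in the order and the unseen "very high" is mapped to the sentinel instead of raising.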
This can also be problematic for the Ordinal logical type, which sets its `order` according to the categories that are present. If we try to set an already-instantiated Ordinal logical type on holdout data with different categories, Woodwork may raise an error that the data contains values that are not present in the order values provided. We should investigate when we may trigger this Woodwork error. I've opened an issue in Woodwork to consider ways to handle this kind of thing (https://github.com/alteryx/woodwork/issues/1598).
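As a starting point for that investigation, a small pre-check could report which holdout values fall outside an existing order before the logical type is applied. This is a plain-Python sketch; `values_outside_order` is a hypothetical helper and not part of the Woodwork or evalml API.

```python
def values_outside_order(values, order):
    """Return the distinct values that are missing from `order`.

    Hypothetical helper: None entries are skipped, and the result keeps
    the order in which the unexpected values were first seen.
    """
    known = set(order)
    unexpected = []
    for value in values:
        if value is None or value in known or value in unexpected:
            continue
        unexpected.append(value)
    return unexpected


# Order derived from training data (assumed example values).
train_order = ["low", "medium", "high"]
holdout_values = ["medium", "very high", None, "very high", "tiny"]

missing = values_outside_order(holdout_values, train_order)
```

An empty result would suggest the existing order can be reused on the holdout data; a non-empty one identifies the values that would need handling first.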
We should look into how we can handle this. We have several options:
- Handle this as part of automl search in the OrdinalEncoder instantiation by setting the parameters such that we handle unknowns gracefully - I think this may make the most sense, and could allow users to have further control of how they would want to handle those unknown values.
- Wait to set the Encoder's categories until transform/allow updating the values at transform. I think waiting to set the categories at all until transform is probably putting too much logic into `transform`, and could also create the reverse problem of not having categories from the training data. More likely, we will want to consider allowing users to expand the categories if needed.
- Change the default value for `handle_unknown` to no longer error, maybe to set the values to be NaNs?
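The last two options can be sketched roughly as follows. This again uses scikit-learn's `OrdinalEncoder` as a stand-in for the evalml component, and `expand_categories` is a hypothetical helper. Appending new values to the end of the order is an arbitrary choice made for the sketch, since where an unseen category belongs in an ordinal ranking is ambiguous.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder


def expand_categories(known, observed):
    """Hypothetical helper: append unseen values to the end of `known`."""
    extra = [v for v in dict.fromkeys(observed) if v not in known]
    return list(known) + extra


train = np.array([["low"], ["medium"], ["high"]], dtype=object)
holdout = np.array([["medium"], ["very high"]], dtype=object)
order = ["low", "medium", "high"]

# Encode unknowns as NaN instead of raising.
nan_encoder = OrdinalEncoder(
    categories=[order],
    handle_unknown="use_encoded_value",
    unknown_value=np.nan,
)
nan_encoder.fit(train)
nan_encoded = nan_encoder.transform(holdout)

# Let the user expand the categories, then refit a new encoder.
expanded = expand_categories(order, holdout[:, 0])
expanded_encoder = OrdinalEncoder(categories=[expanded])
expanded_encoder.fit(train)
expanded_encoded = expanded_encoder.transform(holdout)
```

Encoding unknowns as NaN pushes the decision to a downstream imputer, while expanding the categories keeps every value encoded but requires someone to decide where the new value sits in the order.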