ibis-ml feat(steps): handle unknown category for all encoders

Unknown categories are currently ignored by the encoder implementations. While we should consider adding an option to handle them in the future, it's not a high priority at the moment.

Opening this issue to record it for future consideration.

The current implementations:

CategoricalEncode will convert unknown category to None
OneHotEncode will convert all encoded cols to 0, see following example.
CountEncode will convert unknown category to 0

For example:

>>> import ibis
>>> import ibisml as ml
>>> import pandas as pd
>>>
>>> t_train = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.030"),
...                 pd.Timestamp("2016-05-25 13:30:00.041"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.072"),
...                 pd.Timestamp("2016-05-25 13:30:00.075"),
...             ],
...             "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"],
...         }
...     )
>>> t_test = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.038"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.050"),
...                 pd.Timestamp("2016-05-25 13:30:00.051"),
...             ],
...             "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None],
...         }
...     )
>>> step = ml.OneHotEncode("ticker")
>>> step.fit_table(t_train, ml.core.Metadata())
>>> res = step.transform_table(t_test)
>>> res

AMZN in the 5th row is unknown, so it is translated to all 0s:

┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ time                    ┃ ticker_AAPL ┃ ticker_GOOG ┃ ticker_MSFT ┃ ticker_None ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ timestamp               │ int8        │ int8        │ int8        │ int8        │
├─────────────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤
│ 2016-05-25 13:30:00.023 │           0 │           0 │           1 │           0 │
│ 2016-05-25 13:30:00.038 │           0 │           0 │           1 │           0 │
│ 2016-05-25 13:30:00.048 │           0 │           1 │           0 │           0 │
│ 2016-05-25 13:30:00.049 │           0 │           1 │           0 │           0 │
│ 2016-05-25 13:30:00.050 │           0 │           0 │           0 │           0 │
│ 2016-05-25 13:30:00.051 │           0 │           0 │           0 │           1 │
└─────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
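The three behaviors listed above can be mimicked in plain Python to make the unknown-category handling concrete. This is only a sketch, not ibis-ml's implementation: the helper names `ordinal_encode`, `one_hot_encode`, and `count_encode` are hypothetical, and treating None as its own category in the ordinal case is an assumption made for illustration.

```python
from collections import Counter

train = ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"]

# Categories seen during fitting; None is kept last, matching the column
# order of the one-hot output above.
categories = sorted({c for c in train if c is not None}) + [None]
counts = Counter(train)


def ordinal_encode(value):
    """Hypothetical stand-in: index of a seen category, None if unseen."""
    return categories.index(value) if value in categories else None


def one_hot_encode(value):
    """Hypothetical stand-in: one indicator per seen category; all 0s if unseen."""
    return [int(value == c) for c in categories]


def count_encode(value):
    """Hypothetical stand-in: training frequency; 0 if unseen."""
    return counts.get(value, 0)


for ticker in ["MSFT", "AMZN", None]:
    print(ticker, ordinal_encode(ticker), one_hot_encode(ticker), count_encode(ticker))
```

Here "AMZN" never appears in the training data, so it gets None, an all-zero row, and a count of 0 respectively.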

Apr 19 '24 20:04 jitingxu1

> CategoricalEncode will convert unknown category to None

I haven't looked into whether this is correct or not, since the step doesn't have any tests; we should definitely add one, and then we can identify the correct behavior. 😅

> OneHotEncode will convert all encoded cols to 0, see following example.

Having a separate category for unknown values (i.e. a dedicated indicator column, with the rest of the encoded columns all 0) could be a nice option that gives the user more flexibility, but it may not matter for many model types (e.g. GBDT).
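As a rough illustration of what such an option could produce, the sketch below appends one extra indicator that is set only for unseen values. The function name `one_hot_with_unknown` and the position of the extra column are assumptions for illustration; nothing like this exists in ibis-ml as far as this thread states.

```python
def one_hot_with_unknown(value, categories):
    """One-hot encode `value`, appending a final indicator for unseen values.

    Hypothetical helper, not part of ibis-ml.
    """
    encoded = [int(value == c) for c in categories]
    # The extra trailing column is 1 only when the value was never seen.
    encoded.append(int(value not in categories))
    return encoded


# Categories as fitted in the example above (None is its own category).
seen = ["AAPL", "GOOG", "MSFT", None]

print(one_hot_with_unknown("MSFT", seen))
print(one_hot_with_unknown("AMZN", seen))
```

With this layout an unknown ticker such as "AMZN" is distinguishable from a row where every known indicator simply happens to be 0.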

> CountEncode will convert unknown category to 0

This is intentional, as the count should be 0 for something that has not been seen.

To me, it seems like the immediate action item is to add a test for CategoricalEncode to make sure its functionality is correct, and (at lower priority) to make the OneHotEncode unknown category handling a bit more flexible.

May 07 '24 18:05 deepyaman

@jitingxu1 closing this; if you want to make OneHotEncode unknown category handling a bit more flexible, feel free to create a new issue, but it doesn't seem to be a priority at this time.

Jul 01 '24 19:07 deepyaman
