ibis-ml
feat(steps): handle unknown category for all encoders
Unknown categories are ignored by the current encoder implementations. We should consider adding an option to handle them in the future, but it's not a high priority at the moment.
Opening this issue to record it for future consideration.
The current implementations:
- CategoricalEncode will convert an unknown category to None
- OneHotEncode will convert all encoded cols to 0 (see the following example)
- CountEncode will convert an unknown category to 0
For example:
>>> import ibis
>>> import ibisml as ml
>>> import pandas as pd
>>>
>>> t_train = ibis.memtable(
... {
... "time": [
... pd.Timestamp("2016-05-25 13:30:00.023"),
... pd.Timestamp("2016-05-25 13:30:00.023"),
... pd.Timestamp("2016-05-25 13:30:00.030"),
... pd.Timestamp("2016-05-25 13:30:00.041"),
... pd.Timestamp("2016-05-25 13:30:00.048"),
... pd.Timestamp("2016-05-25 13:30:00.049"),
... pd.Timestamp("2016-05-25 13:30:00.072"),
... pd.Timestamp("2016-05-25 13:30:00.075"),
... ],
... "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"],
... }
... )
>>> t_test = ibis.memtable(
... {
... "time": [
... pd.Timestamp("2016-05-25 13:30:00.023"),
... pd.Timestamp("2016-05-25 13:30:00.038"),
... pd.Timestamp("2016-05-25 13:30:00.048"),
... pd.Timestamp("2016-05-25 13:30:00.049"),
... pd.Timestamp("2016-05-25 13:30:00.050"),
... pd.Timestamp("2016-05-25 13:30:00.051"),
... ],
... "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None],
... }
... )
>>> step = ml.OneHotEncode("ticker")
>>> step.fit_table(t_train, ml.core.Metadata())
>>> res = step.transform_table(t_test)
>>> res
AMZN in the 5th row is unknown; it will be translated to all 0s:
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ time                    ┃ ticker_AAPL ┃ ticker_GOOG ┃ ticker_MSFT ┃ ticker_None ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ timestamp               │ int8        │ int8        │ int8        │ int8        │
├─────────────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤
│ 2016-05-25 13:30:00.023 │           0 │           0 │           1 │           0 │
│ 2016-05-25 13:30:00.038 │           0 │           0 │           1 │           0 │
│ 2016-05-25 13:30:00.048 │           0 │           1 │           0 │           0 │
│ 2016-05-25 13:30:00.049 │           0 │           1 │           0 │           0 │
│ 2016-05-25 13:30:00.050 │           0 │           0 │           0 │           0 │
│ 2016-05-25 13:30:00.051 │           0 │           0 │           0 │           1 │
└─────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
> CategoricalEncode will convert unknown category to None

I haven't looked into whether this is correct or not, since the step doesn't have any tests; we should definitely add one, and then we can identify the correct behavior. 😅
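As a reference point for such a test, the behavior described in the issue can be modeled in plain Python. This is only a sketch of the reported semantics, not ibis-ml's implementation: `fit_codes` and `encode` are hypothetical helpers, and both the sorted code assignment and the treatment of nulls are assumptions.

```python
def fit_codes(values):
    """Assign an integer code to each non-null training category.

    Assumption: codes follow sorted category order; the real step may differ.
    """
    categories = sorted({v for v in values if v is not None})
    return {cat: code for code, cat in enumerate(categories)}


def encode(values, codes):
    """Encode values; anything not seen during fit maps to None."""
    return [codes.get(v) for v in values]


train = ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"]
test = ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None]

codes = fit_codes(train)
encoded = encode(test, codes)
print(encoded)  # AMZN was never seen during fit, so its slot is None
```

A real test would assert the same shape of result against the output of the actual step.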
> OneHotEncode will convert all encoded cols to 0, see following example.

Having a separate category for unknown (i.e. the rest of the encoding column values are all 0) could be a nice option to give the user more flexibility, but it may not matter for a lot of model types (e.g. GBDT).
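To make the two options concrete, here is a plain-Python sketch that contrasts the current all-zeros behavior with a dedicated unknown column. It does not use ibis-ml; `fit_categories`, `one_hot`, the `unknown_column` flag, and the `ticker_unknown` column name are all hypothetical and only illustrate what such an option might produce.

```python
def fit_categories(values):
    """Collect distinct training values; None is kept as its own category,
    mirroring the ticker_None column in the example output."""
    return sorted(set(values), key=str)


def one_hot(values, categories, unknown_column=False):
    """One-hot encode values against the fitted categories.

    With unknown_column=False an unseen value yields all zeros (the behavior
    reported in this issue). With unknown_column=True a hypothetical extra
    column flags unseen values instead.
    """
    rows = []
    for v in values:
        row = {f"ticker_{c}": int(v == c) for c in categories}
        if unknown_column:
            row["ticker_unknown"] = int(v not in categories)
        rows.append(row)
    return rows


train = ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"]
test = ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None]

categories = fit_categories(train)
plain = one_hot(test, categories)
flagged = one_hot(test, categories, unknown_column=True)

print(plain[4])    # AMZN: every encoded column is 0
print(flagged[4])  # AMZN: only the hypothetical ticker_unknown column is 1
```

With the extra column, a linear model could distinguish "unseen ticker" from "no information", which the all-zeros row cannot express.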
> CountEncode will convert unknown category to 0

This is intentional, as the count should be 0 for something that has not been seen.
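The reasoning is easy to see in a minimal plain-Python model of count encoding. This is a sketch using `collections.Counter`, not the ibis-ml implementation, and the variable names are illustrative only.

```python
from collections import Counter

train = ["GOOG", "MSFT", "MSFT", "MSFT", "AAPL", "GOOG", "MSFT"]
test = ["MSFT", "GOOG", "AMZN"]

# "Fit": count how often each category appears in the training data.
counts = Counter(train)

# "Transform": replace each value with its training count.
# Counter returns 0 for keys it has never seen, so AMZN naturally encodes to 0.
encoded = [counts[v] for v in test]
print(encoded)
```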
To me, it seems like the immediate action item is to add a test for CategoricalEncode to make sure its functionality is correct, and to (at lower priority) make the OneHotEncode unknown category handling a bit more flexible.
@jitingxu1 closing this; if you want to make the OneHotEncode unknown category handling a bit more flexible, feel free to create a new issue, but it doesn't seem to be a priority at this time.