ibis-ml
feat(steps): handle unknown category for all encoders
Unknown categories are ignored by the current encoder implementations. We should consider adding an option to handle them in the future, but it is not a high priority at the moment.
Opening this issue to record it for future consideration.
The current implementations:
CategoricalEncode will convert unknown category to None
OneHotEncode will convert all encoded cols to 0, see following example.
CountEncode will convert unknown category to 0
For example:
>>> import ibis
>>> import ibisml as ml
>>> import pandas as pd
>>>
>>> t_train = ibis.memtable(
... {
... "time": [
... pd.Timestamp("2016-05-25 13:30:00.023"),
... pd.Timestamp("2016-05-25 13:30:00.023"),
... pd.Timestamp("2016-05-25 13:30:00.030"),
... pd.Timestamp("2016-05-25 13:30:00.041"),
... pd.Timestamp("2016-05-25 13:30:00.048"),
... pd.Timestamp("2016-05-25 13:30:00.049"),
... pd.Timestamp("2016-05-25 13:30:00.072"),
... pd.Timestamp("2016-05-25 13:30:00.075"),
... ],
... "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"],
... }
... )
>>> t_test = ibis.memtable(
... {
... "time": [
... pd.Timestamp("2016-05-25 13:30:00.023"),
... pd.Timestamp("2016-05-25 13:30:00.038"),
... pd.Timestamp("2016-05-25 13:30:00.048"),
... pd.Timestamp("2016-05-25 13:30:00.049"),
... pd.Timestamp("2016-05-25 13:30:00.050"),
... pd.Timestamp("2016-05-25 13:30:00.051"),
... ],
... "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None],
... }
... )
>>> step = ml.OneHotEncode("ticker")
>>> step.fit_table(t_train, ml.core.Metadata())
>>> res = step.transform_table(t_test)
>>> res
AMZN in the 5th row is unknown, so it is encoded as all 0s:
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ time ┃ ticker_AAPL ┃ ticker_GOOG ┃ ticker_MSFT ┃ ticker_None ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ timestamp │ int8 │ int8 │ int8 │ int8 │
├─────────────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤
│ 2016-05-25 13:30:00.023 │ 0 │ 0 │ 1 │ 0 │
│ 2016-05-25 13:30:00.038 │ 0 │ 0 │ 1 │ 0 │
│ 2016-05-25 13:30:00.048 │ 0 │ 1 │ 0 │ 0 │
│ 2016-05-25 13:30:00.049 │ 0 │ 1 │ 0 │ 0 │
│ 2016-05-25 13:30:00.050 │ 0 │ 0 │ 0 │ 0 │
│ 2016-05-25 13:30:00.051 │ 0 │ 0 │ 0 │ 1 │
└─────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
CategoricalEncode will convert unknown category to None
I haven't looked into whether this is correct or not, since the step doesn't have any tests; we should definitely add one, and then can identify the correct behaviors. 😅
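As a starting point for such a test, here is a minimal plain-Python reference model of the behavior described in the issue (unknown category becomes None). It is only a sketch: `fit_categories` and `encode` are hypothetical helpers, not ibis-ml functions, and assigning codes in sorted order is an assumption that may not match what CategoricalEncode actually does.

```python
# Plain-Python reference model; NOT the ibis-ml implementation.
def fit_categories(values):
    """Map each non-null category seen in training to an integer code.

    Assumption: codes are assigned in sorted order of the category names.
    """
    categories = sorted({v for v in values if v is not None})
    return {cat: code for code, cat in enumerate(categories)}


def encode(values, mapping):
    """Encode values; anything not present in the mapping becomes None."""
    return [mapping.get(v) for v in values]


train = ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"]
test = ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None]

mapping = fit_categories(train)
encoded = encode(test, mapping)
print(encoded)
```

Note that in this model a null input and an unknown category both come out as None, so they cannot be told apart after encoding. Whether that is the desired behavior is one of the things a real test would need to pin down.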
OneHotEncode will convert all encoded cols to 0, see following example.
Having a separate category for unknown values (i.e. the rest of the encoded columns are all 0) could be a nice option that gives the user more flexibility, but it may not matter for a lot of model types (e.g. GBDT).
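To make the two options concrete, here is a rough plain-Python sketch using the ticker data from the example. The `one_hot` helper, the `unknown_column` flag, and the `ticker_unknown` column name are all hypothetical and only illustrate the idea; none of them exist in ibis-ml.

```python
# Plain-Python illustration; NOT the ibis-ml implementation.
def fit_categories(values):
    """Collect the distinct training values, keeping None as its own category."""
    # Sort strings alphabetically and put None last so the order is stable.
    return sorted(set(values), key=lambda v: (v is None, "" if v is None else v))


def one_hot(values, categories, unknown_column=False):
    """One-hot encode values against the fitted categories.

    unknown_column is a hypothetical option: when True, an extra column
    flags values that were not seen during fit.
    """
    rows = []
    for v in values:
        row = {f"ticker_{c}": int(v == c) for c in categories}
        if unknown_column:
            row["ticker_unknown"] = int(v not in categories)
        rows.append(row)
    return rows


train = ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"]
test = ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None]

categories = fit_categories(train)
current = one_hot(test, categories)                        # AMZN row is all 0s
flexible = one_hot(test, categories, unknown_column=True)  # AMZN row is flagged

print(current[4])
print(flexible[4])
```

With the flag off, the AMZN row is indistinguishable from "no category at all"; with it on, the row carries an explicit marker that a model can use.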
CountEncode will convert unknown category to 0
This is intentional, as the count should be 0 for something that has not been seen.
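The reasoning can be shown with a small plain-Python sketch using `collections.Counter`. This is not how ibis-ml implements CountEncode; `fit_counts` and `count_encode` are hypothetical helpers, and counting None as a regular value is an assumption made only to keep the example short.

```python
from collections import Counter


# Plain-Python illustration; NOT the ibis-ml implementation.
def fit_counts(values):
    """Count how often each value occurs in the training data."""
    # Assumption: None is counted like any other value.
    return Counter(values)


def count_encode(values, counts):
    """Replace each value by its training count; unseen values get 0."""
    return [counts.get(v, 0) for v in values]


train = ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"]
test = ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None]

counts = fit_counts(train)
encoded_counts = count_encode(test, counts)
print(encoded_counts)
```

AMZN never appears in the training data, so its count is 0, which is a meaningful value here rather than a placeholder.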
To me, it seems like the immediate action item is to add a test for CategoricalEncode to make sure its functionality is correct, and (at lower priority) to make the OneHotEncode unknown category handling a bit more flexible.
@jitingxu1 closing this; if you want to make OneHotEncode unknown category handling a bit more flexible, feel free to create a new issue, but it doesn't seem to be a priority at this time.