ibis-ml
feat(steps): handle unknown category for all encoders
Unknown categories are ignored by the current encoding implementations. We should consider adding an option to handle them in the future, but it's not a high priority at the moment.
Opening this issue to record it for future consideration.
The current implementations:
- CategoricalEncode will convert an unknown category to None
- OneHotEncode will convert all encoded cols to 0, see the following example
- CountEncode will convert an unknown category to 0
For example:
>>> import ibis
>>> import ibisml as ml
>>> import pandas as pd
>>>
>>> t_train = ibis.memtable(
...     {
...         "time": [
...             pd.Timestamp("2016-05-25 13:30:00.023"),
...             pd.Timestamp("2016-05-25 13:30:00.023"),
...             pd.Timestamp("2016-05-25 13:30:00.030"),
...             pd.Timestamp("2016-05-25 13:30:00.041"),
...             pd.Timestamp("2016-05-25 13:30:00.048"),
...             pd.Timestamp("2016-05-25 13:30:00.049"),
...             pd.Timestamp("2016-05-25 13:30:00.072"),
...             pd.Timestamp("2016-05-25 13:30:00.075"),
...         ],
...         "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"],
...     }
... )
>>> t_test = ibis.memtable(
...     {
...         "time": [
...             pd.Timestamp("2016-05-25 13:30:00.023"),
...             pd.Timestamp("2016-05-25 13:30:00.038"),
...             pd.Timestamp("2016-05-25 13:30:00.048"),
...             pd.Timestamp("2016-05-25 13:30:00.049"),
...             pd.Timestamp("2016-05-25 13:30:00.050"),
...             pd.Timestamp("2016-05-25 13:30:00.051"),
...         ],
...         "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None],
...     }
... )
>>> step = ml.OneHotEncode("ticker")
>>> step.fit_table(t_train, ml.core.Metadata())
>>> res = step.transform_table(t_test)
>>> res
AMZN in the 5th row is unknown, so it is translated to all 0s:
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ time                    ┃ ticker_AAPL ┃ ticker_GOOG ┃ ticker_MSFT ┃ ticker_None ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ timestamp               │ int8        │ int8        │ int8        │ int8        │
├─────────────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤
│ 2016-05-25 13:30:00.023 │           0 │           0 │           1 │           0 │
│ 2016-05-25 13:30:00.038 │           0 │           0 │           1 │           0 │
│ 2016-05-25 13:30:00.048 │           0 │           1 │           0 │           0 │
│ 2016-05-25 13:30:00.049 │           0 │           1 │           0 │           0 │
│ 2016-05-25 13:30:00.050 │           0 │           0 │           0 │           0 │
│ 2016-05-25 13:30:00.051 │           0 │           0 │           0 │           1 │
└─────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
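
To make the three fallback behaviors listed above easier to compare, here is a rough plain-Python sketch that mimics them on the same ticker values. It is not the ibis-ml implementation and does not use ibis at all. The helper names (categorical_encode, one_hot_encode, count_encode) are made up for illustration, and the way nulls are handled in the categorical and count helpers is an assumption of this sketch only.

```python
from collections import Counter

# Same ticker values as t_train / t_test in the example above.
train = ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"]
test = ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None]

# "Fit" step: learn categories and frequencies from the training values only.
known = sorted({c for c in train if c is not None})  # ['AAPL', 'GOOG', 'MSFT']
codes = {c: i for i, c in enumerate(known)}          # category -> integer code
one_hot_columns = known + [None]                     # mirrors the ticker_None column
counts = Counter(train)                              # category -> training frequency


def categorical_encode(value):
    """Hypothetical helper: integer code, or None for an unseen category."""
    return codes.get(value)


def one_hot_encode(value):
    """Hypothetical helper: one 0/1 flag per known column; unseen gives all 0s."""
    return [int(value == c) for c in one_hot_columns]


def count_encode(value):
    """Hypothetical helper: training frequency, or 0 for an unseen category."""
    return counts.get(value, 0)


for value in test:
    print(value, categorical_encode(value), one_hot_encode(value), count_encode(value))
```

In this sketch "AMZN" ends up indistinguishable from a missing value (None), from a row with no category at all (all 0s), or from a zero-frequency category, which is the information loss an explicit unknown-handling option would address.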