
feat(steps): handle unknown category for all encoders

Open jitingxu1 opened this issue 10 months ago • 1 comments

The current encoder implementations do not handle unknown categories explicitly, i.e. values that appear at transform time but were not seen during fitting. We should consider adding an option to handle them in the future, but it is not a high priority at the moment.

Opening this issue to record it for future consideration.

The current implementations:

  • CategoricalEncode converts an unknown category to None
  • OneHotEncode sets all encoded columns to 0 for an unknown category (see the following example)
  • CountEncode converts an unknown category to 0

For example:

>>> import ibis
>>> import ibisml as ml
>>> import pandas as pd
>>>
>>> t_train = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.030"),
...                 pd.Timestamp("2016-05-25 13:30:00.041"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.072"),
...                 pd.Timestamp("2016-05-25 13:30:00.075"),
...             ],
...             "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", None, "AAPL", "GOOG", "MSFT"],
...         }
...     )
>>> t_test = ibis.memtable(
...         {
...             "time": [
...                 pd.Timestamp("2016-05-25 13:30:00.023"),
...                 pd.Timestamp("2016-05-25 13:30:00.038"),
...                 pd.Timestamp("2016-05-25 13:30:00.048"),
...                 pd.Timestamp("2016-05-25 13:30:00.049"),
...                 pd.Timestamp("2016-05-25 13:30:00.050"),
...                 pd.Timestamp("2016-05-25 13:30:00.051"),
...             ],
...             "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AMZN", None],
...         }
...     )
>>> step = ml.OneHotEncode("ticker")
>>> step.fit_table(t_train, ml.core.Metadata())
>>> res = step.transform_table(t_test)
>>> res

AMZN in the 5th row is unknown, so it is encoded as all 0s:

┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ time                    ┃ ticker_AAPL ┃ ticker_GOOG ┃ ticker_MSFT ┃ ticker_None ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ timestamp               │ int8        │ int8        │ int8        │ int8        │
├─────────────────────────┼─────────────┼─────────────┼─────────────┼─────────────┤
│ 2016-05-25 13:30:00.023 │           0 │           0 │           1 │           0 │
│ 2016-05-25 13:30:00.038 │           0 │           0 │           1 │           0 │
│ 2016-05-25 13:30:00.048 │           0 │           1 │           0 │           0 │
│ 2016-05-25 13:30:00.049 │           0 │           1 │           0 │           0 │
│ 2016-05-25 13:30:00.050 │           0 │           0 │           0 │           0 │
│ 2016-05-25 13:30:00.051 │           0 │           0 │           0 │           1 │
└─────────────────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
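
For reference, the behaviour of all three encoders listed above can be mimicked in plain Python. This is only a rough sketch for illustration: the helper names (`fit_categories`, `categorical_encode`, `one_hot_encode`, `count_encode`) are hypothetical and not part of ibis-ml, and nulls are left out of the training data to keep it short.

```python
from collections import Counter


def fit_categories(values):
    """Collect the sorted distinct categories seen during fitting."""
    return sorted(set(values))


def categorical_encode(values, categories):
    """Map each value to its integer code; unknown values become None."""
    codes = {cat: i for i, cat in enumerate(categories)}
    return [codes.get(v) for v in values]


def one_hot_encode(values, categories):
    """One indicator per fitted category; unknown values give an all-zero row."""
    return [[int(v == cat) for cat in categories] for v in values]


def count_encode(values, train_values):
    """Replace each value with its training frequency; unknown values give 0."""
    counts = Counter(train_values)
    return [counts.get(v, 0) for v in values]


# Hypothetical data modelled on the example above; AMZN is never seen in training.
train = ["GOOG", "MSFT", "MSFT", "MSFT", "AAPL", "GOOG", "MSFT"]
test = ["MSFT", "GOOG", "AMZN"]

categories = fit_categories(train)
cat_codes = categorical_encode(test, categories)
one_hot = one_hot_encode(test, categories)
counts = count_encode(test, train)

print(categories)
print(cat_codes)
print(one_hot)
print(counts)
```

In each case the unknown value is silently mapped to a default (None, all 0s, or 0) instead of being flagged, which is the gap an explicit option would close.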

jitingxu1 avatar Apr 19 '24 20:04 jitingxu1