[BUG] Wrong Feature Engineering step in the Dressipi example notebook
Bug description
In the Dressipi notebook, after executing the "Feature Engineering with NVTabular" step, the transformed validation set output has only one row, with session_id == 0.
Expected behavior
I think the bug is caused by this code chunk:
%%time
item_features_names = ['f_' + str(col) for col in [47, 68]]
cat_features = ['session_id', ['item_id', 'purchase_id']] + item_features_names >> nvt.ops.Categorify()
features = ['timestamp','date'] + cat_features
When you categorify session_id, the encoding is fit on the training set. Since the training set and validation set have completely different session_id values, every session_id in the validation set is unseen, and the workflow turns all of them into 0 when it transforms the validation set. After the groupby operation, the output therefore has only one row, with session_id == 0.
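The collapse can be reproduced without NVTabular. The following is a minimal plain-Python sketch, not the notebook's code: the toy rows, the `session_vocab` mapping, and the `group_items_by_session` helper are all hypothetical, and mapping unseen ids to 0 is assumed here to mirror the behaviour described above.

```python
from collections import defaultdict

# Toy (session_id, item_id) rows; train and validation share no session ids.
train_rows = [(101, 7), (101, 8), (102, 7)]
valid_rows = [(201, 7), (201, 8), (202, 8), (203, 7)]

# "Fit" the encoding on the training set only.
# Assumption: known ids start at 1 and 0 is reserved for unseen values.
session_vocab = {}
for session_id, _ in train_rows:
    session_vocab.setdefault(session_id, len(session_vocab) + 1)


def group_items_by_session(rows):
    """Collect item ids per session id, like a groupby on session_id."""
    sessions = defaultdict(list)
    for session_id, item_id in rows:
        sessions[session_id].append(item_id)
    return dict(sessions)


# "Transform" the validation set: every session id is unseen, so all map to 0.
valid_encoded = [(session_vocab.get(s, 0), item) for s, item in valid_rows]

grouped = group_items_by_session(valid_encoded)
print(grouped)  # {0: [7, 8, 8, 7]}
```

Three distinct validation sessions end up merged into a single group keyed by 0, which matches the single-row output reported here.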
In this case we don't need to create an embedding for session_id, so there is no need to categorify it in the first place. Changing the code to the following solves the bug:
%%time
item_features_names = ['f_' + str(col) for col in [47, 68]]
cat_features = [['item_id', 'purchase_id']] + item_features_names >> nvt.ops.Categorify()
features = ['session_id','timestamp','date'] + cat_features
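As a rough check of why this works, here is a plain-Python sketch of the same idea, again with hypothetical toy rows and a hypothetical `item_vocab` mapping rather than the notebook's data or NVTabular's API: only item_id is encoded, while session_id passes through unchanged and is used as the grouping key.

```python
from collections import defaultdict

# Toy (session_id, item_id) rows; session ids differ between the two sets,
# but the items overlap.
train_rows = [(101, 7), (101, 8), (102, 9)]
valid_rows = [(201, 7), (201, 8), (202, 9), (203, 7)]

# Encode item_id only, fitting on the training set.
# Assumption: known ids start at 1 and 0 is reserved for unseen values.
item_vocab = {}
for _, item_id in train_rows:
    item_vocab.setdefault(item_id, len(item_vocab) + 1)

# Group by the raw, unencoded session_id.
sessions = defaultdict(list)
for session_id, item_id in valid_rows:
    sessions[session_id].append(item_vocab.get(item_id, 0))

grouped_fixed = dict(sessions)
print(grouped_fixed)  # {201: [1, 2], 202: [3], 203: [1]}
```

Each validation session keeps its own group, because the grouping key is no longer passed through an encoding that was fit on a disjoint set of ids.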