[BUG] Wrong Feature Engineering step in the Dressipi example notebook

Open zhiruiwang opened this issue 2 years ago • 0 comments

Bug description

In the Dressipi notebook, after running the "Feature Engineering with NVTabular" step, the transformed validation set contains only a single row, with session_id == 0.

Expected behavior

I think the bug is caused by this code chunk:

%%time
item_features_names = ['f_' + str(col) for col in [47, 68]]
cat_features = ['session_id', ['item_id', 'purchase_id']] + item_features_names >> nvt.ops.Categorify()

features = ['timestamp','date'] + cat_features

Categorify is fitted on the training set, so session_id is encoded using only the session IDs seen in training. Because the training and validation sets contain entirely different session IDs, every session_id in the validation set is unseen and is mapped to 0 when the workflow transforms it. The subsequent groupby operation then collapses the whole validation set into a single row with session_id == 0.
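The effect can be reproduced without NVTabular. The following is a minimal sketch in plain Python that only mimics the behavior described above; it is not NVTabular's implementation. The helper names (`fit_vocab`, `encode`, `group_items_by_session`) and the toy session and item IDs are hypothetical, and reserving code 0 for unseen values is an assumption taken from the behavior reported in this issue.

```python
from collections import defaultdict

def fit_vocab(values):
    """Assign codes 1, 2, ... to distinct values in order of first appearance.
    Code 0 is reserved for values not seen at fit time (assumption)."""
    vocab = {}
    for v in values:
        if v not in vocab:
            vocab[v] = len(vocab) + 1
    return vocab

def encode(values, vocab):
    """Map each value to its code; unseen values fall back to 0."""
    return [vocab.get(v, 0) for v in values]

def group_items_by_session(session_ids, item_ids):
    """Collect the items of each session into one list per session."""
    groups = defaultdict(list)
    for session, item in zip(session_ids, item_ids):
        groups[session].append(item)
    return dict(groups)

# Toy data: training and validation sessions do not overlap.
train_sessions = [101, 101, 102, 103]
valid_sessions = [201, 201, 202, 203]
valid_items = [7, 8, 9, 7]

vocab = fit_vocab(train_sessions)              # fitted on training only
encoded_valid = encode(valid_sessions, vocab)  # every validation ID is unseen
grouped = group_items_by_session(encoded_valid, valid_items)

print(encoded_valid)  # all validation sessions share code 0
print(grouped)        # a single group keyed by 0
```

Three distinct validation sessions end up in one group, which matches the single-row output reported above.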

We don't need to create an embedding for session_id here, so there is no need to categorify it in the first place. Changing the code to the following fixes the bug:

%%time
item_features_names = ['f_' + str(col) for col in [47, 68]]
cat_features = [['item_id', 'purchase_id']] + item_features_names >> nvt.ops.Categorify()

features = ['session_id','timestamp','date'] + cat_features
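To illustrate why passing session_id through unencoded helps, here is a small sketch in plain Python with hypothetical toy data. It stands in for the groupby step only and does not call NVTabular; the helper name `group_items_by_session` is made up for this example.

```python
from collections import defaultdict

def group_items_by_session(session_ids, item_ids):
    """Collect the items of each session into one list per session."""
    groups = defaultdict(list)
    for session, item in zip(session_ids, item_ids):
        groups[session].append(item)
    return dict(groups)

# Validation sessions keep their raw IDs because session_id is not categorified.
valid_sessions = [201, 201, 202, 203]
valid_items = [7, 8, 9, 7]

grouped = group_items_by_session(valid_sessions, valid_items)
print(grouped)  # one entry per original validation session
```

With the raw IDs kept, each validation session stays a separate group instead of being merged under a shared code.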

zhiruiwang, Sep 02 '22 16:09