NVTabular
[BUG] get_embedding_sizes generates wrong embedding shape with encode_type = 'joint'
Describe the bug
When categorical columns are jointly encoded, nvt.ops.get_embedding_sizes(workflow) does not return the correct embedding table shapes: each jointly encoded column is listed separately with a cardinality of 0.
Steps/Code to reproduce bug
import cudf
import nvtabular as nvt

df = cudf.DataFrame({
    'a_user_id': ["User_A", "User_E", "User_B", "User_C", "User_A", "User_B", "User_B", "User_C", "User_B", "User_A"],
    'b_user_id': ["User_B", "User_F", "User_D", "User_D", "User_B", "User_E", "User_E", "User_D", "User_D", "User_D"],
    'media': [3, 3, 12, 17, 3, 1, 1, 0, 1, 12],
    'language': ['en', 'en', 'spn', 'fr', 'spn', 'en', 'fr', 'ch', 'ch', 'en'],
})
dataset = nvt.Dataset(df)

# Jointly encode both user columns (one multi-column group)
cat_users = [['a_user_id', 'b_user_id']] >> nvt.ops.Categorify(encode_type='joint')
# Encode the remaining columns independently
cat_others = ['media', 'language'] >> nvt.ops.Categorify()

workflow = nvt.Workflow(cat_users + cat_others)
workflow.fit(dataset)
new_gdf = workflow.transform(dataset).to_ddf().compute()
new_gdf.head()
   a_user_id  b_user_id  media  language
0          1          2      3         2
1          5          6      3         2
2          2          4      4         4
3          3          4      5         3
4          1          2      3         4
nvt.ops.get_embedding_sizes(workflow)
{'media': (6, 16),
'language': (5, 16),
'a_user_id': (0, 16),
'b_user_id': (0, 16)}
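The symptom can be checked programmatically. The sketch below takes the dictionary reported above as a plain Python literal and flags every column whose embedding table would have zero rows. The helper name `find_empty_embeddings` is hypothetical and not part of NVTabular.

```python
# Output of get_embedding_sizes as reported above: {column: (cardinality, dim)}
reported_sizes = {
    "media": (6, 16),
    "language": (5, 16),
    "a_user_id": (0, 16),
    "b_user_id": (0, 16),
}


def find_empty_embeddings(sizes):
    """Return columns whose embedding table would have zero rows (hypothetical helper)."""
    return [name for name, (cardinality, _dim) in sizes.items() if cardinality == 0]


empty = find_empty_embeddings(reported_sizes)
print(empty)  # ['a_user_id', 'b_user_id']
```

An embedding table with zero rows cannot be indexed by any of the encoded ids shown in the transformed output, so a check like this could guard model construction until the sizes are fixed.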
Expected behavior
The following embedding table shapes are expected:
nvt.ops.get_embedding_sizes(workflow)
{'media': (6, 16),
'language': (5, 16),
'a_user_id_b_user_id': (6, 16)}
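As a sanity check on these cardinalities, the distinct values in the reproduction data can be counted in plain Python. This is a minimal sketch, not how NVTabular computes its sizes: it assumes that joint encoding builds a single vocabulary from the union of both user columns, and `unique_count` is a hypothetical helper introduced for illustration.

```python
a_user_id = ["User_A", "User_E", "User_B", "User_C", "User_A",
             "User_B", "User_B", "User_C", "User_B", "User_A"]
b_user_id = ["User_B", "User_F", "User_D", "User_D", "User_B",
             "User_E", "User_E", "User_D", "User_D", "User_D"]
media = [3, 3, 12, 17, 3, 1, 1, 0, 1, 12]
language = ["en", "en", "spn", "fr", "spn", "en", "fr", "ch", "ch", "en"]


def unique_count(*columns):
    """Count distinct values across one or more columns (hypothetical helper)."""
    return len(set().union(*columns))


unique_counts = {
    # Assumption: jointly encoded columns share one vocabulary
    "a_user_id_b_user_id": unique_count(a_user_id, b_user_id),
    "media": unique_count(media),
    "language": unique_count(language),
}
print(unique_counts)  # {'a_user_id_b_user_id': 6, 'media': 5, 'language': 4}
```

The reported sizes for media and language (6 and 5) are one larger than these counts, which suggests that one index is reserved, for example for nulls or unseen values. If the same convention applies to the joint vocabulary, and given that the transformed b_user_id column above contains the id 6, the first dimension of the joint table may need to be 7 rather than 6.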
Environment details
- Environment location: Docker Pytorch-training image.
- Method of NVTabular install: Docker
@benfred do you think this is a bug, or does nvt.ops.get_embedding_sizes(workflow) give the expected output?