NVTabular

[BUG] Groupby op does not respect dtypes converted by LambdaOp when there are null entries in the columns

Open rnyak opened this issue 2 years ago • 1 comments

Describe the bug

I am converting float columns to int64 with LambdaOp and then applying the Groupby op to generate list columns. However, the final dtypes are not integers; they are still floats. This happens when the columns contain nulls.

Please run the code snippet below to reproduce the issue:

import cudf
import numpy as np
import nvtabular as nvt
from nvtabular.ops import LambdaOp

# Two float columns (C2, C3) that each contain NaN entries
gdf = cudf.DataFrame(
    {
        "C1": ['1', '2', '3', '1', '3', '2'] * 3,
        "C2": [22.0, 12.0, 18.0, 13.0, np.nan, 18.0] * 3,
        "C3": [2059.0, 6082.0, 4803.0, 6082.0, np.nan, 5920.0] * 3,
    }
)

# Cast the float columns to int64
feats_int = ['C2', 'C3'] >> LambdaOp(lambda col: col.astype('int64'))

features = ['C1'] + feats_int

# Collect C2 and C3 into list columns, one row per C1 group
groupby_features = features >> nvt.ops.Groupby(
    groupby_cols=["C1"],
    sort_cols=['C1'],
    aggs={
        'C2': ["list"],
        'C3': ["list"],
    },
    name_sep="_",
)

train_dataset = nvt.Dataset(gdf)
workflow = nvt.Workflow(groupby_features)
workflow.fit(train_dataset)
tmp = workflow.transform(train_dataset).to_ddf().compute()
print(tmp)

	C1	C2_list	                                          C3_list
0	1	[22.0, 13.0, 22.0, 13.0, 22.0, 13.0]	[2059.0, 6082.0, 2059.0, 6082.0, 2059.0, 6082.0]
1	2	[12.0, 18.0, 12.0, 18.0, 12.0, 18.0]	[6082.0, 5920.0, 6082.0, 5920.0, 6082.0, 5920.0]
2	3	[18.0, nan, 18.0, nan, 18.0, nan]	[4803.0, nan, 4803.0, nan, 4803.0, nan]

Expected behavior

The final processed output should be as follows:

	C1	C2_list	                         C3_list
0	1	[22, 13, 22, 13, 22, 13]	[2059, 6082, 2059, 6082, 2059, 6082]
...

Environment details (please complete the following information):

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
  • Method of NVTabular install: [conda, Docker, or from source]
    • If method of install is [Docker], provide docker pull & docker run commands used

merlin-tensorflow-training:22.05 container.

rnyak, May 19 '22 17:05

See the cudf docs for an explanation of how NaN works. (Long story short, NaN is a float.)

We might be able to make the NaNs <NA>s instead, but I'm not sure that's really any better.

karlhigley, May 20 '22 01:05
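
To illustrate the point made in the reply, here is a minimal sketch in plain Python (standard library only, not cudf or NVTabular code). It shows that NaN is itself a float and cannot be cast to an integer, and that integers are only possible if missing entries use a separate marker. The helper `to_int_or_none` is hypothetical, and `None` is used here only as a stand-in for a null marker such as <NA>.

```python
import math


def to_int_or_none(values):
    """Cast floats to int, mapping NaN to None (a stand-in for a null marker)."""
    return [None if math.isnan(v) else int(v) for v in values]


# Values of C2 for group "3" in the report: one entry is missing.
c2_group = [18.0, float("nan"), 18.0]

# NaN is a float value, so it has no integer representation.
try:
    int(float("nan"))
    nan_cast_failed = False
except ValueError:
    nan_cast_failed = True

converted = to_int_or_none(c2_group)

print(nan_cast_failed)  # True
print(converted)        # [18, None, 18]
```

The same trade-off applies to the list columns in the report: as long as a missing entry is represented by NaN, the values around it have to stay floats.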