NVTabular
[BUG] Groupby op does not respect dtypes converted by LambdaOp when there are null entries in the columns
Describe the bug
I am converting float dtypes to int64 with LambdaOp and then adding a Groupby op to generate list columns. However, the final dtypes are not integers; they are still floats. This happens when the columns contain nulls.
Please run the code snippet below to reproduce the issue:
import cudf
import numpy as np
import nvtabular as nvt
from nvtabular.ops import LambdaOp

# Float columns C2 and C3 each contain a NaN entry.
gdf = cudf.DataFrame(
    {
        "C1": ['1', '2', '3', '1', '3', '2'] * 3,
        "C2": [22.0, 12.0, 18.0, 13.0, np.nan, 18.0] * 3,
        "C3": [2059.0, 6082.0, 4803.0, 6082.0, np.nan, 5920.0] * 3,
    }
)

# Cast the float columns to int64 before grouping.
feats_int = ['C2', 'C3'] >> LambdaOp(lambda col: col.astype('int64'))
features = ['C1'] + feats_int

# Collect C2 and C3 into list columns per C1 value.
groupby_features = features >> nvt.ops.Groupby(
    groupby_cols=["C1"],
    sort_cols=['C1'],
    aggs={
        'C2': ["list"],
        'C3': ["list"],
    },
    name_sep="_",
)

train_dataset = nvt.Dataset(gdf)
workflow = nvt.Workflow(groupby_features)
workflow.fit(train_dataset)
tmp = workflow.transform(train_dataset).to_ddf().compute()
print(tmp)
   C1                               C2_list                                           C3_list
0   1  [22.0, 13.0, 22.0, 13.0, 22.0, 13.0]  [2059.0, 6082.0, 2059.0, 6082.0, 2059.0, 6082.0]
1   2  [12.0, 18.0, 12.0, 18.0, 12.0, 18.0]  [6082.0, 5920.0, 6082.0, 5920.0, 6082.0, 5920.0]
2   3     [18.0, nan, 18.0, nan, 18.0, nan]           [4803.0, nan, 4803.0, nan, 4803.0, nan]
Expected behavior
The final processed file should be as follows:
   C1                   C2_list                               C3_list
0   1  [22, 13, 22, 13, 22, 13]  [2059, 6082, 2059, 6082, 2059, 6082]
...
Environment details (please complete the following information):
- Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
- Method of NVTabular install: [conda, Docker, or from source]
- If method of install is [Docker], provide docker pull & docker run commands used
- If method of install is [Docker], provide: merlin-tensorflow-training:22.05 container
See the cudf docs for an explanation of how NaN works. (Long story short, NaN is a float.)
We might be able to make the NaNs <NA>s instead, but I'm not sure that's really any better.
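To illustrate the "NaN is a float" point, here is a minimal sketch in plain Python. It does not use cudf or NVTabular and makes no claim about their internals. The helper `to_int_or_none` is hypothetical, and `None` only stands in for an `<NA>`-style missing marker. It shows that NaN cannot be cast to an integer directly, so an integer list needs a separate marker for missing entries.

```python
import math

nan = float("nan")

# NaN is an instance of float, not of int.
assert isinstance(nan, float)


def to_int_or_none(value):
    """Cast a float to int, mapping NaN to None (a stand-in for <NA>).

    Hypothetical helper for illustration only; not part of cudf or NVTabular.
    """
    if math.isnan(value):
        return None
    return int(value)


# Casting NaN directly to an integer raises ValueError in plain Python.
try:
    int(nan)
    nan_cast_ok = True
except ValueError:
    nan_cast_ok = False

# The C2 values for group C1 == '3' from the report above.
c2_group = [18.0, nan, 18.0, nan, 18.0, nan]
as_ints = [to_int_or_none(v) for v in c2_group]

print(nan_cast_ok)  # False
print(as_ints)      # [18, None, 18, None, 18, None]
```

Whether a list with explicit missing markers is preferable to a float list with NaNs depends on what consumes the output downstream, which is the trade-off raised in the comment above.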