NVTabular
[BUG] Categorifying multiple columns results in an error
Describe the bug
When I categorify several transformed columns that still share the same name, properties.domain.name in the resulting schema is wrong.
In addition, the transformed data is not what I expect.
properties.domain.name is timestamp for both columns; I expected timestamp_month and timestamp_year.
import cudf
import nvtabular as nvt
df = cudf.DataFrame({
    'timestamp': [
        '2022-01-01',
        '2022-01-02',
        '2022-01-04',
        '2022-01-05',
        '2022-01-06',
        '2022-01-07',
        '2022-01-08',
        '2022-01-09',
        '2022-01-10',
        '2022-01-11',
        '2022-01-12',
        '2022-02-01',
        '2022-02-02',
        '2022-02-04',
        '2022-02-05',
        '2022-02-06',
        '2022-02-07',
        '2022-02-08',
        '2022-02-09',
        '2022-02-10',
        '2022-02-11',
        '2022-02-12'
    ]
})
df['timestamp'] = cudf.to_datetime(df['timestamp'])
month = (
    ['timestamp']
    >> nvt.ops.LambdaOp(lambda x: x.dt.month)
    >> nvt.ops.Categorify()
    >> nvt.ops.Rename(postfix='_month')
)
year = (
    ['timestamp']
    >> nvt.ops.LambdaOp(lambda x: x.dt.year)
    >> nvt.ops.Categorify()
    >> nvt.ops.Rename(postfix='_year')
)
features = year + month + ['timestamp']
workflow = nvt.Workflow(features)
workflow.fit(nvt.Dataset(df))
df_transformed = workflow.transform(nvt.Dataset(df))
df_transformed.schema
@bschifferer - I'm struggling to decide if this behavior is a bug.
Your workflow is effectively applying Categorify to the same column twice (with different unique values). As far as I can tell, using nvt.ops.LambdaOp(lambda x: x.dt.month) should not change the name of the "timestamp" column. This means that you are writing the unique.timestamp.parquet file twice (once for "timestamp" overwritten by x.dt.month and then again for "timestamp" overwritten by x.dt.year).
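To make the collision concrete, here is a toy model in plain Python. It is only a sketch of the idea, not NVTabular's actual implementation: `stats_store` and `fit_categories` are hypothetical names, and the only detail taken from the real library is the `unique.<column>.parquet` naming pattern mentioned above.

```python
from datetime import date

# Toy stand-in for the category files written during fit. The keys mimic
# file names such as unique.timestamp.parquet. This is an illustration,
# not how NVTabular stores statistics internally.
stats_store = {}


def fit_categories(column_name, values):
    """Store the sorted unique values under a key derived from the column name."""
    key = f"unique.{column_name}.parquet"
    stats_store[key] = sorted(set(values))
    return key


timestamps = [date(2022, 1, 1), date(2022, 1, 2), date(2022, 2, 1)]

# In the reported workflow the rename happens after Categorify, so both
# branches still call their column "timestamp" when statistics are fitted.
month_key = fit_categories("timestamp", [d.month for d in timestamps])
year_key = fit_categories("timestamp", [d.year for d in timestamps])

# Both branches wrote to the same key, so only the second write survives.
print(month_key == year_key)
print(stats_store)
```

In this model the month categories are lost as soon as the year branch is fitted, which is the same kind of overwrite described above.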
I think the following workflow gives you the behavior you want:
month = (
    ['timestamp']
    >> nvt.ops.LambdaOp(lambda x: x.dt.month)
    >> nvt.ops.Rename(postfix='_month')
    >> nvt.ops.Categorify()
)
year = (
    ['timestamp']
    >> nvt.ops.LambdaOp(lambda x: x.dt.year)
    >> nvt.ops.Rename(postfix='_year')
    >> nvt.ops.Categorify()
)
features = year + month + ['timestamp']
Do you feel that a LambdaOp should be renaming input columns by default? The default behavior is to simply overwrite the original column, and I'm not sure if changing this will break existing user code (or if it is a good idea).
Perhaps we can try to detect (and warn) when two or more fit operations are writing inconsistent statistics?
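A rough sketch of what such a check could look like, written in plain Python so it runs without NVTabular. `StatsRegistry` and its `record` method are hypothetical and do not exist in the library; the sketch only assumes that each fit operation can report the key it writes to and the unique values it found.

```python
import warnings


class StatsRegistry:
    """Hypothetical registry that remembers which fit op wrote each statistics key."""

    def __init__(self):
        self._written = {}

    def record(self, key, op_label, values):
        """Register a write and warn if it replaces different statistics."""
        uniques = frozenset(values)
        if key in self._written:
            prev_label, prev_uniques = self._written[key]
            if prev_uniques != uniques:
                warnings.warn(
                    f"{op_label!r} overwrites statistics for {key!r} "
                    f"previously written by {prev_label!r} with different values",
                    UserWarning,
                    stacklevel=2,
                )
        self._written[key] = (op_label, uniques)


registry = StatsRegistry()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Two fit ops report the same key with different unique values.
    registry.record("timestamp", "Categorify(month)", [1, 2])
    registry.record("timestamp", "Categorify(year)", [2022])

for w in caught:
    print(w.message)
```

Comparing the full sets of unique values is the simplest possible consistency test. A real check would probably need something cheaper, such as comparing cardinalities or a hash, since category sets can be large.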