NVTabular

[BUG] Categorifying multiple columns results in an error

Open · bschifferer opened this issue 3 years ago · 1 comment

Describe the bug

When I categorify transformed columns that share the same name, properties.domain.name is wrong. In addition, the transformed data is unexpected.

properties.domain.name is timestamp each time; the expectation is timestamp_month and timestamp_year.

import cudf
import nvtabular as nvt

df = cudf.DataFrame({
    'timestamp': [
        '2022-01-01',
        '2022-01-02',
        '2022-01-04',
        '2022-01-05',
        '2022-01-06',
        '2022-01-07',
        '2022-01-08',
        '2022-01-09',
        '2022-01-10',
        '2022-01-11',
        '2022-01-12',
        '2022-02-01',
        '2022-02-02',
        '2022-02-04',
        '2022-02-05',
        '2022-02-06',
        '2022-02-07',
        '2022-02-08',
        '2022-02-09',
        '2022-02-10',
        '2022-02-11',
        '2022-02-12'
    ]
    
})
df['timestamp'] = cudf.to_datetime(df['timestamp'])

month = [
    'timestamp'
] >> nvt.ops.LambdaOp(lambda x: x.dt.month) >> nvt.ops.Categorify() >> nvt.ops.Rename(postfix='_month')
year = [
    'timestamp'
] >> nvt.ops.LambdaOp(lambda x: x.dt.year) >> nvt.ops.Categorify() >> nvt.ops.Rename(postfix='_year')
features = year + month + ['timestamp']

workflow = nvt.Workflow(features)
workflow.fit(nvt.Dataset(df))
df_transformed = workflow.transform(nvt.Dataset(df))
df_transformed.schema

bschifferer, Sep 19 '22 18:09

@bschifferer - I'm struggling to decide if this behavior is a bug.

Your workflow is effectively applying Categorify on the same column twice (with different unique values). As far as I can tell, using nvt.ops.LambdaOp(lambda x: x.dt.month) should not change the name of the "timestamp" column. This means that you are writing the unique.timestamp.parquet file twice (once for "timestamp" overwritten by col.dt.month and then again for "timestamp" overwritten by col.dt.year).
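To make the collision concrete, here is a toy model in plain Python. It is not NVTabular code: `fit_uniques` and the dictionary standing in for the stats directory are hypothetical, and the only thing taken from the discussion above is the `unique.timestamp.parquet` naming, assumed here to depend solely on the column name.

```python
# Toy model of the name collision, NOT NVTabular internals.
# Assumption: the statistics "file" is keyed only by the column name.

def fit_uniques(store, column_name, values):
    """Record the unique values of a column under a name-derived key."""
    key = f"unique.{column_name}.parquet"
    store[key] = sorted(set(values))  # a second write to the same key replaces the first
    return key

months = [1, 1, 2, 2]
years = [2022, 2022, 2022, 2022]

# Categorify before Rename: both branches still call the column "timestamp".
store_before = {}
fit_uniques(store_before, "timestamp", months)
fit_uniques(store_before, "timestamp", years)

# Rename before Categorify: each branch has its own column name.
store_after = {}
fit_uniques(store_after, "timestamp_month", months)
fit_uniques(store_after, "timestamp_year", years)

print(store_before)
print(store_after)
```

In this sketch the first ordering ends up with a single entry holding only the statistics written last, while the second ordering keeps one entry per branch.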

I think the following workflow gives you the behavior you want:

month = [
    'timestamp'
] >> nvt.ops.LambdaOp(lambda x: x.dt.month) >> nvt.ops.Rename(postfix='_month') >> nvt.ops.Categorify()
year = [
    'timestamp'
] >> nvt.ops.LambdaOp(lambda x: x.dt.year) >> nvt.ops.Rename(postfix='_year') >> nvt.ops.Categorify()
features = year + month + ['timestamp']

Do you feel that a LambdaOp should be renaming input columns by default? The default behavior is to simply overwrite the original column, and I'm not sure if changing this will break existing user code (or if it is a good idea).

Perhaps we can try to detect (and warn) when two or more fit operations are writing inconsistent statistics?
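A rough sketch of what such a check might look like, in plain Python. `StatsRegistry` and its `write` method are hypothetical names made up for illustration; nothing here is an existing NVTabular API, and a real implementation would have to hook into wherever the fit operations write their statistics.

```python
import warnings


class StatsRegistry:
    """Hypothetical helper (not part of NVTabular) that tracks written statistics."""

    def __init__(self):
        self._stats = {}

    def write(self, key, uniques):
        """Store uniques under key, warning if different uniques were already stored."""
        uniques = sorted(set(uniques))
        previous = self._stats.get(key)
        if previous is not None and previous != uniques:
            warnings.warn(
                f"Statistics for {key!r} are being overwritten with different values; "
                "two fit operations may be sharing a column name.",
                UserWarning,
                stacklevel=2,
            )
        self._stats[key] = uniques


registry = StatsRegistry()
registry.write("unique.timestamp.parquet", [1, 2])   # month branch
registry.write("unique.timestamp.parquet", [2022])   # year branch, triggers the warning
```

Writing identical statistics twice stays silent in this sketch, so refitting the same workflow on the same data would not trigger the warning.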

rjzamora, Oct 21 '22 17:10