[BUG] Categorify(start_index) is not generating the mappings in the unique parquet files as expected
**Describe the bug**
I noticed that when we use `start_index` in the `Categorify` op, the generated unique-category parquet files do not correctly map the original, encoded, and null values of the columns.
**Steps/Code to reproduce bug**
Please run the example below to reproduce the issue:
```python
import cudf
import nvtabular as nvt

gdf = cudf.DataFrame(
    {
        "C1": [1, 1, 3, 3, 3] * 2,
        "C2": [1, 1, 1, 2, 2] * 2,
    }
)
print(gdf)

cat_features = ["C1", "C2"] >> nvt.ops.Categorify(start_index=1)
train_dataset = nvt.Dataset(gdf)
workflow = nvt.Workflow(cat_features)
workflow.fit(train_dataset)
gdf_new = workflow.transform(train_dataset).to_ddf().compute()
print(gdf_new)

C1_mapping = cudf.read_parquet("./categories/unique.C1.parquet")
print(C1_mapping)
```
**Expected behavior**
When you print `C1_mapping`, you will see that the mapping does not take `start_index=1` into account (see below), and the encoded values in `gdf_new` do not match the index column in `unique.C1.parquet`. In addition, `start_index=1` should encode nulls as 1, not 0; however, in the table below nulls are encoded as 0.
```
     C1  C1_size
0  <NA>        0
1     3        6
2     1        4
```
This should instead be:
```
     C1  C1_size
1  <NA>        0
2     3        6
3     1        4
```
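As a stand-alone illustration, even with the current plain `[0, N)` index in the unique parquet, a reader can recover the original values from the encoded column by shifting codes back by `start_index` before looking up the row. The sketch below uses plain pandas as a stand-in for the cuDF/NVTabular objects above; the `uniques` frame and `encoded` values are hypothetical, mirroring the expected table:

```python
import pandas as pd

# Hypothetical stand-in for unique.C1.parquet with its default [0, N) index:
# row 0 = <NA>, row 1 = 3, row 2 = 1 (same row order as the table above)
uniques = pd.DataFrame({"C1": pd.array([pd.NA, 3, 1], dtype="Int64")})

start_index = 1
# Encoded values as they would appear with start_index=1
# (e.g. category 1 -> code 3, category 3 -> code 2)
encoded = pd.Series([3, 3, 2, 2, 2])

# Manual inverse encoding: shift each code back before the positional lookup
decoded = encoded.map(lambda code: uniques["C1"].iloc[code - start_index])
print(decoded.tolist())  # original C1 values: [1, 1, 3, 3, 3]
```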
**Environment details (please complete the following information):**
- Environment location: Docker
- Method of NVTabular install: Docker
- If method of install is Docker, `docker pull` & `docker run` commands used: Merlin-tensorflow-training:22.05
@rnyak, the status is marked as done and I don't see any other comments here. Please confirm and close.
Ronay confirmed that the issue is still open.
@rjzamora could you please take a look at this issue? Thanks!
As far as I can tell, the start_index option (which I was actually unaware of until just now) only applies to how values will be encoded at transformation time. This parameter does not change how unique-value statistics are stored.
I'd prefer not to complicate the contents of uniques.*.parquet by making the index anything other than a simple [0, N) range. Is there any clear motivation to reflect the final encoding in the raw unique-value statistics?
Edit: Ah, I suppose the obvious motivation is a simple mechanism for inverse encoding (or to simply "check" the encoding result). I suppose we could store a shifted RangeIndex in the pandas metadata when start_index > 0.
@sararb fyi.
@rjzamora thanks for your comment! Storing a shifted RangeIndex in the pandas metadata when start_index > 0 sounds good, but I am not sure it will be proper/accurate; we should verify this idea first.
This should be addressed based on the suggestions here: https://github.com/NVIDIA-Merlin/NVTabular/issues/1748