[BUG] Categorify(start_index) is not generating the mappings in the unique parquet files as expected
**Describe the bug**
I noticed that when we use `start_index` in the `Categorify` op, the generated unique-category parquet files do not correctly map the original, encoded, and null values of the columns.
**Steps/Code to reproduce bug**
Please run the example below to reproduce the issue:
```python
import cudf
import nvtabular as nvt

gdf = cudf.DataFrame(
    {
        "C1": [1, 1, 3, 3, 3] * 2,
        "C2": [1, 1, 1, 2, 2] * 2,
    }
)
print(gdf)

cat_features = ["C1", "C2"] >> nvt.ops.Categorify(start_index=1)
train_dataset = nvt.Dataset(gdf)
workflow = nvt.Workflow(cat_features)
workflow.fit(train_dataset)
gdf_new = workflow.transform(train_dataset).to_ddf().compute()
print(gdf_new)

C1_mapping = cudf.read_parquet("./categories/unique.C1.parquet")
print(C1_mapping)
```
**Expected behavior**
When you print `C1_mapping`, you will see that the mapping does not take `start_index=1` into account (see below), and the encoded values in `gdf_new` do not match the index column in `unique.C1.parquet`. In addition, `start_index=1` should encode nulls as 1, not 0; however, in the table below nulls are encoded as 0.
```
     C1  C1_size
0  <NA>        0
1     3        6
2     1        4
```
This should instead be:
```
     C1  C1_size
1  <NA>        0
2     3        6
3     1        4
```
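As a stand-alone illustration, even with the current plain `[0, N)` index in the unique parquet, a reader can recover the original values from the encoded column by shifting codes back by `start_index` before looking up the row. The sketch below uses plain pandas as a stand-in for the cuDF/NVTabular objects above; the `uniques` frame and `encoded` values are hypothetical, mirroring the expected table:

```python
import pandas as pd

# Hypothetical stand-in for unique.C1.parquet with its default [0, N) index:
# row 0 = <NA>, row 1 = 3, row 2 = 1 (same row order as the table above)
uniques = pd.DataFrame({"C1": pd.array([pd.NA, 3, 1], dtype="Int64")})

start_index = 1
# Encoded values as they would appear with start_index=1
# (e.g. category 1 -> code 3, category 3 -> code 2)
encoded = pd.Series([3, 3, 2, 2, 2])

# Manual inverse encoding: shift each code back before the positional lookup
decoded = encoded.map(lambda code: uniques["C1"].iloc[code - start_index])
print(decoded.tolist())  # original C1 values: [1, 1, 3, 3, 3]
```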
**Environment details (please complete the following information):**
- Environment location: Docker
- Method of NVTabular install: Docker
- If method of install is Docker, `docker pull` & `docker run` commands used: Merlin-tensorflow-training:22.05
@rnyak, the status is marked as done and I don't see any other comments here. Please confirm and close.
Ronay confirmed that the issue is still open.
@rjzamora could you please take a look at this issue? Thanks!
As far as I can tell, the start_index option (which I was actually unaware of until just now) only applies to how values will be encoded at transformation time. This parameter does not change how unique-value statistics are stored.
I'd prefer not to complicate the contents of uniques.*.parquet by making the index anything other than a simple [0, N) range. Is there any clear motivation to reflect the final encoding in the raw unique-value statistics?
Edit: Ah, I suppose the obvious motivation is a simple mechanism for inverse encoding (or to simply "check" the encoding result). I suppose we could store a shifted RangeIndex in the pandas metadata when start_index > 0.
@sararb fyi.
@rjzamora thanks for your comment! Storing a shifted RangeIndex in the pandas metadata when start_index > 0 sounds good, but I am not sure it will be proper/accurate; we should verify this idea first.
This should be addressed based on the suggestions here: https://github.com/NVIDIA-Merlin/NVTabular/issues/1748