NVTabular icon indicating copy to clipboard operation
NVTabular copied to clipboard

[BUG] Incorrect Reverse Mapping of Categories

Open SaadAhmed433 opened this issue 3 years ago • 6 comments

I transformed a data set using a saved NVTabular workflow. The dateset contains the following news_ids = ["N86358", "N81415", "N13279"].

The transformed data set converts the news_id into integer news ids. The transformed data set looks like this

image

And when I try to get to the original news ids than the result does not match to the news ids mentioned in the above list. image

Also another thing I have noticed is that if I search for these **news_ids = ["N86358", "N81415", "N13279"]**` in the data frame created from the unique.news_id.parquet file than there is always a difference of 1 between the converted data set news ids and the one i get if i do a manual search in the parquet file

image

O.S: Ubuntu 20 NVtabular version 0.9.0

SaadAhmed433 avatar Apr 15 '22 11:04 SaadAhmed433

NVTabular doesn't yet have an explicit reverse mapping from categorical ids to their original values so this seems like a reasonable workaround, but there's one thing to be aware of that makes it a bit less straightforward than it would seem:

By default, the Categorify operator reserves index 0 to encode null/missing values, which is likely the source of the off-by-one issue you're seeing.

karlhigley avatar Apr 15 '22 15:04 karlhigley

@karlhigley thank you for your response, yes I am aware that it encodes the null values to 0 by default. For example when I Ioad the mappings from the parquet file l get the following output. At index 0 we can see null values are encoded

import cudf
news_id = cudf.read_parquet("./data/data_chunks/workflow_etl_mind_2/categories/unique.news_id.parquet")
news_id = news_id.reset_index()
news_id.head()
index news_id news_id_size
0 NA 1
1 N98178 60638
2 N32154 47078
3 N30899 47003
4 N47257 42147

So i am assuming now if the integer id after transformation is 70 then when I search for the 70th index in the the news_id data frame then its going to be index 69 right because in python indexes start from 0.

Please let me know if my understanding is correct?

SaadAhmed433 avatar Apr 18 '22 07:04 SaadAhmed433

@SaadAhmed433 can you please generate a minimal reproducible example with your NVT pipeline. you can create a simple fake dataset of 10-20 rows and add your NVT code snippet, so we can better answer your question and reproduce the issue. Thanks.

for this statement

So i am assuming now if the integer id after transformation is 70 then when I search for the 70th index in the the news_id data frame then its going to be index 69 right because in python indexes start from 0.

the index in the news_id df actually means the new encoded id for your original news_id . so you should search for the exact value here.

rnyak avatar May 12 '22 13:05 rnyak

Sure, @rnyak , please see the code below.

The dataset used can be downloaded from here

import cudf
import nvtabular as nvt

#load the dataset
data = cudf.read_csv("./fake_data.csv")
data.head()
news_id category sub_cat
N23144 health weightloss
N86255 health medical
N93187 news newsworld
N75236 health voices
N99744 health medical
#define the categorical features for applying the Categorify op
cat_feats = ["news_id", "category", "sub_cat"]  >> nvt.ops.Categorify(start_index=1)

#instantiate the workflow
workflow = nvt.Workflow(cat_feats)

#create a dataset object
dataset = nvt.Dataset(data)

#applu and transform to get a dataframe
workflow.fit(dataset)
transfomed_df = workflow.transform(dataset).to_ddf()

#save the workflow
workflow_path = './test_workflow'
workflow.save(workflow_path)

Now I print the transformed dataset and compare

transformed_df = transformed_df.compute()
transformed_df.head()
news_id category sub_cat
38 4 12
93 4 23
96 3 5
82 4 11
101 4 23
news_id_mapping = cudf.read_parquet("./test_workflow/categories/unique.news_id.parquet")
category_mapping =  cudf.read_parquet("./test_workflow/categories/unique.category.parquet")
sub_cat =  cudf.read_parquet("./test_workflow/categories/unique.sub_cat.parquet")

Now when I select the 38 integer id in thetransformed_df and search it in the news_id_mapping data frame I always have to subtract 1 from the integer id otherwise it will give me incorrect mapping. The same happens with every other column.


print(news_id_mapping[news_id_mapping.index == 37], "\n")
print(category_mapping[category_mapping.index == 3], "\n")
print(sub_cat[sub_cat.index == 11], "\n")

image

If you select the exact ids you will get the wrong results.

For e.g

print(news_id_mapping[news_id_mapping.index == 38], "\n")
print(category_mapping[category_mapping.index == 4], "\n")
print(sub_cat[sub_cat.index == 12], "\n")

image

SaadAhmed433 avatar May 13 '22 13:05 SaadAhmed433

@SaadAhmed433 Now when I select the 38 integer id in thetransformed_df and search it in the news_id_mapping data frame I always have to subtract 1 from the integer id otherwise it will give me incorrect mapping.

this is because you add start_index=1 in the Categorify. If you do not add it, you can use the exact_id.

we will look into that and fix it. thanks.

rnyak avatar May 13 '22 20:05 rnyak

This is due to Categorify(start_index) is not generating the mappings in the unique parquet files as expected.

rnyak avatar May 13 '22 21:05 rnyak

@rnyak please link the ticket that updates dev for this bug.

viswa-nvidia avatar Nov 01 '22 17:11 viswa-nvidia

Closing this one. The related bug ticket is here: https://github.com/NVIDIA-Merlin/NVTabular/issues/1544

rnyak avatar Nov 01 '22 17:11 rnyak