BERT4Rec-VAE-Pytorch About the dataset preprocessing part. I think the index of items and users should start at 1 not 0

Open Furyton opened this issue 4 years ago • 5 comments

def densify_index(self, df):
    print('Densifying index')
    umap = {u: i for i, u in enumerate(set(df['uid']))}
    smap = {s: i for i, s in enumerate(set(df['sid']))}
    df['uid'] = df['uid'].map(umap)
    df['sid'] = df['sid'].map(smap)
    return df, umap, smap

Do 'umap' and 'smap' both start at 0? If so, an item with index 0 has the same value as the label used when tokens are masked.

Nov 22 '20 09:11 Furyton

I think so, too

Dec 14 '20 07:12 yuanninesuns

Yes, I agree, since he sets tokens = [0] * mask_len + tokens. 0 should be reserved for padding, (1, num_items) for the item indexes, and num_items + 1 for the CLOZE_MASK_TOKEN here: https://github.com/jaywonchung/BERT4Rec-VAE-Pytorch/blob/master/dataloaders/bert.py#L14

Jan 02 '21 21:01 thomalm
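
The id layout described above can be illustrated with a small sketch. It is not code from the repository: pad_left and PAD_TOKEN are hypothetical names, and only the left-padding expression and the 0 / 1..num_items / num_items + 1 layout come from the thread.

```python
# Hypothetical helper and constant names; only the id layout is from the thread.
PAD_TOKEN = 0


def pad_left(tokens, max_len):
    """Left-pad a token sequence with PAD_TOKEN up to max_len."""
    tokens = tokens[-max_len:]
    mask_len = max_len - len(tokens)
    return [PAD_TOKEN] * mask_len + tokens


num_items = 3
mask_token = num_items + 1  # CLOZE_MASK_TOKEN under the 1-based layout

# 0-based item ids: the real item 0 cannot be told apart from padding.
zero_based = pad_left([0, 2], max_len=4)

# 1-based item ids: every 0 in the sequence is padding.
one_based = pad_left([1, 3], max_len=4)
```

With 0-based ids the padded sequence contains three zeros, only two of which are padding. With 1-based ids the two zeros are unambiguous.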

I am very sorry for the delayed response. I have had little free time recently to actively maintain this repository. Similar issues have come up quite frequently, and a PR is welcome. Thanks.

Jan 03 '21 11:01 jaywonchung

Hi everyone, I am also using the code, and I too ran into this issue. It's been a while, but it doesn't seem to have been solved. I think the problem originates here: https://github.com/jaywonchung/BERT4Rec-VAE-Pytorch/blob/master/datasets/base.py#L134 The fix would be to use i+1 there (in general, I would suggest doing the same for the user IDs). The problem is that, as far as I know, this change affects most of the code. Does anyone have a simple solution, or is it necessary to check all of the code?

Aug 29 '22 17:08 federicosiciliano

I agree. I think it's enough to generate smap starting from index 1; there is no need to modify umap.

smap = {s: i+1 for i, s in enumerate(set(df['sid']))}

Mar 06 '23 09:03 redhated
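
A minimal standalone sketch of the suggestion above, assuming plain Python iterables instead of the DataFrame columns. build_index_maps is a hypothetical name, and sorted() is an addition not present in the original code, used here only to make the mapping reproducible across runs.

```python
def build_index_maps(uids, sids):
    """Build dense index maps; item ids start at 1 so that 0 stays free for padding."""
    # sorted() is an assumption of this sketch, not part of the original code.
    umap = {u: i for i, u in enumerate(sorted(set(uids)))}
    smap = {s: i + 1 for i, s in enumerate(sorted(set(sids)))}
    return umap, smap


umap, smap = build_index_maps(uids=[10, 20, 10], sids=[7, 5, 7, 9])
num_items = len(smap)
# Item ids now span 1..num_items, so num_items + 1 remains free for the mask token.
```

The resulting dictionaries can be applied with df['uid'].map(umap) and df['sid'].map(smap) as in the original method. Other parts of the code that assume 0-based item ids may still need checking, as noted earlier in the thread.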
