BERT4Rec-VAE-Pytorch

About the dataset preprocessing part: I think the indices of items and users should start at 1, not 0

Open Furyton opened this issue 4 years ago • 5 comments

def densify_index(self, df):
    print('Densifying index')
    umap = {u: i for i, u in enumerate(set(df['uid']))}
    smap = {s: i for i, s in enumerate(set(df['sid']))}
    df['uid'] = df['uid'].map(umap)
    df['sid'] = df['sid'].map(smap)
    return df, umap, smap

Do 'umap' and 'smap' both start at 0? If so, item index 0 coincides with the label value used in the masking step.
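To make the concern concrete, here is a minimal sketch of the collision. It assumes 0 is also the value used for left-padding, as noted later in this thread; `pad_left` and `count_real_items` are hypothetical helpers written for illustration, not functions from the repository.

```python
PAD = 0  # assumed padding value

def pad_left(seq, max_len):
    """Left-pad a sequence of item ids with PAD up to max_len."""
    return [PAD] * (max_len - len(seq)) + seq

def count_real_items(padded):
    """Count tokens that are not padding."""
    return sum(1 for t in padded if t != PAD)

# With 0-based item ids, the real item 0 looks exactly like padding.
zero_based = pad_left([0, 2, 1], 5)
# With 1-based item ids, every real item stays distinguishable.
one_based = pad_left([1, 3, 2], 5)

print(zero_based, count_real_items(zero_based))
print(one_based, count_real_items(one_based))
```

Both sequences contain three real interactions, but only the 1-based version lets downstream code recover that count.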

Furyton avatar Nov 22 '20 09:11 Furyton

I think so, too

yuanninesuns avatar Dec 14 '20 07:12 yuanninesuns

Yes, I agree, since he sets tokens = [0] * mask_len + tokens. 0 should be reserved for padding, 1 to num_items for the item indexes, and num_items + 1 for the CLOZE_MASK_TOKEN, which is set here: https://github.com/jaywonchung/BERT4Rec-VAE-Pytorch/blob/master/dataloaders/bert.py#L14
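The layout described above could be sketched roughly as follows. The function name `token_layout` and the returned dictionary are hypothetical and only illustrate the arithmetic; they are not taken from the repository.

```python
def token_layout(num_items):
    """Describe a token id layout with 0 reserved for padding.

    Hypothetical helper: ids 1..num_items are real items and
    num_items + 1 is the mask token.
    """
    return {
        "pad_token": 0,
        "first_item": 1,
        "last_item": num_items,
        "mask_token": num_items + 1,
        # ids run from 0 to num_items + 1 inclusive
        "vocab_size": num_items + 2,
    }

layout = token_layout(num_items=4)
print(layout)
```

Under this scheme an embedding table would need num_items + 2 rows, one for padding, one per item, and one for the mask token.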

thomalm avatar Jan 02 '21 21:01 thomalm

I am very sorry for the delayed response. I have had little free time recently to actively maintain this repository. Similar issues have come up quite frequently, and a PR is welcome. Thanks.

jaywonchung avatar Jan 03 '21 11:01 jaywonchung

Hi everyone, I am also using the code, and I ran into this issue too. It's been a while, but it doesn't seem to have been solved. I think the problem originates here: https://github.com/jaywonchung/BERT4Rec-VAE-Pytorch/blob/master/datasets/base.py#L134 You should use i+1 there (in general, I would suggest doing the same with the user IDs). The problem is that, as far as I know, this change affects most of the code. Does anyone have a simple solution, or is it necessary to check all the code?

federicosiciliano avatar Aug 29 '22 17:08 federicosiciliano

I agree. I think it's enough to generate smap starting from index 1; there is no need to modify umap:

smap = {s: i+1 for i, s in enumerate(set(df['sid']))}
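As a rough, pandas-free sketch of what this change does, the snippet below builds both maps from plain lists. `build_index_maps` is a hypothetical stand-in for `densify_index`, and the `sorted(...)` call is an added assumption to make the mapping reproducible across runs; the original code iterates over the set directly.

```python
def build_index_maps(uids, sids):
    """Map raw ids to dense ids; items start at 1, users at 0."""
    # Users keep 0-based indices, as suggested above.
    umap = {u: i for i, u in enumerate(sorted(set(uids)))}
    # Items start at 1 so that 0 stays free for padding.
    smap = {s: i + 1 for i, s in enumerate(sorted(set(sids)))}
    return umap, smap

# Toy interaction log: parallel lists of raw user and item ids.
uids = [10, 10, 42, 7]
sids = [500, 300, 500, 900]

umap, smap = build_index_maps(uids, sids)
dense_sids = [smap[s] for s in sids]
print(umap, smap, dense_sids)
```

Any code that sizes the item vocabulary or the mask token from len(smap) would still need to be checked against the new range of 1 to num_items.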

redhated avatar Mar 06 '23 09:03 redhated