BERT4Rec-VAE-Pytorch
About the dataset preprocessing part: I think the indices of items and users should start at 1, not 0.
```python
def densify_index(self, df):
    print('Densifying index')
    umap = {u: i for i, u in enumerate(set(df['uid']))}
    smap = {s: i for i, s in enumerate(set(df['sid']))}
    df['uid'] = df['uid'].map(umap)
    df['sid'] = df['sid'].map(smap)
    return df, umap, smap
```
Do both `umap` and `smap` begin at 0? If so, item index 0 is the same as the label used for padding when a token is masked.
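A quick way to check, using nothing beyond the snippet above: `enumerate` always starts at 0, so the smallest dense id produced by these dictionary comprehensions is 0 (the sample ids below are illustrative).

```python
# enumerate() starts at 0, so the smallest dense id is always 0.
uids = [1001, 1002, 1003]  # illustrative raw user ids
umap = {u: i for i, u in enumerate(set(uids))}
print(min(umap.values()))  # 0
```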
I think so, too
Yes, I agree, since he sets `tokens = [0] * mask_len + tokens`. 0 should be reserved for padding, (1, num_items) for the item indexes, and num_items + 1 for the CLOZE_MASK_TOKEN here: https://github.com/jaywonchung/BERT4Rec-VAE-Pytorch/blob/master/dataloaders/bert.py#L14
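To make that layout concrete, here is a small sketch (the value of `num_items` is illustrative; `CLOZE_MASK_TOKEN = num_items + 1` follows the definition linked above):

```python
num_items = 5
PAD_TOKEN = 0                     # reserved for padding
CLOZE_MASK_TOKEN = num_items + 1  # mask token, as in dataloaders/bert.py

# ids produced by the current densify_index: 0 .. num_items - 1
current_ids = set(range(num_items))
# intended ids: 1 .. num_items, leaving 0 free for padding
intended_ids = set(range(1, num_items + 1))

print(PAD_TOKEN in current_ids)   # True  -> item 0 collides with padding
print(PAD_TOKEN in intended_ids)  # False -> no collision
```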
I am very sorry for the delayed response. I've recently had little free time to actively maintain this repository. Similar issues have come up quite frequently, and a PR is welcome. Thanks.
Hi everyone, I am also using the code, and I too ran into this issue. It's been a while, but it doesn't seem to have been solved. I think the problem originates here: https://github.com/jaywonchung/BERT4Rec-VAE-Pytorch/blob/master/datasets/base.py#L134 You should use i+1 (and, in general, I would suggest doing the same for user IDs). The problem is that, as far as I know, this change affects most of the code. Does anyone have a simple solution, or is it necessary to check all the code?
I agree. I think it's enough to generate smap starting from index 1; there is no need to modify umap.

```python
smap = {s: i + 1 for i, s in enumerate(set(df['sid']))}
```
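Putting that suggestion together, a minimal sketch of the fixed function (written here as a standalone function rather than a method, with an illustrative DataFrame; only `smap` is shifted, per the comment above):

```python
import pandas as pd

def densify_index(df):
    print('Densifying index')
    umap = {u: i for i, u in enumerate(set(df['uid']))}
    # start item ids at 1 so that 0 stays free for the padding token
    smap = {s: i + 1 for i, s in enumerate(set(df['sid']))}
    df['uid'] = df['uid'].map(umap)
    df['sid'] = df['sid'].map(smap)
    return df, umap, smap

# illustrative data
df = pd.DataFrame({'uid': [100, 100, 200], 'sid': [7, 9, 7]})
df, umap, smap = densify_index(df)
print(min(smap.values()))  # 1 -> no item id collides with padding 0
```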