Confused about dataset split
Hi, nice work on applying variational autoencoders to recommendation. However, I am confused about the data split method, which is the same as the one used in "Variational Autoencoders for Collaborative Filtering" (WWW 2018).
At https://github.com/ilya-shenbin/RecVAE/blob/8b9b2ded3f215f9e30b45a9cc61199b67fc3da42/preprocessing.py#L60
unique_uid is the index of user_activity rather than the actual uid (unique_uid['userId']). Because of the earlier filtering step, some userIds have been removed. As a result, some valid userIds at the end will not be considered if we use the index of user_activity rather than the actual uid. I guess this might be an error, or is there some other intention behind it?
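To make the concern concrete, here is a minimal sketch on toy data. It assumes user_activity is a DataFrame with a default 0..n-1 row index plus userId and size columns; the toy ratings and the threshold of 2 are made up for illustration and are not taken from the repository.

```python
import pandas as pd

# Toy ratings with deliberately non-contiguous user IDs (10, 20, 30).
tp = pd.DataFrame({
    "userId":  [10, 10, 10, 20, 30, 30],
    "movieId": [1, 2, 3, 1, 2, 3],
})

# Drop users with fewer than 2 interactions (user 20 is removed).
counts = tp.groupby("userId").size()
tp = tp[tp["userId"].isin(counts.index[counts >= 2])]

# Recount as a DataFrame with a default row index (assumed shape of user_activity).
user_activity = tp.groupby("userId").size().reset_index(name="size")

positions = user_activity.index.tolist()        # row positions, not user IDs
actual_ids = user_activity["userId"].tolist()   # the real user IDs

print("index values:", positions)
print("userId values:", actual_ids)

# Selecting ratings by the row positions matches none of the real users here.
print("rows matched by index:", len(tp[tp["userId"].isin(positions)]))
```

In this toy case the index and the userId column share no values, so any split built from the index would select the wrong users or none at all.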
Looking forward to your reply. Thanks!
I have the same doubt. I am not sure why the index is used instead of the actual uid.
Hi,
I agree with you, and I think it's a bug in the code. Initially I wasn't able to run the code and thought it was probably a data issue, so I went back and changed the code as follows.
In preprocessing.py:
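def filter_triplets(tp, min_uc=min_uc, min_sc=min_sc):
    # Assumes get_count returns a DataFrame with the ID column and a 'size' count column.
    if min_sc > 0:
        itemcount = get_count(tp, 'movieId')
        # Compare only the count column; a mask over the whole frame would also test the IDs.
        tp = tp[tp['movieId'].isin(itemcount[itemcount['size'] >= min_sc].movieId)]
        # original: tp = tp[tp['movieId'].isin(itemcount.index[itemcount >= min_sc])]
    if min_uc > 0:
        usercount = get_count(tp, 'userId')
        tp = tp[tp['userId'].isin(usercount[usercount['size'] >= min_uc].userId)]
        # original: tp = tp[tp['userId'].isin(usercount.index[usercount >= min_uc])]
    # Index the counts by the actual IDs so that .index yields userId / movieId values.
    usercount, itemcount = get_count(tp, 'userId').set_index('userId'), get_count(tp, 'movieId').set_index('movieId')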
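For anyone who wants to check this change in isolation, the following is a self-contained sketch on toy data. The get_count helper, the return statement, the default thresholds, and the toy ratings are assumptions made for illustration; they may not match the repository exactly.

```python
import pandas as pd

def get_count(tp, col):
    # Assumed shape: one row per ID, with the interaction count in a 'size' column.
    return tp.groupby(col).size().reset_index(name="size")

def filter_triplets(tp, min_uc=2, min_sc=0):
    # Keep items with at least min_sc interactions.
    if min_sc > 0:
        itemcount = get_count(tp, "movieId")
        keep_items = itemcount.loc[itemcount["size"] >= min_sc, "movieId"]
        tp = tp[tp["movieId"].isin(keep_items)]
    # Keep users with at least min_uc interactions.
    if min_uc > 0:
        usercount = get_count(tp, "userId")
        keep_users = usercount.loc[usercount["size"] >= min_uc, "userId"]
        tp = tp[tp["userId"].isin(keep_users)]
    # Index the final counts by the real IDs so .index can be used directly.
    usercount = get_count(tp, "userId").set_index("userId")
    itemcount = get_count(tp, "movieId").set_index("movieId")
    return tp, usercount, itemcount

# Toy ratings with non-contiguous user IDs; user 20 has a single rating.
ratings = pd.DataFrame({
    "userId":  [10, 10, 10, 20, 30, 30],
    "movieId": [1, 2, 3, 1, 2, 3],
})

filtered, usercount, itemcount = filter_triplets(ratings, min_uc=2, min_sc=0)
unique_uid = usercount.index.tolist()
print("unique_uid:", unique_uid)
```

With the counts indexed by userId, code that reads the index of the user counts receives the real IDs of the remaining users rather than row positions.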
Thanks for your advice~