RecVAE

confused about dataset split

Open junkangwu opened this issue 3 years ago • 3 comments

Hi, nice work on variational autoencoders for recommendation. However, I am confused about the data split, which follows the same approach as the WWW 2018 paper "Variational Autoencoders for Collaborative Filtering". At https://github.com/ilya-shenbin/RecVAE/blob/8b9b2ded3f215f9e30b45a9cc61199b67fc3da42/preprocessing.py#L60, unique_uid is the index of the active users (the index of user_activity) rather than the actual user IDs (unique_uid['userId']). Because of the earlier filtering step, some userId values have been removed. If we use the index of user_activity rather than the actual uid, some valid userId values at the end will never be considered. I guess this might be an error, or is there some other intention behind it?

Looking forward to your reply. Thanks!
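
To make the concern concrete, here is a minimal sketch with made-up data. The frame, the threshold, and the variable names `positional_ids`, `actual_ids`, and `missed` are hypothetical and not taken from the repository. It assumes the per-user counts end up in a DataFrame with a default integer index, which is built here explicitly with `reset_index`.

```python
import pandas as pd

# Hypothetical interactions: user 2 has a single rating and gets filtered out.
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3, 5, 5],
    "movieId": [10, 11, 10, 10, 12, 11, 12],
})

min_uc = 2  # hypothetical minimum number of interactions per user

# Filter out users with fewer than min_uc interactions.
counts = ratings.groupby("userId").size().reset_index(name="size")
keep = counts.loc[counts["size"] >= min_uc, "userId"]
filtered = ratings[ratings["userId"].isin(keep)]

# Recompute the per-user counts; the frame has a default 0..n-1 index.
user_activity = filtered.groupby("userId").size().reset_index(name="size")

positional_ids = user_activity.index.tolist()    # row positions
actual_ids = user_activity["userId"].tolist()    # real user IDs

# Selecting rows by the positional index only matches IDs that happen to
# fall inside 0..n-1, so users with larger IDs are silently skipped.
selected = filtered[filtered["userId"].isin(positional_ids)]
missed = sorted(set(actual_ids) - set(selected["userId"].tolist()))

print(positional_ids, actual_ids, missed)
```

In this toy case the positions are 0, 1, 2 while the surviving user IDs are 1, 3, 5, so only user 1 is picked up and users 3 and 5 are missed.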

junkangwu, Jul 23 '21 11:07

I have the same question. I am not sure why the index is used instead of the actual uid.

shashankg7, Jul 25 '21 11:07

Hi,

I agree with you, and I think it's a bug in the code. Initially I wasn't able to run the code and thought it was probably a data issue, so I went back and changed the code as follows.

In preprocessing.py:

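def filter_triplets(tp, min_uc=min_uc, min_sc=min_sc):
    # Assumes get_count returns a DataFrame with the id column plus a 'size' column
    # (what groupby(..., as_index=False).size() yields in recent pandas versions).
    if min_sc > 0:
        # Keep only items with at least min_sc interactions.
        itemcount = get_count(tp, 'movieId')
        tp = tp[tp['movieId'].isin(itemcount.loc[itemcount['size'] >= min_sc, 'movieId'])]
        # original: tp = tp[tp['movieId'].isin(itemcount.index[itemcount >= min_sc])]
    if min_uc > 0:
        # Keep only users with at least min_uc interactions.
        usercount = get_count(tp, 'userId')
        tp = tp[tp['userId'].isin(usercount.loc[usercount['size'] >= min_uc, 'userId'])]
        # original: tp = tp[tp['userId'].isin(usercount.index[usercount >= min_uc])]

    # Index the counts by the actual IDs so that .index yields real userId / movieId values.
    usercount = get_count(tp, 'userId').set_index('userId')
    itemcount = get_count(tp, 'movieId').set_index('movieId')
    return tp, usercount, itemcount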
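
As a quick sanity check of the idea, the sketch below shows that after `set_index('userId')` the index of the counts frame holds the real user IDs, so every remaining row can be selected through it. The `get_count` here is a hypothetical stand-in for the repository's helper, and the toy data is made up.

```python
import pandas as pd

def get_count(tp, id_col):
    # Hypothetical stand-in: one row per id with its number of interactions,
    # returned as a DataFrame with a default integer index.
    return tp.groupby(id_col).size().reset_index(name="size")

# Made-up interactions for users 3, 7 and 9 (non-contiguous IDs).
tp = pd.DataFrame({
    "userId":  [3, 3, 7, 7, 7, 9],
    "movieId": [1, 2, 1, 2, 3, 1],
})

# Without set_index the index is just the row position.
by_position = get_count(tp, "userId").index.tolist()

# With set_index the index carries the actual user IDs.
usercount = get_count(tp, "userId").set_index("userId")
unique_uid = usercount.index

# Every row of tp is reachable through the real IDs.
covered = tp[tp["userId"].isin(unique_uid)]

print(by_position, unique_uid.tolist(), len(covered) == len(tp))
```

With the IDs in the index, code that takes `user_activity.index` as the list of users works on real user IDs rather than row positions.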

YvetteLi, Oct 04 '22 22:10

Thanks for your advice!

LiaoYunxi, Nov 30 '22 20:11