RecBole [🐛BUG] full_sort_topk for k = 10 produces a list with 22 elements, each element a 10-item list, per user

Open NataliaVConnolly opened this issue 2 years ago • 3 comments

I am trying to follow this example to get top k items recommended for a single user. My code is:

   external_user_ids = ['1234']
   # convert external user tokens to internal ids
   uid_series = dataset.token2id(dataset.uid_field, external_user_ids)
   topk_score, topk_iid_list = full_sort_topk(uid_series, model, train_data, k=10, device=config['device'])
   external_item_list = dataset.id2token(dataset.iid_field, topk_iid_list.cpu())
   print(external_item_list)  # should be external tokens of top 10 items

which is pretty much what the example code does. But I don't get back 10 items - I get 220, in this format:

[['1','2','3','4','5','6','7','8','9','10']
 ['1','2','3','4','5','6','7','8','9','10']
...
 ['1','2','3','4','5','6','7','8','9','10']]

(I replaced the actual item IDs with '1', '2', and so on.)

So basically I get back a list of 22 elements, each element containing 10 external item IDs. Why does this happen, instead of the 10 items that are obtained in the example code referenced above?

To make it even more confusing, none of these 22 sets of 10 items contain the same items as what I get for this specific user when I run predict_for_all_item, as in this example. The predict_for_all_item predicted items make a lot more sense.

Thank you for your help!

Nov 09 '23 02:11 NataliaVConnolly

@NataliaVConnolly Hello! We suggest that you refer to the code implementation here, and also try printing the results to check if the intermediate values meet expectations, such as uid_series.

Nov 15 '23 15:11 zhengbw0324

Thank you for your comment! Yes, that's what I tried to do. I used this code


    topk_score, topk_iid_list = full_sort_topk(
        uid_series, model, test_data, k=10, device=config["device"]
    )
    print(topk_score)  # scores of top 10 items
    print(topk_iid_list)  # internal id of top 10 items
    external_item_list = dataset.id2token(dataset.iid_field, topk_iid_list.cpu())
    print(external_item_list)  # external tokens of top 10 items
    print()

and for a list of 2 external_user_ids, uid_series had 2 elements, as expected. However, topk_score and topk_iid_list were both tensors containing 46 10-item lists, unlike the tutorial, which produced two 10-item lists for a 2-element uid_series.

One difference is that I used train_data, not test_data in full_sort_topk, as I didn't have a test dataset, just training (I was interested in scoring all users, not assessing the performance). Could that have made a difference?

Nov 16 '23 00:11 NataliaVConnolly

@NataliaVConnolly Hello, this is because you used the training data. Our test set is organized on a user-by-user basis, which means that each user is tested only once. The training set, however, may contain multiple records for the same user, which is why more than 2 results are returned.

Nov 16 '23 09:11 zhengbw0324
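
The explanation in the last reply can be illustrated with a toy sketch. This is plain NumPy, not RecBole code: the arrays train_uids and test_uids and the helper rows_for_users are hypothetical, and the sketch assumes that every record matching a queried user is scored separately. Under that assumption, a split with several records per user yields several top-k rows per user, while a split with one record per user yields exactly one.

```python
import numpy as np

# Hypothetical toy data: one entry per interaction record, holding only the user id.
train_uids = np.array([1, 1, 1, 2, 2, 3])  # several records per user
test_uids = np.array([1, 2, 3])            # exactly one record per user


def rows_for_users(record_uids, query_uids):
    """Return the indices of every record that belongs to a queried user."""
    query = np.asarray(query_uids)
    # Compare each queried user against each record; keep the record indices.
    _, record_idx = np.nonzero(query[:, None] == record_uids[None, :])
    return record_idx


query = [1, 2]
train_rows = rows_for_users(train_uids, query)
test_rows = rows_for_users(test_uids, query)

# Assumption: one score vector per selected record, then top-k per row.
rng = np.random.default_rng(0)
n_items, k = 50, 10
train_scores = rng.random((len(train_rows), n_items))
train_topk = np.argsort(-train_scores, axis=1)[:, :k]

print(len(test_rows))    # one row per queried user
print(train_topk.shape)  # one row per matching training record
```

With two queried users, the test-like split selects 2 rows, while the train-like split selects 5, so the top-k result has 5 rows of 10 items rather than 2.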
