DeepCF
Question about the AMusic dataset
https://github.com/familyld/DeepCF/blob/0d91a5ddacde4ec6de16972cb7f22a74c7c9e1a8/Data/AMusic.train.rating#L517
The user whose ID is 21 does not appear in the train set but does appear in the test set.
https://github.com/familyld/DeepCF/blob/0d91a5ddacde4ec6de16972cb7f22a74c7c9e1a8/Data/AMusic.test.rating#L22
According to the original paper,
The ml-1m dataset has been preprocessed by the provider. Each user has at least 20 ratings and each item has been rated by at least 5 users. We process the other three datasets in the same way.
while the user whose ID is 20 has only 2 ratings in the train set.
https://github.com/familyld/DeepCF/blob/0d91a5ddacde4ec6de16972cb7f22a74c7c9e1a8/Data/AMusic.train.rating#L515
Hi~@hegongshan. Thank you for your interest in our work. Notice that the preprocessing procedure starts with filtering users and then items, so the items that a user has interacted with may change after filtering items, e.g., the # of records for user 20 decreased to 2 and the # of records for user 21 decreased to 1. Following (He et al. 2017), we adopt the leave-one-out evaluation method, which accidentally divides the record of user 21 into the test set.
This problem is unexpected, but it doesn't affect the fairness of the comparison experiments. Thank you for pointing this out.
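For context, the leave-one-out protocol of He et al. (2017) holds out each user's latest interaction as the test record and keeps the rest for training. A minimal sketch (hypothetical function and tuple layout, assuming each record carries a timestamp as its last field):

```python
from collections import defaultdict

def leave_one_out_split(records):
    """Split (user, item, rating, timestamp) tuples: each user's most
    recent interaction goes to the test set, the rest to the train set."""
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec[0]].append(rec)
    train, test = [], []
    for user, recs in by_user.items():
        recs.sort(key=lambda r: r[3])  # oldest first
        train.extend(recs[:-1])        # all but the latest interaction
        test.append(recs[-1])          # latest interaction held out
    return train, test
```

Note that a user with a single remaining record after filtering (like user 21 above) contributes nothing to the train set under this split, which matches the situation described in this issue.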
Thank you for your reply. And I have another question.
In Dataset.py, the shape of trainMatrix is (maxUserId + 1, maxItemId + 1), where maxUserId (maxItemId) is the maximum user (item) ID in the train set.
https://github.com/familyld/DeepCF/blob/0d91a5ddacde4ec6de16972cb7f22a74c7c9e1a8/Dataset.py#L51
There are 12929 items in the AMusic dataset, while the maximum item ID in the train set is 12925.
https://github.com/familyld/DeepCF/blob/0d91a5ddacde4ec6de16972cb7f22a74c7c9e1a8/Data/AMusic.train.rating#L5401
However, some item IDs greater than 12925 appear in the test set. https://github.com/familyld/DeepCF/blob/0d91a5ddacde4ec6de16972cb7f22a74c7c9e1a8/Data/AMusic.test.rating#L1012
Because of this, the code fails when evaluating the model on the AMusic dataset.
Have you ever come across this problem?
@hegongshan The original version (which is quite messy) ran successfully because I manually set the shape for each dataset, but I later cleaned up the code for better readability. So a quick way to solve this problem is to manually set the shape; I will correct these errors once I have time. Thank you.
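Instead of hard-coding the shape per dataset, one could scan both rating files for the maximum IDs before building the matrix. A sketch of that workaround (assuming the tab-separated `userID\titemID\t...` line format used by the `.rating` files in this repo):

```python
def get_matrix_shape(train_path, test_path):
    """Return (num_users, num_items) large enough for both files,
    assuming tab-separated lines starting with userID and itemID."""
    max_user, max_item = 0, 0
    for path in (train_path, test_path):
        with open(path) as f:
            for line in f:
                user, item = line.split('\t')[:2]
                max_user = max(max_user, int(user))
                max_item = max(max_item, int(item))
    return max_user + 1, max_item + 1
```

The returned shape can then be passed to the sparse-matrix constructor in Dataset.py in place of the train-set-only maxima.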
Ok. Thank you for answering!
Hello! When I tried to process the AMusic dataset, I found another issue.
Amazon Music: the statistics of the original dataset are summarized in the following table.

| # of Users | # of Items | # of Ratings |
|---|---|---|
| 478235 | 266414 | 836006 |
According to the original paper,
Each user has at least 20 ratings and each item has been rated by at least 5 users.
Following the original paper, I tried to reproduce it. My code is as follows:

```python
from typing import Dict, List


class AmazonRating(object):
    def __init__(self,
                 user_id: str,
                 item_id: str,
                 rating: int,
                 timestamp: int):
        self.user_id = user_id
        self.item_id = item_id
        self.rating = rating
        self.timestamp = timestamp


def process_amazon_music():
    user_item_dict: Dict[str, List[AmazonRating]] = {}
    item_user_dict: Dict[str, List[AmazonRating]] = {}
    with open('data/ratings_Digital_Music.csv') as file:
        for line in file:
            user, item, rating, timestamp = line.split(',')
            r = AmazonRating(user, item, int(float(rating)), int(timestamp))
            user_item_dict.setdefault(user, [])
            user_item_dict[user].append(r)
            item_user_dict.setdefault(item, [])
            item_user_dict[item].append(r)
    # keep users with at least 20 ratings
    user_item_filter_dict = {u: ratings for u, ratings in user_item_dict.items()
                             if len(ratings) >= 20}
    # keep items rated by at least 5 users
    item_user_filter_dict = {i: ratings for i, ratings in item_user_dict.items()
                             if len(ratings) >= 5}
    interaction = 0
    final_item_user_dict = {}
    for u, rating_records in user_item_filter_dict.items():
        for record in rating_records:
            # if item is removed
            if record.item_id not in item_user_filter_dict:
                # mistake: removing from the list being iterated skips elements
                rating_records.remove(record)
            else:
                interaction += 1
                final_item_user_dict.setdefault(record.item_id, [])
                # mistake: overwrites the list instead of appending
                final_item_user_dict[record.item_id] = record
    print('user: %d, item: %d, interaction: %d'
          % (len(user_item_filter_dict), len(final_item_user_dict), interaction))
```
~~The output of the above codes is:~~ ~~> user: 1835, item: 28406, interaction: 40853~~
~~which is inconsistent with the results shown in the original paper.~~
For reference, the context in the paper is as below.
The ml-1m dataset has been preprocessed by the provider. Each user has at least 20 ratings and each item has been rated by at least 5 users. We process the other three datasets in the same way.
For the other three data sets, we preprocess them in a similar way, but you might also find it difficult to guarantee that they satisfy the abovementioned property. As a result, they are processed more than once. The statistics of the final data sets are presented in Table 1, and the data sets provided in this repo are consistent with these figures.
The phrase "in the same way" is actually misleading. Thanks for pointing this out and I will keep this issue open until I correct this error and upload a new version on ArXiv.
By the way, one doesn't need to stick with the data sets we shared. Your code seems correct and please feel free to use your own data sets to test different algorithms/models. Good luck.
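The repeated preprocessing described above can be sketched as a loop that alternates the two constraints until both hold simultaneously, since each filtering pass can invalidate the other constraint. A sketch (hypothetical function name, records as `(user, item)` tuples):

```python
from collections import Counter

def filter_until_stable(records, min_user=20, min_item=5):
    """Repeatedly drop ratings from users with fewer than min_user ratings
    and from items with fewer than min_item ratings, until a fixed point
    is reached where both constraints hold."""
    while True:
        user_counts = Counter(r[0] for r in records)
        item_counts = Counter(r[1] for r in records)
        kept = [r for r in records
                if user_counts[r[0]] >= min_user
                and item_counts[r[1]] >= min_item]
        if len(kept) == len(records):  # fixed point: nothing was dropped
            return kept
        records = kept
```

A single pass (filter users, then items) generally does not guarantee the property, which is why the data sets were processed more than once.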
I'm so sorry! It's my fault!
I found that my code mentioned above had a mistake!
The statistics of the AMusic dataset in the original paper are correct!
That is to say, the number of users and items are 1776 and 12929, respectively.
```python
data_list = []
final_item_set = set()
final_user_set = set()
for u, rating_records in user_item_filter_dict.items():
    for record in rating_records:
        if record.item_id in item_user_filter_dict:
            data_list.append(record)
            final_item_set.add(record.item_id)
            final_user_set.add(record.user_id)
print('user: %d, item: %d, interaction: %d'
      % (len(final_user_set), len(final_item_set), len(data_list)))
```

The output is:

> user: 1776, item: 12929, interaction: 46087
Huh, I only read the preprocessing part, and it seems correct. Happy for you.