
Question about the AMusic dataset

[Open] hegongshan opened this issue 4 years ago • 8 comments

https://github.com/familyld/DeepCF/blob/0d91a5ddacde4ec6de16972cb7f22a74c7c9e1a8/Data/AMusic.train.rating#L517

The user with ID 21 doesn't exist in the train set but appears in the test set.

https://github.com/familyld/DeepCF/blob/0d91a5ddacde4ec6de16972cb7f22a74c7c9e1a8/Data/AMusic.test.rating#L22

According to the original paper,

The ml-1m dataset has been preprocessed by the provider. Each user has at least 20 ratings and each item has been rated by at least 5 users. We process the other three datasets in the same way.

However, the user with ID 20 has only 2 ratings in the train set.

https://github.com/familyld/DeepCF/blob/0d91a5ddacde4ec6de16972cb7f22a74c7c9e1a8/Data/AMusic.train.rating#L515

hegongshan · Oct 21 '20 12:10

Hi~@hegongshan. Thank you for your interest in our work. Note that the preprocessing procedure filters users first and then items, so the set of items a user has interacted with may shrink after the item-filtering step; e.g., the number of records for user 20 decreased to 2, and the number for user 21 decreased to 1. Following He et al. (2017), we adopt the leave-one-out evaluation method, which happened to place user 21's only remaining record in the test set.

This problem is unexpected, but it doesn't affect the fairness of the comparison experiments. Thank you for pointing this out.
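
For readers unfamiliar with the protocol, here is a minimal sketch of leave-one-out splitting in the spirit of He et al. (2017), assuming each record carries a timestamp; the function name and record layout are illustrative, not taken from the DeepCF code:

from collections import defaultdict

def split_leave_one_out(records):
    # records: iterable of (user_id, item_id, rating, timestamp) tuples
    by_user = defaultdict(list)
    for r in records:
        by_user[r[0]].append(r)

    train, test = [], []
    for user, recs in by_user.items():
        recs.sort(key=lambda r: r[3])  # oldest to newest
        train.extend(recs[:-1])        # all but the latest go to train
        test.append(recs[-1])          # the latest interaction is held out
    return train, test

A user left with a single record after filtering (like user 21 above) then contributes nothing to the train set and exactly one record to the test set.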

familyld · Oct 21 '20 13:10

Thank you for your reply. I have another question.

In Dataset.py, the shape of trainMatrix is (maxUserId + 1, maxItemId + 1), where maxUserId (maxItemId) is the maximum user (item) ID in the train set.

https://github.com/familyld/DeepCF/blob/0d91a5ddacde4ec6de16972cb7f22a74c7c9e1a8/Dataset.py#L51

There are 12929 items in the AMusic dataset, but the maximum item ID in the train set is 12925.

https://github.com/familyld/DeepCF/blob/0d91a5ddacde4ec6de16972cb7f22a74c7c9e1a8/Data/AMusic.train.rating#L5401

However, some item IDs in the test set are greater than 12925. https://github.com/familyld/DeepCF/blob/0d91a5ddacde4ec6de16972cb7f22a74c7c9e1a8/Data/AMusic.test.rating#L1012

Because of this, the code can't run successfully on the AMusic dataset when we try to evaluate the model.

Have you ever come across this problem?

hegongshan · Oct 21 '20 14:10

@hegongshan The original version (which is quite messy) ran successfully because I manually set the shape for each dataset, but I later cleaned up the code for better readability. So a quick way to solve this problem is to manually set the shape again; I will correct these errors once I have time. Thank you.
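
As a stopgap, the shape can also be inferred from both rating files instead of being hardcoded. A minimal sketch, assuming the tab-separated user/item/rating/timestamp format of the .rating files in this repo (scan_max_ids is an illustrative helper, not part of the DeepCF code):

def scan_max_ids(*paths):
    # Scan train and test files together so the matrix shape
    # covers every user/item ID that appears in either split.
    max_user, max_item = 0, 0
    for path in paths:
        with open(path) as f:
            for line in f:
                user, item = line.strip().split('\t')[:2]
                max_user = max(max_user, int(user))
                max_item = max(max_item, int(item))
    return max_user + 1, max_item + 1

num_users, num_items = scan_max_ids('Data/AMusic.train.rating',
                                    'Data/AMusic.test.rating')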

familyld · Oct 21 '20 15:10

Ok. Thank you for answering!

hegongshan · Oct 21 '20 16:10

Hello! When I tried to process the AMusic dataset, I ran into another problem.

The statistics of the original Amazon Music dataset are summarized in the following table.

| # of Users | # of Items | # of Ratings |
| --- | --- | --- |
| 478235 | 266414 | 836006 |

According to the original paper,

Each user has at least 20 ratings and each item has been rated by at least 5 users.

Following the original paper, I tried to reproduce the preprocessing. My code is as follows:

from typing import Dict, List


class AmazonRating(object):
    def __init__(self,
                 user_id: str,
                 item_id: str,
                 rating: int,
                 timestamp: int):
        self.user_id = user_id
        self.item_id = item_id
        self.rating = rating
        self.timestamp = timestamp


def process_amazon_music():
    user_item_dict: Dict[str, List[AmazonRating]] = {}
    item_user_dict: Dict[str, List[AmazonRating]] = {}
    with open('data/ratings_Digital_Music.csv') as file:
        for line in file:
            user, item, rating, timestamp = line.split(',')
            r = AmazonRating(user, item, int(float(rating)), int(timestamp))

            user_item_dict.setdefault(user, [])
            user_item_dict[user].append(r)

            item_user_dict.setdefault(item, [])
            item_user_dict[item].append(r)

    # keep users with at least 20 ratings
    user_item_filter_dict = {u: ratings for u, ratings in user_item_dict.items() if len(ratings) >= 20}

    # keep items rated by at least 5 users
    item_user_filter_dict = {i: ratings for i, ratings in item_user_dict.items() if len(ratings) >= 5}

    interaction = 0
    final_item_user_dict = {}
    for u, rating_records in user_item_filter_dict.items():
        for record in rating_records:
            # if the item was removed by the item filter
            if record.item_id not in item_user_filter_dict:
                rating_records.remove(record)
            else:
                interaction += 1
            final_item_user_dict.setdefault(record.item_id, [])
            final_item_user_dict[record.item_id] = record

    print('user: %d, item: %d, interaction: %d' % (len(user_item_filter_dict), len(final_item_user_dict), interaction))

~~The output of the above code is:~~

~~user: 1835, item: 28406, interaction: 40853~~

~~which is inconsistent with the results shown in the original paper.~~


hegongshan · Mar 05 '21 08:03

For reference, the context in the paper is as below.

The ml-1m dataset has been preprocessed by the provider. Each user has at least 20 ratings and each item has been rated by at least 5 users. We process the other three datasets in the same way.

For the other three data sets, we preprocess them in a similar way, but you might find it difficult to guarantee that they satisfy the above-mentioned property after a single pass. As a result, the filtering is applied more than once. The statistics of the final data sets are presented in Table 1, and the data sets provided in this repo are consistent with those figures.
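
To make "more than once" concrete, here is a minimal sketch of such repeated filtering, reusing the AmazonRating records from the snippet above; the function name and loop structure are illustrative, not taken from the DeepCF code:

def iterative_filter(records, min_user_ratings=20, min_item_ratings=5):
    # Re-apply both thresholds until they hold simultaneously:
    # dropping items can push a user back below 20 ratings, and vice versa.
    while True:
        user_counts, item_counts = {}, {}
        for r in records:
            user_counts[r.user_id] = user_counts.get(r.user_id, 0) + 1
            item_counts[r.item_id] = item_counts.get(r.item_id, 0) + 1
        kept = [r for r in records
                if user_counts[r.user_id] >= min_user_ratings
                and item_counts[r.item_id] >= min_item_ratings]
        if len(kept) == len(records):  # fixed point: nothing more to drop
            return kept
        records = kept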

The phrase "in the same way" is indeed misleading. Thanks for pointing this out; I will keep this issue open until I correct this error and upload a new version to arXiv.

By the way, you don't need to stick with the data sets we shared. Your code seems correct, so please feel free to use your own data sets to test different algorithms/models. Good luck.

familyld · Mar 05 '21 13:03


I'm so sorry! It's my fault!

I found that my code above had a mistake!

The statistics of the AMusic dataset in the original paper are correct!

That is, the numbers of users and items are 1776 and 12929, respectively.

# Corrected version: collect surviving records instead of mutating
# rating_records while iterating over it.
data_list = []
final_item_set = set()
final_user_set = set()
for u, rating_records in user_item_filter_dict.items():
    for record in rating_records:
        if record.item_id in item_user_filter_dict:
            data_list.append(record)
            final_item_set.add(record.item_id)
            final_user_set.add(record.user_id)
print('user: %d, item: %d, interaction: %d' % (len(final_user_set), len(final_item_set), len(data_list)))

user: 1776, item: 12929, interaction: 46087

hegongshan · Mar 05 '21 17:03

Huh, I only read the preprocessing part and it seems correct. Happy for you.

familyld · Mar 06 '21 01:03