RecGURU
Exception on processing the collected data
I followed the guideline to download kdd.tar.gz from the provided URL and ran the business_process py file.
However, an exception is raised in the generate_data function, somewhere in lines 209-249, and this handler is triggered:
    except Exception as e:
        print(data_t["uid"][u_id])
        print(data_t["a_item"][u_id])
        print(wesee_items)
        print(data_t["b_item"][u_id])
        print(video_items)
        sys.exit()
Is my downloaded file broken? The MD5 checksum of my kdd.tar.gz is "a7e572a892b602552eaaa4203a8d7f14".
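For anyone who wants to repeat the checksum comparison, here is a minimal sketch of computing the MD5 of a large archive in Python. The helper name `md5_of_file` and the chunk size are illustrative assumptions, not part of the RecGURU code.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file (hypothetical helper).

    The file is read in fixed-size chunks so that a large archive
    does not have to fit in memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file("kdd.tar.gz")` gives a string that can be compared with the value quoted above, or with a checksum from someone whose preprocessing run succeeded.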
- make sure you downloaded the correct file (refer to: https://github.com/Chain123/RecGURU/issues/3).
- check the intermediate result (output of the remove_repeats function). If this function is running correctly, it is unlikely that the original file is broken.
- double-check all the paths, make sure the paths are all valid (exist on your computer)
- print out the error, and maybe remove the "try-except" to see what is wrong.
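The last suggestion can be sketched as follows: instead of swallowing the exception and exiting, report its type, message, and traceback, then re-raise it. The wrapper name `run_with_diagnostics` is a hypothetical helper for illustration and does not exist in the repository.

```python
import traceback


def run_with_diagnostics(step, *args, **kwargs):
    """Run `step` and print full error details before re-raising.

    `step` is any callable; this wrapper is an illustrative
    assumption, not a function from the RecGURU code.
    """
    try:
        return step(*args, **kwargs)
    except Exception as err:
        # Show what failed and where, rather than exiting silently.
        print(f"{type(err).__name__}: {err}")
        traceback.print_exc()
        raise
```

With the traceback visible, the failing line can be located directly instead of guessing from the printed variables.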
The three steps check out. I removed the "try-except" and found that the error occurs in these lines:
    elif uid_t in overlap_ids:
        # random
        if u_id % 2 == 0:
            a_only_data["seq"].append(wesee_items[:-2])
            a_only_data["val"].append(wesee_items[-2])
            a_only_data["test"].append(wesee_items[-1])
        else:
            b_only_data["seq"].append(video_items[:-2])
            b_only_data["val"].append(video_items[-2])
            b_only_data["test"].append(video_items[-1])
Then I printed the uid, wesee_items, and video_items. The output looks like this:
14
['248272', '411939', '98300', '180923', '268539', '55331', '265208', '388334', '440809', '318048', '304142', '283233', '7075', '396527', '323243', '38977']
[179, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223]
[]
[]
Among the overlap user IDs, some users have no video_items or no wesee_items. If I keep the 'try-except' to skip these cases, will the generated dataset differ from the one used in your paper?
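To make the "skip these cases" option concrete, here is a minimal sketch of the same leave-last-two split with an explicit length check instead of a blanket try-except. The function names and the `min_len=3` threshold (one training item plus one validation and one test item) are assumptions for illustration and are not taken from the repository.

```python
def split_sequence(items, min_len=3):
    """Split a history into (train prefix, validation item, test item).

    Returns None when the history is too short to split.
    min_len=3 is an assumed threshold, not the repository's value.
    """
    if len(items) < min_len:
        return None
    return items[:-2], items[-2], items[-1]


def split_all(user_items):
    """Split every user's history and record which users were skipped."""
    data = {"seq": [], "val": [], "test": []}
    skipped = []
    for uid, items in user_items.items():
        parts = split_sequence(items)
        if parts is None:
            skipped.append(uid)  # keep track instead of failing silently
            continue
        seq, val, test = parts
        data["seq"].append(seq)
        data["val"].append(val)
        data["test"].append(test)
    return data, skipped
```

Counting the skipped users shows how many users the workaround drops, which indicates how far the resulting dataset could drift from the original one.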
This is weird, because a user who has no behavior history in one of the two domains is not an overlapped user.
In the remove_repeats() function, we make sure that all the users in a_valid and b_valid have at least 5 history items in the corresponding domain. Then, we get "overlap_ids" from the all_overlap_users() function (line 171), which simply returns the intersection of a_valid and b_valid. Therefore, the users in "overlap_ids" should have at least 5 behavior histories in both domains. Have you run the remove_repeats() function correctly? Make sure you get the correct a_valid and b_valid sets.
If you just avoid this exception, then the final dataset might be a little different from my previous version. But it should not matter.
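The invariant described above can be checked with a short sketch like the one below. It does not reproduce remove_repeats() or all_overlap_users(); the helper names and the dict-of-lists input format (user ID mapped to item list) are assumptions made for illustration.

```python
def valid_users(histories, min_items=5):
    """Return the IDs of users with at least `min_items` history items."""
    return {uid for uid, items in histories.items() if len(items) >= min_items}


def inconsistent_overlap(overlap_ids, a_hist, b_hist, min_items=5):
    """List overlap users lacking enough history in either domain.

    An empty result means the overlap set satisfies the invariant.
    """
    return sorted(
        uid
        for uid in overlap_ids
        if len(a_hist.get(uid, [])) < min_items
        or len(b_hist.get(uid, [])) < min_items
    )


# Toy data: user 1 is valid in both domains, user 2 only in domain A.
a_hist = {1: [1, 2, 3, 4, 5], 2: [1, 2, 3, 4, 5, 6]}
b_hist = {1: [7, 8, 9, 10, 11], 2: []}
overlap_ids = valid_users(a_hist) & valid_users(b_hist)
```

If `inconsistent_overlap` returns a non-empty list for the real data, the valid sets or the per-user item lookups are probably out of sync, which would point back at the remove_repeats step.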
Thank you for your reply. Could you please upload the processed dataset? I found it hard to process the Amazon dataset with the provided code: processing the books and clothing categories runs out of memory.