RecGURU

Exception on processing the collected data

Open WujiangXu opened this issue 3 years ago • 4 comments

I followed the guideline to download kdd.tar.gz from the provided URL and ran the business_process.py file.

But some exceptions happen in the generate_data function, in lines 209-249:

    except Exception as e:
        print(data_t["uid"][u_id])
        print(data_t["a_item"][u_id])
        print(wesee_items)
        print(data_t["b_item"][u_id])
        print(video_items)
        sys.exit()

Is my downloaded file broken? The MD5 value of my kdd.tar.gz is "a7e572a892b602552eaaa4203a8d7f14".
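
For reference, a minimal sketch of how the checksum can be computed (assuming kdd.tar.gz is in the working directory):

    import hashlib

    # read the archive in chunks so large files do not need to fit in memory
    md5 = hashlib.md5()
    with open("kdd.tar.gz", "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    print(md5.hexdigest())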

WujiangXu avatar Jun 01 '22 16:06 WujiangXu

  1. make sure you downloaded the correct file (refer to: https://github.com/Chain123/RecGURU/issues/3).
  2. check the intermediate result (output of the remove_repeats function). If this function is running correctly, it is unlikely that the original file is broken.
  3. double-check all the paths and make sure they are all valid (exist on your computer).
  4. print out the error, and maybe remove the "try-except" to see what is wrong (see the sketch below).
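
For step 4, a minimal sketch (process_user is an illustrative placeholder for whatever the try-except wraps):

    import traceback

    try:
        process_user(u_id)  # illustrative placeholder for the guarded body
    except Exception:
        traceback.print_exc()  # print the full stack trace
        raise                  # re-raise so the run stops at the real error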

Chain123 avatar Jun 02 '22 02:06 Chain123

The first three steps check out. I removed the "try-except" and found the error in these lines:

    elif uid_t in overlap_ids:
        # randomly assign the overlapped user to one of the two domains
        if u_id % 2 == 0:
            a_only_data["seq"].append(wesee_items[:-2])
            a_only_data["val"].append(wesee_items[-2])
            a_only_data["test"].append(wesee_items[-1])
        else:
            b_only_data["seq"].append(video_items[:-2])
            b_only_data["val"].append(video_items[-2])
            b_only_data["test"].append(video_items[-1])

Then I printed the uid, wesee_items, and video_items (along with the raw a_item and b_item lists, as in the except block above). The output looks like the following:

14
['248272', '411939', '98300', '180923', '268539', '55331', '265208', '388334', '440809', '318048', '304142', '283233', '7075', 
'396527', '323243', '38977']
[179, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223]
[]
[]

Among the overlapped user ids, there are some user_ids without video_items or wesee_items. If I use the "try-except" to skip these cases, will the generated dataset be different from the one used in your paper?
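
For now, an explicit guard instead of the try-except would look like this (a sketch of the same branch, using the variable names above and assuming it sits inside the per-user loop):

    elif uid_t in overlap_ids:
        # skip overlapped users with too few items in either domain
        # (need at least 3 so that seq, val, and test are all non-empty)
        if len(wesee_items) < 3 or len(video_items) < 3:
            continue
        if u_id % 2 == 0:
            a_only_data["seq"].append(wesee_items[:-2])
            a_only_data["val"].append(wesee_items[-2])
            a_only_data["test"].append(wesee_items[-1])
        else:
            b_only_data["seq"].append(video_items[:-2])
            b_only_data["val"].append(video_items[-2])
            b_only_data["test"].append(video_items[-1])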

WujiangXu avatar Jun 02 '22 04:06 WujiangXu

This is weird, because if a user has no behavior history in one of the domains, then it is not an overlapped user.

In the remove_repeats() function, we make sure that all the users in a_valid and b_valid have at least 5 history items in the corresponding domain. Then, we get the "overlap_ids" from the all_overlap_users() function (line 171), which simply returns the intersection of a_valid and b_valid. Therefore, the users in "overlap_ids" should have at least 5 behavior histories in both domains. Have you run the remove_repeats() function correctly? Make sure you get the correct a_valid and b_valid sets.
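
In other words, the invariant should be roughly this (a sketch with illustrative names, not the exact repo code):

    # a_histories / b_histories map uid -> list of history items (illustrative)
    a_valid = {uid for uid, items in a_histories.items() if len(items) >= 5}
    b_valid = {uid for uid, items in b_histories.items() if len(items) >= 5}
    overlap_ids = a_valid & b_valid
    # every uid in overlap_ids now has >= 5 history items in both domains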

If you just avoid this exception, then the final dataset might be a little different from my previous version. But it should not matter.

Chain123 avatar Jun 08 '22 06:06 Chain123

Thank you for your reply. Could you please upload the processed dataset? I found it hard to process the Amazon dataset with the provided code: when processing Books and Clothing, it runs out of memory.
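
Would a streaming pass like the following help? This is only a sketch, assuming the raw reviews are gzipped JSON lines (one review per line) as in the public Amazon review dumps; the file name and fields are illustrative:

    import gzip
    import json

    def iter_reviews(path):
        # yield one review dict at a time instead of loading the whole file
        with gzip.open(path, "rt") as f:
            for line in f:
                yield json.loads(line)

    pairs = []
    for review in iter_reviews("reviews_Books.json.gz"):
        # keep only the fields the preprocessing needs
        pairs.append((review["reviewerID"], review["asin"]))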

WujiangXu avatar Jun 11 '22 07:06 WujiangXu