RecGURU

Exception on processing the collected data

Open WujiangXu opened this issue 3 years ago • 4 comments

I followed the guideline to download kdd.tar.gz from the provided URL and ran the business_process.py file.

But some exceptions happen in the generate_data function, in lines 209-249:

    except Exception as e:
        print(data_t["uid"][u_id])
        print(data_t["a_item"][u_id])
        print(wesee_items)
        print(data_t["b_item"][u_id])
        print(video_items)
        sys.exit()

Is my downloaded file broken? The MD5 value of my kdd.tar.gz is "a7e572a892b602552eaaa4203a8d7f14".
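
For reference, a minimal sketch of how the checksum can be computed (assuming kdd.tar.gz is in the working directory):

    import hashlib

    # read the archive in chunks so large files do not need to fit in memory
    md5 = hashlib.md5()
    with open("kdd.tar.gz", "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    print(md5.hexdigest())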

WujiangXu avatar Jun 01 '22 16:06 WujiangXu

  1. make sure you downloaded the correct file (refer to: https://github.com/Chain123/RecGURU/issues/3).
  2. check the intermediate result (output of the remove_repeats function). If this function is running correctly, it is unlikely that the original file is broken.
  3. double-check all the paths and make sure they are all valid (exist on your computer).
  4. print out the error, and maybe remove the "try-except" to see what is wrong (see the sketch below).
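
For step 4, a minimal sketch (process_user is an illustrative placeholder for whatever the try-except wraps):

    import traceback

    try:
        process_user(u_id)  # illustrative placeholder for the guarded body
    except Exception:
        traceback.print_exc()  # print the full stack trace
        raise                  # re-raise so the run stops at the real error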

Chain123 avatar Jun 02 '22 02:06 Chain123

The first three steps check out. I removed the "try-except" and found the error in these lines:

    elif uid_t in overlap_ids:
        # randomly assign the overlapped user to one of the two domains
        if u_id % 2 == 0:
            a_only_data["seq"].append(wesee_items[:-2])
            a_only_data["val"].append(wesee_items[-2])
            a_only_data["test"].append(wesee_items[-1])
        else:
            b_only_data["seq"].append(video_items[:-2])
            b_only_data["val"].append(video_items[-2])
            b_only_data["test"].append(video_items[-1])

Then I printed the uid, wesee_items, and video_items (along with the raw a_item and b_item lists, as in the except block above). The output looks like the following:

14
['248272', '411939', '98300', '180923', '268539', '55331', '265208', '388334', '440809', '318048', '304142', '283233', '7075', 
'396527', '323243', '38977']
[179, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223]
[]
[]

Among the overlapped user ids, there are some user_ids without video_items or wesee_items. If I use the "try-except" to skip these cases, will the generated dataset be different from the one used in your paper?
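
For now, an explicit guard instead of the try-except would look like this (a sketch of the same branch, using the variable names above and assuming it sits inside the per-user loop):

    elif uid_t in overlap_ids:
        # skip overlapped users with too few items in either domain
        # (need at least 3 so that seq, val, and test are all non-empty)
        if len(wesee_items) < 3 or len(video_items) < 3:
            continue
        if u_id % 2 == 0:
            a_only_data["seq"].append(wesee_items[:-2])
            a_only_data["val"].append(wesee_items[-2])
            a_only_data["test"].append(wesee_items[-1])
        else:
            b_only_data["seq"].append(video_items[:-2])
            b_only_data["val"].append(video_items[-2])
            b_only_data["test"].append(video_items[-1])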

WujiangXu avatar Jun 02 '22 04:06 WujiangXu

This is weird, because if a user has no behavior history in one of the domains, then it is not an overlapped user.

In the remove_repeats() function, we make sure that all the users in a_valid and b_valid have at least 5 history items in the corresponding domain. Then, we get the "overlap_ids" from the all_overlap_users() function (line 171), which simply returns the intersection of a_valid and b_valid. Therefore, the users in "overlap_ids" should have at least 5 behavior histories in both domains. Have you run the remove_repeats() function correctly? Make sure you get the correct a_valid and b_valid sets.
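
In other words, the invariant should be roughly this (a sketch with illustrative names, not the exact repo code):

    # a_histories / b_histories map uid -> list of history items (illustrative)
    a_valid = {uid for uid, items in a_histories.items() if len(items) >= 5}
    b_valid = {uid for uid, items in b_histories.items() if len(items) >= 5}
    overlap_ids = a_valid & b_valid
    # every uid in overlap_ids now has >= 5 history items in both domains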

If you just avoid this exception, then the final dataset might be a little different from my previous version. But it should not matter.

Chain123 avatar Jun 08 '22 06:06 Chain123

Thank you for your reply. Could you please upload the processed dataset? I found it hard to process the Amazon dataset with the provided code: when processing Books and Clothing, it runs out of memory.
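
Would a streaming pass like the following help? This is only a sketch, assuming the raw reviews are gzipped JSON lines (one review per line) as in the public Amazon review dumps; the file name and fields are illustrative:

    import gzip
    import json

    def iter_reviews(path):
        # yield one review dict at a time instead of loading the whole file
        with gzip.open(path, "rt") as f:
            for line in f:
                yield json.loads(line)

    pairs = []
    for review in iter_reviews("reviews_Books.json.gz"):
        # keep only the fields the preprocessing needs
        pairs.append((review["reviewerID"], review["asin"]))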

WujiangXu avatar Jun 11 '22 07:06 WujiangXu