RecTools
np.setdiff1d is too slow
If the user_id and item_id columns have a categorical dtype (CategoricalDtype), np.setdiff1d becomes very slow on large datasets (more than 10 million unique values).
A possible solution is to replace the code at:
https://github.com/MobileTeleSystems/RecTools/blob/76c41e0e039cd050b46ec0f6cb7f0f668fca9574/rectools/model_selection/time_split.py#L146
with
new_users = set(df_test[Columns.User].unique()) - set(df_train[Columns.User].unique())
The same applies to https://github.com/MobileTeleSystems/RecTools/blob/76c41e0e039cd050b46ec0f6cb7f0f668fca9574/rectools/model_selection/time_split.py#L150
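
A minimal sketch of the proposed change on toy data, to check that the set-based difference gives the same ids as np.setdiff1d. The helper name find_new_ids and the sample Series are hypothetical and not part of RecTools; the baseline call is only an assumption about what the linked lines do.

```python
import numpy as np
import pandas as pd


def find_new_ids(train_ids: pd.Series, test_ids: pd.Series) -> set:
    """Return ids that occur in test but not in train (hash-based set difference).

    Hypothetical helper for illustration; not a RecTools function.
    """
    return set(test_ids.unique()) - set(train_ids.unique())


# Toy categorical id columns standing in for df_train[Columns.User] / df_test[Columns.User].
train = pd.Series([1, 2, 3, 3], dtype="category")
test = pd.Series([2, 3, 4, 5, 5], dtype="category")

new_users = find_new_ids(train, test)

# Sort-based baseline for comparison (assumed to resemble the current code).
baseline = np.setdiff1d(test.to_numpy(), train.to_numpy())
```

Note that the set-based version returns an unordered Python set, whereas np.setdiff1d returns a sorted array, so any downstream code relying on ordering or on array methods would need adjusting. A set can still be passed to Series.isin for filtering.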