recommenders
recommenders copied to clipboard
[BUG] python_chrono_split could create time leakage
The current behavior of python_chrono_split is to take x% of each user/item as train and the rest is the test dataset. This could create time leakage because some items might become popular at some date t=0 but users who's train time happens before that t<0 time could be easily be recommended with that popular item from the future.
Yes that's true. It's a bit challenging to evaluate at different time points for each user. One option could be to make a global time split across all users. The consequence is some of the users might not have items in the test set.
Is this what you are thinking of, or would something else be more useful?
yes, the evaluation should prevent time leakages - training should be based on "past" and testing on "future". training should not have access to "future" data. "past" and "future" definitions must be global.
In a more sophisticated scenario, this "split" date between "past" and "future" could shift sequentially as in back-testing, see https://www.kdnuggets.com/2020/03/uber-unveils-service-backtesting-machine-learning-models-scale.html for more details.
yeah, that would be a nice way to evaluate the models over time if we can add in this functionality to have an absolute split time.
one issue is the timelines across users can vary dramatically. would you still envision the split happening based on ratios of the data or some explicitly provided timestamp(s)?