recommenders icon indicating copy to clipboard operation
recommenders copied to clipboard

[BUG] python_chrono_split could create time leakage

Open chanansh opened this issue 2 years ago • 3 comments

The current behavior of python_chrono_split is to take x% of each user/item as train and the rest is the test dataset. This could create time leakage because some items might become popular at some date t=0 but users who's train time happens before that t<0 time could be easily be recommended with that popular item from the future.

chanansh avatar Mar 14 '22 21:03 chanansh

Yes that's true. It's a bit challenging to evaluate at different time points for each user. One option could be to make a global time split across all users. The consequence is some of the users might not have items in the test set.

Is this what you are thinking of, or would something else be more useful?

gramhagen avatar Mar 15 '22 10:03 gramhagen

yes, the evaluation should prevent time leakages - training should be based on "past" and testing on "future". training should not have access to "future" data. "past" and "future" definitions must be global.

In a more sophisticated scenario, this "split" date between "past" and "future" could shift sequentially as in back-testing, see https://www.kdnuggets.com/2020/03/uber-unveils-service-backtesting-machine-learning-models-scale.html for more details.

image

chanansh avatar Mar 22 '22 15:03 chanansh

yeah, that would be a nice way to evaluate the models over time if we can add in this functionality to have an absolute split time.

one issue is the timelines across users can vary dramatically. would you still envision the split happening based on ratios of the data or some explicitly provided timestamp(s)?

gramhagen avatar Mar 22 '22 18:03 gramhagen