
Data leakage with GroupNormalizer in TimeSeriesDataSet

Open · LuisPerezVazquez opened this issue 3 years ago

PyTorch-Forecasting version: 0.10.2 · PyTorch version: 1.12.1 · Python version: 3.10.4 · Operating System: Linux

Expected behavior

When creating a TimeSeriesDataSet object from a pandas DataFrame with predict_mode=True, the target_normalizer (GroupNormalizer) should not be fitted on the future, a priori unknown, true target values, since that is data leakage; it should be fitted only on the data points up to (length - max_prediction_length).

Actual behavior

We have an already-trained forecasting model and want to create a TimeSeriesDataSet from a pandas DataFrame for inference with it, using the parameters predict_mode=True and max_prediction_length so that part of the data is used as encoder samples and the rest as prediction samples (see predict_mode in the documentation).

Indeed, we can see in the code how these parameters make the instantiated TimeSeriesDataSet use part of the data for encoding and the rest for prediction. However, the target_normalizer (GroupNormalizer) is fitted on all target values, past and future, effectively incurring data leakage: the predictions are not independent of the a priori unknown future values we pass to the object.

My current workaround is to first instantiate an additional TimeSeriesDataSet on the data prior to the forecasting date and then pass its fitted target_normalizer to the TimeSeriesDataSet used for inference. This prevents the leakage, but it is highly inefficient (you define almost the same object twice); a sketch of this workaround follows the reproduction code below. A more efficient approach would be to fit a GroupNormalizer on the data directly and pass it in, but this raises TypeError: '<' not supported between instances of 'int' and 'str', probably because X's index (named id) is encoded while self.norm_'s index (also named id) is not (if you have a workaround for this, that would also be helpful).

Code to reproduce the problem

# Assumes df, max_encoder_length and max_prediction_length are defined beforehand.
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer

inference = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="sales",
    group_ids=["id"],
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    # fitted on ALL of df's target values, including the future horizon -> leakage
    target_normalizer=GroupNormalizer(groups=["id"]),
    predict_mode=True,
)
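
For reference, a minimal sketch of the two-dataset workaround described above, assuming the same df, max_encoder_length and max_prediction_length as in the reproduction, plus a hypothetical forecast_start (the first time_idx to be predicted):

# Sketch of the workaround: fit the GroupNormalizer on history only, then reuse it.
# forecast_start (first time_idx to predict) is an illustrative assumption.
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer

# 1) Throwaway dataset on the history only, so the normalizer never sees the future.
history = TimeSeriesDataSet(
    df[df["time_idx"] < forecast_start],
    time_idx="time_idx",
    target="sales",
    group_ids=["id"],
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    target_normalizer=GroupNormalizer(groups=["id"]),
)

# 2) Inference dataset reusing the normalizer fitted on history only.
inference = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="sales",
    group_ids=["id"],
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    target_normalizer=history.target_normalizer,  # already fitted, reused as-is
    predict_mode=True,
)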

LuisPerezVazquez · Aug 24 '22 17:08

Any update on this?

FrancescoFondaco · Sep 21 '22 13:09

Hello,

I'm not 100% sure, but from what I've understood of the code, if you use the from_dataset method to create your inference dataset from your training dataset, the GroupNormalizer is not fitted again, so there's no future leakage. However, in your code example you're creating inference from scratch, so you do have leakage. In this particular case I think playing with the min_prediction_idx parameter might do the trick, but I haven't checked the code.
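
A minimal sketch of that pattern, assuming training is the TimeSeriesDataSet the model was fitted on and df contains the encoder history plus the horizon to predict:

# Sketch: from_dataset reuses the training dataset's parameters, including its
# already-fitted target_normalizer, instead of refitting it on df.
from pytorch_forecasting import TimeSeriesDataSet

inference = TimeSeriesDataSet.from_dataset(
    training,                 # dataset the model was trained on (assumed)
    df,                       # new data: encoder history + rows to predict
    predict=True,             # one sample per series: last encoder window + horizon
    stop_randomization=True,  # keep encoder/decoder lengths deterministic
)
# Alternatively, min_prediction_idx could be passed to restrict where
# predictions start, as suggested above (untested).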

Antoine-Schwartz · Jun 08 '23 15:06

The TimeSeriesDataSet target normalizer also affects us when decoupling training and inference: https://github.com/jdb78/pytorch-forecasting/issues/1345

andre-marcos-perez · Jul 21 '23 08:07