pytorch-forecasting
Data leakage with GroupNormalizer in TimeSeriesDataset
PyTorch-Forecasting version: 0.10.2
PyTorch version: 1.12.1
Python version: 3.10.4
Operating System: Linux
Expected behavior
When creating a TimeSeriesDataSet object from a pandas DataFrame with predict_mode=True, the target_normalizer (GroupNormalizer) should not be fitted on the future, a priori unknown, true target values, since that is data leakage. It should only be fitted on the data points up to (length - max_prediction_length).
Actual behavior
We have an already trained forecasting model and want to create a TimeSeriesDataSet from a pandas DataFrame for inference with this model, using the parameters predict_mode=True and max_prediction_length so that part of the data serves as encoder samples and the rest as prediction samples (see predict_mode in the documentation).
The code confirms that these parameters make the instantiated TimeSeriesDataSet use part of the data for encoding and the rest for prediction. However, the target_normalizer (GroupNormalizer) is fitted on all target values, past and future, incurring data leakage: the predictions are not independent of the a priori unknown future values we pass to the object.
My current workaround is to first instantiate an additional TimeSeriesDataSet on the data prior to the forecasting date, and then pass its fitted target_normalizer to the TimeSeriesDataSet used for inference. This prevents the leakage, but it is highly inefficient: almost the same object is defined twice. A more efficient approach would be to fit a GroupNormalizer directly on the data and pass it in, but this raises TypeError: '<' not supported between instances of 'int' and 'str', probably because X's index (named id) is encoded while self.norm_'s index (also named id) is not (a workaround for this would also be helpful).
Code to reproduce the problem
inference = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="sales",
    group_ids=["id"],
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    target_normalizer=GroupNormalizer(groups=["id"]),
    predict_mode=True,
)
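The mechanism behind the leakage can be illustrated without pytorch-forecasting at all. In this sketch (toy numbers, made-up window lengths), a standardizer fitted on the full series, future window included, yields different scaling parameters than one fitted on the history alone, so the normalized encoder inputs the model sees depend on the unseen future targets:

```python
import numpy as np

# Toy sales series for one group: 20 historical points plus a 5-step
# "future" window whose true values are unknown at inference time.
rng = np.random.default_rng(0)
history = rng.normal(loc=100.0, scale=10.0, size=20)
future = rng.normal(loc=150.0, scale=10.0, size=5)  # level shift in the future
full = np.concatenate([history, future])

# Leaky fit: mean/std computed on the full series, future included
# (analogous to fitting the target_normalizer on all target values).
leaky_mean, leaky_std = full.mean(), full.std()

# Leak-free fit: only the points up to (length - max_prediction_length).
clean_mean, clean_std = history.mean(), history.std()

# The normalized encoder inputs differ between the two fits, so the
# resulting predictions are not independent of the future values.
leaky_scaled = (history - leaky_mean) / leaky_std
clean_scaled = (history - clean_mean) / clean_std
print(np.abs(leaky_scaled - clean_scaled).max())  # clearly nonzero
```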
Any update about this?
Hello,
I'm not 100% sure, but from what I've understood of the code, if you use the from_dataset method to create your inference dataset from your training dataset, the GroupNormalizer is not fitted again, so there is no future leakage.
However, in your code example, since you are creating inference from scratch, you do have leakage. In this particular case, playing with the min_prediction_idx parameter might do the trick, but I haven't checked the code.
The TimeSeriesDataSet target normalizer is also preventing us from decoupling training and inference: https://github.com/jdb78/pytorch-forecasting/issues/1345