pytorch-forecasting

How to split data into validation and test sets using `.from_dataset()` from the same data object

Open e-alizadeh opened this issue 2 years ago • 10 comments

  • PyTorch-Forecasting version: 0.10.2
  • PyTorch-Lightning version: 1.7.0
  • PyTorch version: 1.12.0
  • Python version: 3.8.13
  • Operating System: MacOSX

I was going through the tutorials on the website, and they mainly use a single validation/test set rather than splitting the two separately. I'm interested in splitting data into training, validation, and test sets. My assumption is that I should define the training TimeSeriesDataSet object and then call .from_dataset() to generate the validation and test datasets. I need help here, as it's not clear to me how to do that.

For simplicity, let's say we have a dataset of 100 observations with time_idx from 0 to 99 and we want to split the data as follows:

  • training: full_data.iloc[:80]
  • validation: full_data.iloc[80:90]
  • test: full_data.iloc[90:]

I know we probably want to keep more data in each set to account for the encoder context length, but the above is just to show the split I'm interested in.

training_cutoff = 80

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    ...
)

# How should the validation and test datasets be defined using the `.from_dataset()` method?
validation = TimeSeriesDataSet.from_dataset(training, data, min_prediction_idx=training_cutoff + 1)

test = TimeSeriesDataSet.from_dataset(??)

e-alizadeh avatar Aug 30 '22 17:08 e-alizadeh

I am struggling with the same question. Did you find an answer?

kurvaraviteja355 avatar Sep 08 '22 10:09 kurvaraviteja355

Why not something like this:

training_cutoff = 80
validation_cutoff = 90

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    ...
)

validation = TimeSeriesDataSet.from_dataset(
    training,
    data[lambda x: x.time_idx <= validation_cutoff],
    min_prediction_idx=training_cutoff + 1,
)

test = TimeSeriesDataSet.from_dataset(
    training,
    data,
    min_prediction_idx=validation_cutoff + 1,
)

Seam8 avatar Sep 13 '22 18:09 Seam8

I tried this but got the following AssertionError:

AssertionError: filters should not remove entries all entries - check encoder/decoder lengths and lags

kurvaraviteja355 avatar Sep 19 '22 12:09 kurvaraviteja355

For which dataset? It looks like your dataset does not provide any sequences that satisfy both the minimum prediction length and the minimum encoder length. Are you sure your test set is long enough?

Seam8 avatar Sep 19 '22 18:09 Seam8
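
As a rough illustration of the point about sequence lengths, the sketch below counts how many encoder/decoder windows fit into a span of time_idx values. It is a simplified model with fixed encoder and prediction lengths, not the library's actual filtering code; the function name `count_prediction_windows` and the lengths used (10, 6, 12) are hypothetical, and the data is assumed to run from time_idx 0 to 99.

```python
def count_prediction_windows(first_idx, last_idx, encoder_length,
                             prediction_length, min_prediction_idx=None):
    """Count the positions where a prediction window can start.

    A start position p is usable when:
      * encoder_length rows of history exist before p,
      * the decoder rows p .. p + prediction_length - 1 exist,
      * p is not earlier than min_prediction_idx (if given).
    """
    earliest = first_idx + encoder_length
    if min_prediction_idx is not None:
        earliest = max(earliest, min_prediction_idx)
    latest = last_idx - prediction_length + 1
    return max(0, latest - earliest + 1)


# Test split from the thread: full data (0..99), predictions start at 91.
print(count_prediction_windows(0, 99, 10, 6, min_prediction_idx=91))   # 4

# Same split with a longer prediction length: nothing fits after index 91.
print(count_prediction_windows(0, 99, 10, 12, min_prediction_idx=91))  # 0

# Passing only the rows after the cutoff leaves no encoder history at all.
print(count_prediction_windows(91, 99, 10, 6, min_prediction_idx=91))  # 0
```

A count of zero in this model corresponds to the situation the assertion message describes: every candidate sequence has been filtered out.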

Hey,

Thank you, it's working. There was a small mistake on my side.

I don't see any difference between your code and the usual train/validation setup, because the model is trained only on the training dataset and not on the validation dataset, which is no different from the regular approach.

kurvaraviteja355 avatar Sep 21 '22 10:09 kurvaraviteja355

I would like to be sure that the method provided by @Seam8 makes the model use the same scaler (StandardScaler) for all datasets, instead of creating a different one for each dataset.

cserpell avatar Sep 21 '22 19:09 cserpell

I have not checked with breakpoints, but I know from experience that you will get an error if some of your categorical classes are absent from the training set. So I guess the encoders and scalers are fitted only on the training set.

Seam8 avatar Oct 27 '22 17:10 Seam8

Indeed, I might be wrong but looking at this: https://github.com/jdb78/pytorch-forecasting/blob/308ea850b82d3a8e8a397f58d589dae9da904eff/pytorch_forecasting/data/timeseries.py#L813-L827

It seems that when self.scalers is already set, the scalers are not fitted again, unless check_is_fitted() raises an error.

Seam8 avatar Oct 27 '22 17:10 Seam8
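
If that reading is right, the behaviour would resemble the "fit once, reuse afterwards" pattern sketched below. This is a generic standard-library illustration, not the pytorch-forecasting implementation; `SimpleStandardScaler` and `prepare` are hypothetical names, and the numbers are made up for the example.

```python
from statistics import mean, pstdev


class SimpleStandardScaler:
    """Toy scaler that remembers the mean and standard deviation it was fitted on."""

    def __init__(self):
        self.mean_ = None
        self.std_ = None

    def is_fitted(self):
        return self.mean_ is not None

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # avoid dividing by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


def prepare(values, scaler):
    """Fit the scaler only if it has not been fitted yet, then transform."""
    if not scaler.is_fitted():
        scaler.fit(values)
    return scaler.transform(values)


train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
valid_values = [6.0, 7.0]

scaler = SimpleStandardScaler()
prepare(train_values, scaler)              # fits on the training values
fitted_params = (scaler.mean_, scaler.std_)

valid_scaled = prepare(valid_values, scaler)  # reuses the fitted parameters
print(fitted_params == (scaler.mean_, scaler.std_))  # True
```

Because the validation values are scaled with the training mean and standard deviation, they land well above zero here instead of being re-centred around their own mean.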

Any news on this one? Can we do something like an 80/20 split? How would we do that?

deltawi avatar Jan 11 '24 07:01 deltawi

Why not something like this:

training_cutoff = 80
validation_cutoff = 90

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    ...
)

validation = TimeSeriesDataSet.from_dataset(
    training,
    data[lambda x: x.time_idx <= validation_cutoff],
    min_prediction_idx=training_cutoff + 1,
)

test = TimeSeriesDataSet.from_dataset(
    training,
    data,
    min_prediction_idx=validation_cutoff + 1,
)

I have a few questions:

  1. I understand why you took the data up to and including the training cutoff for training, but I don't understand why you took the data up to and including the validation cutoff for validation. Shouldn't it be greater than the training cutoff and less than or equal to the validation cutoff?
  2. Why did you not put any cutoff on the test data?

sauravsingh-couture avatar Apr 05 '24 11:04 sauravsingh-couture
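
One way to read the split proposed earlier in the thread: min_prediction_idx sets where predictions may start, while the rows before it stay in the frame so the first prediction still has encoder history (the encoder context concern raised in the original question). The sketch below only does the index arithmetic under that reading. `describe_split` and the encoder length of 10 are hypothetical, and the data is assumed to run from time_idx 0 to 99.

```python
def describe_split(last_idx, min_prediction_idx, encoder_length):
    """Return the index range that gets predicted and the history
    needed by the very first prediction in that range."""
    first_target = min_prediction_idx
    return {
        "predicted": (first_target, last_idx),
        "earliest_context": (first_target - encoder_length, first_target - 1),
    }


training_cutoff, validation_cutoff, last_idx = 80, 90, 99
encoder_length = 10  # hypothetical value for illustration

validation = describe_split(validation_cutoff, training_cutoff + 1, encoder_length)
test = describe_split(last_idx, validation_cutoff + 1, encoder_length)

print(validation)  # {'predicted': (81, 90), 'earliest_context': (71, 80)}
print(test)        # {'predicted': (91, 99), 'earliest_context': (81, 90)}
```

Under these assumptions the validation frame needs rows at or below the training cutoff only as encoder input, and the test frame needs no upper cutoff because it already ends at the last available time_idx.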