pytorch-forecasting
How to split data into validation and test sets using `.from_dataset()` from the same data object
- PyTorch-Forecasting version: 0.10.2
- PyTorch-Lightning version: 1.7.0
- PyTorch version: 1.12.0
- Python version: 3.8.13
- Operating System: MacOSX
I was going through the tutorials on the website, and they mainly use a combined test/validation set rather than splitting the two separately.
I'm interested in splitting the data into training + validation + test sets. My assumption is that I should define the training TimeSeriesDataSet
object and then call `.from_dataset()`
to generate the validation and test datasets. I need help here, as it's not clear to me how to do that.
For simplicity, let's say we have a dataset of 100 observations
with `time_idx`
from 0 to 99, and we want to split the data as follows:
- training: `full_data.iloc[:80]`
- validation: `full_data.iloc[80:90]`
- test: `full_data.iloc[90:]`
I know we probably want to keep more data in each set to account for the `encoder_context_length`, but the above is just to illustrate the split I'm interested in.
```python
training_cutoff = 80

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    ...
)

# How should the validation and test datasets be defined using `.from_dataset()`?
validation = TimeSeriesDataSet.from_dataset(training, data, min_prediction_idx=training_cutoff + 1)
test = TimeSeriesDataSet.from_dataset(??)
```
I am struggling to find the answer to the same question. Did you find one?
Why not something like this:
```python
training_cutoff = 80
validation_cutoff = 90

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    ...
)

validation = TimeSeriesDataSet.from_dataset(
    training,
    data[lambda x: x.time_idx <= validation_cutoff],
    min_prediction_idx=training_cutoff + 1,
)

test = TimeSeriesDataSet.from_dataset(
    training,
    data,
    min_prediction_idx=validation_cutoff + 1,
)
```
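To make the boundary logic above concrete, here is a plain-Python sketch (no pytorch-forecasting required) of which `time_idx` values each split can see and which it is allowed to predict, assuming 100 observations with `time_idx` 0–99 and the two cutoffs from the snippet. My reading (an assumption, not verified against the library source) is that `min_prediction_idx` restricts only the predicted points, while earlier rows remain available as encoder history:

```python
training_cutoff = 80
validation_cutoff = 90
all_idx = list(range(100))  # time_idx 0..99

# training: built only from rows with time_idx <= training_cutoff
train_rows = [t for t in all_idx if t <= training_cutoff]

# validation: rows up to validation_cutoff are available (so the encoder
# can still look back into the training period), but min_prediction_idx
# restricts the predicted points to after the training cutoff
val_rows = [t for t in all_idx if t <= validation_cutoff]
val_predictable = [t for t in val_rows if t >= training_cutoff + 1]

# test: all rows available, predictions restricted past validation_cutoff
test_predictable = [t for t in all_idx if t >= validation_cutoff + 1]

print(train_rows[-1])       # 80
print(val_predictable)      # [81, 82, ..., 90]
print(test_predictable[0])  # 91
```

Note that the predictable ranges of validation and test do not overlap, which is the point of offsetting each `min_prediction_idx` by one past the previous cutoff.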
I have tried this, but I got the following AssertionError:
```
AssertionError: filters should not remove entries all entries - check encoder/decoder lengths and lags
```
For which dataset? It looks like your dataset does not provide any sequences satisfying the minimum prediction length and the minimum encoder length. Are you sure your test set is long enough?
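A rough way to sanity-check this before constructing the dataset: a split can only yield a sample if at least one encoder window plus one decoder window fits in the rows it can see. The helper below is hypothetical (it is not part of pytorch-forecasting) and just illustrates the arithmetic behind the "is your test set long enough?" question; the parameter names mirror the library's `min_encoder_length` and `min_prediction_length`:

```python
def can_build_sample(n_available_rows: int,
                     min_encoder_length: int,
                     min_prediction_length: int) -> bool:
    """True if at least one (encoder, decoder) window fits in the rows."""
    return n_available_rows >= min_encoder_length + min_prediction_length

# e.g. a test split with only 9 rows after the cutoff fails if the encoder
# needs 24 steps of history...
print(can_build_sample(9, min_encoder_length=24, min_prediction_length=6))       # False

# ...but passing the full dataframe (so the encoder can reach back before
# the cutoff) makes the history rows count too
print(can_build_sample(9 + 24, min_encoder_length=24, min_prediction_length=6))  # True
```

This is also consistent with the earlier suggestion of passing the full `data` (rather than only the post-cutoff slice) to `.from_dataset()` and restricting predictions with `min_prediction_idx`.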
Hey,
thank you, it's working. There was a small mistake on my side.
I don't see any difference between your code and a normal train/validation setup, because you are training the model only on the training dataset, not on the validation dataset, which is no different from the regular approach.
I would like to be sure that the method provided by @Seam8 makes the model use the same scaler (StandardScaler), instead of creating a different one for each dataset.
I have not checked with breakpoints, but I know from experience that you will encounter an error if some of your categorical classes are absent from the training set. So I guess the encoders and scalers are only fitted on the training set.
Indeed, I might be wrong, but looking at this: https://github.com/jdb78/pytorch-forecasting/blob/308ea850b82d3a8e8a397f58d589dae9da904eff/pytorch_forecasting/data/timeseries.py#L813-L827
it seems that if self.scalers is already set, the scalers are not fitted again, unless check_is_fitted() triggers an error.
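If that reading of the linked code is right, the pattern is "fit only on first use". Below is a minimal plain-Python sketch of that pattern, with a toy `MeanScaler` standing in for sklearn's StandardScaler (the class names and the `check_is_fitted` method here are illustrative, not the library's actual API):

```python
class NotFittedError(Exception):
    pass


class MeanScaler:
    """Toy stand-in for sklearn's StandardScaler."""

    def __init__(self):
        self.mean_ = None

    def check_is_fitted(self):
        if self.mean_ is None:
            raise NotFittedError

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self


class Dataset:
    """Toy dataset: fits its scaler only if it has not been fitted yet."""

    def __init__(self, values, scaler=None):
        self.scaler = scaler if scaler is not None else MeanScaler()
        try:
            # mirrors the check_is_fitted() guard: only fit on first use
            self.scaler.check_is_fitted()
        except NotFittedError:
            self.scaler.fit(values)
        self.values = values


train = Dataset([0.0, 2.0, 4.0])             # scaler fitted here (mean 2.0)
valid = Dataset([10.0, 20.0], train.scaler)  # reuses training statistics

print(train.scaler.mean_, valid.scaler.mean_)  # 2.0 2.0
print(valid.scaler is train.scaler)            # True
```

Under this pattern the validation and test datasets created via `.from_dataset()` would normalize with the statistics learned on the training data, which is what you want to avoid leakage.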
Any news on this one? Can we do something like an 80/20 split? How do we do that?
Why not something like this:
```python
training_cutoff = 80
validation_cutoff = 90

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    ...
)

validation = TimeSeriesDataSet.from_dataset(
    training,
    data[lambda x: x.time_idx <= validation_cutoff],
    min_prediction_idx=training_cutoff + 1,
)

test = TimeSeriesDataSet.from_dataset(
    training,
    data,
    min_prediction_idx=validation_cutoff + 1,
)
```
I have a few questions:
- I understand why you took the training data up to and including the training cutoff, but I don't understand why you took the validation data up to and including the validation cutoff. Shouldn't it be greater than the training cutoff and less than or equal to the validation cutoff?
- Why did you not put any cutoff on the test data?