pytorch-forecasting
Setting predict=True/False in validation TimeSeriesDataset changes the number of batches of training epochs?
- PyTorch-Forecasting version: 0.9.0
- PyTorch version: 1.9.0+cu102
- Python version: 3.6
- Operating System:
What I want to achieve
Hi everybody, I am trying to fit a Temporal Fusion Transformer model on a training set and, after every x training batches, run a validation epoch on a separate validation set. The validation epoch should iterate over the whole validation set, not only over the last window of each time series (which, if I understand correctly, is what happens when predict=True is set on a TimeSeriesDataSet).
Expected behavior
I have tried several experiments to achieve this in PyTorch Forecasting, so far without success. The TFT tutorial uses the following approach:
training = TimeSeriesDataSet(data[lambda x: x.time_idx <= training_cutoff], ...)
# create validation set (predict=True) which means to predict the last max_prediction_length points in time
# for each series
validation = TimeSeriesDataSet.from_dataset(training, data, predict=True, stop_randomization=True)
# create dataloaders for model
batch_size = 128 # set this between 32 to 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size , num_workers=0)
However, this is not what I want, since it validates only on the last sequence of each series. My guess was that, to do what I want, I should use something like:
training = TimeSeriesDataSet(data[lambda x: x.time_idx <= training_cutoff], ...)
# create validation using a separate chunk of data, set predict=False since I want to validate on the whole set
validation = TimeSeriesDataSet.from_dataset(training, data[lambda x: x.time_idx > training_cutoff], predict=False, stop_randomization=False)
# create dataloaders for model
batch_size = 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
# set train to False since I do not want to drop the last batch
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size , num_workers=0)
And this should have worked as usual:
# fit network
trainer.fit(
tft,
train_dataloader=train_dataloader,
val_dataloaders=val_dataloader,
)
that is, by running a validation epoch over the whole validation dataset.
Actual behavior
Unexpectedly, setting predict=False in the validation dataset makes the reported number of batches in each training epoch grow by a lot (88 instead of 33 in the example below). Is this expected?
Code to reproduce the problem
# Imports
import pandas as pd
import numpy as np
# Workaround known bug on tensorflow https://github.com/jdb78/pytorch-forecasting/issues/58
import tensorflow as tf
import tensorboard as tb
tf.io.gfile = tb.compat.tensorflow_stub.io.gfile
from pytorch_forecasting import TemporalFusionTransformer, TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer
from pytorch_forecasting.metrics import MAE
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
import torch
import pytorch_forecasting  # needed for the version check below
torch.__version__ # 1.9.0+cu102
pytorch_forecasting.__version__ # 0.9.0
# Create dummy train/val datasets
df_train = list()
df_val = list()
for g in range(10):
    # training chunk: time_idx 0..29 for group g
    dft = pd.DataFrame([])
    dft["time_idx"] = range(30)
    dft["known_real"] = np.random.rand(30)
    dft["unknown_real"] = np.random.rand(30)
    dft["target"] = np.random.rand(30)
    dft["group"] = str(g)
    df_train.append(dft)
    # validation chunk: time_idx 30..59 for the same group
    dfv = pd.DataFrame([])
    dfv["time_idx"] = range(30, 60)
    dfv["known_real"] = np.random.rand(30)
    dfv["unknown_real"] = np.random.rand(30)
    dfv["target"] = np.random.rand(30)
    dfv["group"] = str(g)
    df_val.append(dfv)
df_train = pd.concat(df_train, ignore_index=True)
df_val = pd.concat(df_val, ignore_index=True)
# Define train TimeSeriesDataset and corresponding dataloader
batch_size = 4
max_encoder_length = 5
max_prediction_length = 3
training = TimeSeriesDataSet(
df_train,
time_idx="time_idx",
target="target",
group_ids=["group"],
min_encoder_length=max_encoder_length, # fixed look-back window, since min_encoder_length == max_encoder_length
max_encoder_length=max_encoder_length,
min_prediction_length=max_prediction_length,
max_prediction_length=max_prediction_length,
static_categoricals=["group"],
static_reals=[],
time_varying_known_reals=["known_real"],
time_varying_unknown_reals=["unknown_real", "target"],
target_normalizer=GroupNormalizer(
groups=["group"], transformation="softplus"
), # use softplus and normalize by group
add_relative_time_idx=False, # add relative time_idx as feature
add_target_scales=False, # add target scales as static real features
add_encoder_length=False, # add encoder length as static real features
)
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=4)
### Approach one predict=False
# Define val_dataloader with predict=False
validation = TimeSeriesDataSet.from_dataset(training, df_val, predict=False, stop_randomization=False)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=4)
# Define model and trainer
#mc = pl.callbacks.ModelCheckpoint(monitor='val_loss')
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=15, verbose=False, mode="min")
lr_logger = LearningRateMonitor() # log the learning rate
#logger = TensorBoardLogger("lightning_logs/") # logging results to a tensorboard
trainer = pl.Trainer(
max_epochs=1,
gpus=1,
weights_summary="top",
gradient_clip_val=0.1,
limit_train_batches=30, # comment in for training, running validation every 30 batches
# fast_dev_run=True, # comment in to check that network or dataset has no serious bugs
callbacks=[early_stop_callback]
)
tft = TemporalFusionTransformer.from_dataset(
training,
learning_rate=0.1,
hidden_size=4,
attention_head_size=1,
dropout=0.1,
hidden_continuous_size=4,
output_size=1,
loss=MAE(),
log_interval=10, # uncomment for learning rate finder and otherwise, e.g. to 10 for logging every 10 batches
reduce_on_plateau_patience=6,
)
# Train -> Each training epoch has 88 batches
trainer.fit(
tft,
train_dataloader=train_dataloader,
val_dataloaders=val_dataloader,
)
### Approach two predict=True
# Define val_dataloader with predict=True
validation = TimeSeriesDataSet.from_dataset(training, df_val, predict=True, stop_randomization=True)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=4)
# Define model and trainer
#mc = pl.callbacks.ModelCheckpoint(monitor='val_loss')
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=15, verbose=False, mode="min")
lr_logger = LearningRateMonitor() # log the learning rate
#logger = TensorBoardLogger("lightning_logs/") # logging results to a tensorboard
trainer = pl.Trainer(
max_epochs=1,
gpus=1,
weights_summary="top",
gradient_clip_val=0.1,
limit_train_batches=30, # comment in for training, running validation every 30 batches
# fast_dev_run=True, # comment in to check that network or dataset has no serious bugs
callbacks=[early_stop_callback]
)
tft = TemporalFusionTransformer.from_dataset(
training,
learning_rate=0.1,
hidden_size=4,
attention_head_size=1,
dropout=0.1,
hidden_continuous_size=4,
output_size=1,
loss=MAE(),
log_interval=10, # uncomment for learning rate finder and otherwise, e.g. to 10 for logging every 10 batches
reduce_on_plateau_patience=6,
)
# Train -> Each training epoch has 33 batches
trainer.fit(
tft,
train_dataloader=train_dataloader,
val_dataloaders=val_dataloader,
)
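One possible reading of the 88 vs. 33 figures, offered as a sketch rather than a confirmed diagnosis: the two totals match "training batches + validation batches" exactly if the progress bar of this PyTorch Lightning version counts both together (that is an assumption, as is the helper code below, which is plain Python and does not call the library). Under that reading, the training epoch still runs 30 batches in both approaches, and only the validation part grows.

```python
import math

# Hypothetical helpers for illustration only; they are not part of pytorch-forecasting.
def windows_per_series(series_length, encoder_length, prediction_length):
    """Count fixed-length (encoder + prediction) windows that fit in one series."""
    return max(series_length - (encoder_length + prediction_length) + 1, 0)

def num_batches(num_samples, batch_size, drop_last):
    """Batches produced by a dataloader, with or without dropping the last partial batch."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

# Values taken from the reproduction script above.
n_groups, series_length = 10, 30
encoder_length, prediction_length, batch_size = 5, 3, 4
limit_train_batches = 30

# Training: every window of every group, last partial batch dropped (train=True),
# then capped by limit_train_batches.
train_samples = n_groups * windows_per_series(series_length, encoder_length, prediction_length)
train_batches = min(num_batches(train_samples, batch_size, drop_last=True), limit_train_batches)

# Validation with predict=False: every window of every group in df_val.
val_batches_all = num_batches(train_samples, batch_size, drop_last=False)

# Validation with predict=True: only the last window of each group.
val_batches_last = num_batches(n_groups, batch_size, drop_last=False)

# Assumption: the progress bar total is train batches + validation batches.
total_predict_false = train_batches + val_batches_all
total_predict_true = train_batches + val_batches_last
print(total_predict_false, total_predict_true)
```

If this reading is right, the model is trained on the same 30 batches either way; with predict=False the validation epoch simply iterates over 58 batches instead of 3.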
Hi, I independently came to the same conclusions. It would be very useful to improve the tutorial by proposing a validation method other than just "over the last sample", as imposed by predict=True. Most people would like to validate over several sequences in a given lookback window:
Cutoff_Date = data['Datetime'].max() - pd.to_timedelta('30D')
data_train = data[data['Datetime'] < Cutoff_Date]
data_val = data[data['Datetime'] >= Cutoff_Date]
batch_size = 128
validation = TimeSeriesDataSet.from_dataset(training, data, predict=True, stop_randomization = True)
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers = 0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers = 0)
It is also unclear to most users whether stop_randomization should be set to True or False, depending on the context.
I ran into this as well. I'm still not sure whether the increase in batches per epoch caused by setting predict=False is a bug or expected behavior. I'm also not sure whether stop_randomization should be set to True or False.
I would also like clarification on why a validation set longer than max_prediction_length is not implemented, advised, or exemplified ...
@jdb78 @josesydor @Emungai @polal2is @chefPony
Validating over only the last sample of each group sometimes makes the model overfit to that last sample, so I tried the following to validate on longer sequences. FYI:
validation = TimeSeriesDataSet.from_dataset(training, data, min_encoder_length=max_encoder_length, max_encoder_length=max_encoder_length, predict=False, stop_randomization=True, min_prediction_idx=training_cutoff + 1)
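To illustrate what the min_prediction_idx variant above changes, here is a rough plain-Python sketch. It assumes that min_prediction_idx keeps only windows whose first predicted time_idx is at or after the given value; the function name and the numbers (taken from the dummy data earlier in the thread: time_idx 0..59, cutoff at 29, encoder length 5, prediction length 3) are illustrative and not library code.

```python
# Hypothetical helper for illustration only; it mimics window selection, not the library internals.
def window_starts(first_idx, last_idx, encoder_length, prediction_length, min_prediction_idx=None):
    """Return the start index of every full window, optionally filtered by first prediction index."""
    window = encoder_length + prediction_length
    starts = []
    for start in range(first_idx, last_idx - window + 2):
        first_prediction = start + encoder_length
        if min_prediction_idx is None or first_prediction >= min_prediction_idx:
            starts.append(start)
    return starts

training_cutoff = 29
encoder_length, prediction_length = 5, 3

# Full series passed in, predictions restricted to start after the cutoff.
with_min_idx = window_starts(0, 59, encoder_length, prediction_length,
                             min_prediction_idx=training_cutoff + 1)

# Only the post-cutoff chunk passed in, no min_prediction_idx.
separate_chunk = window_starts(30, 59, encoder_length, prediction_length)

print(len(with_min_idx), len(separate_chunk))
```

Under these assumptions, passing the full data with min_prediction_idx yields a few more validation windows per series than passing only the post-cutoff chunk, because the encoder of the earliest windows may reach back into the training period while every predicted point still lies after the cutoff.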