Handling multiple multi-variate time-series in a Dataset
Apologies if this capability already exists; I went through the examples, issues (like #1095), and discussions, but couldn't figure out an existing way of (satisfactorily) doing this.
Description
This is related to #695: a requirement to handle multiple multi-variate time-series wrapped in a ListDataset.
As an example, consider dummy data of 10 multi-variate time-series, each of shape (19, 300), i.e. length=300 and num_feat=19:
data = [{'start': ts, 'target': np.random.randn(19, 300), 'freq': freq} for _ in range(10)]
dataset = ListDataset(data, freq=freq, one_dim_target=False)
I cannot concatenate all 10 chunks into one because of irregular sampling. Example: data is sampled for a month, then offline for two months, then sampled again for a month, and so on. I have cleaned the dataset such that all 10 time-series chunks have the same freq.
Padding or interpolation isn't possible/ideal because of the long sampling breaks in the data. And each individual time-series is sufficiently longer than prediction_len + context_len, so intra-timeseries batch-sampling is not an issue.
Is such a Dataset formulation possible currently? FWIW, I tried doing as shown above and got this:
Exception: Reached maximum number of idle transformation calls.
This means the transformation looped over GLUONTS_MAX_IDLE_TRANSFORMS=100 inputs without returning any output.
This occurred in the following transformation:
gluonts.transform.split.InstanceSplitter(dummy_value=0.0, forecast_start_field="forecast_start", future_length=10, instance_sampler=gluonts.transform.sampler.ExpectedNumInstanceSampler(axis=-1, min_past=338, min_future=10, num_instances=1.0, total_length=0, n=0), is_pad_field="is_pad", lead_time=0, output_NTC=True, past_length=338, start_field="start", target_field="target", time_series_fields=["time_feat", "observed_values"])
Hi @hashbangCoder,
can you illustrate the problem with a data snippet? Not sure I got it.
@hashbangCoder
The error
Exception: Reached maximum number of idle transformation calls.
usually comes up when many of your time series are shorter than your forecasting horizon. You might want to try reducing the forecasting horizon and check if the error still appears.
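Roughly, the splitter needs about context_length + prediction_length time steps per series (more once lags are added), unless incomplete windows are allowed. A quick sanity check along those lines (untested sketch; dataset, context_length, and prediction_length stand for whatever you actually pass to the estimator):
import numpy as np

too_short = [
    i
    for i, entry in enumerate(dataset)
    if np.asarray(entry["target"]).shape[-1] < context_length + prediction_length
]
print(too_short)  # indices of series the InstanceSplitter cannot sample from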
Thanks for the response @mbohlkeschneider.
Apologies for the delay in responding. My data is a multivariate time-series that is irregularly sampled; for this example, I'm using a toy dataset. I've created a Python code snippet with comments that hopefully explains my issue better; it takes about 2-3 minutes to run.
I was working with pytorch-ts (which is built on gluonts), but while debugging I discovered the issue is with gluonts itself, so I managed to recreate it with the DeepVAR model.
I'm running this on CPU and Windows with Python 3.7.
import pandas as pd
import numpy as np
from gluonts.dataset.common import ListDataset
from gluonts.model.deepvar import DeepVAREstimator
from gluonts.mx.trainer import Trainer as GluonTrainer
# 1M timestamped dataset of 19 sensor values, but irregularly sampled
sensor_data = pd.DataFrame(np.random.randn(1000000, 19))
# `sample_inds` is list of tuples `[(start_ind1, end_ind1), (start_ind2, end_ind2), etc]` where each tuple is start and end index for subsampling by slicing from `sensor_data`
# such that within each subsequence (eg: `sensor_data.iloc[start_ind1: end_ind1, :]`) timestamps are evenly spaced
# all sub-sequences are of same length (assume 300)
# for this example, sample_inds is randomly generated
sample_inds = [(0, 300)]
# generate 100 sub-sequences by randomly sampling indices such that every sub-seq len == 300
for _ in range(100):
    low = np.random.randint(sample_inds[-1][1], sample_inds[-1][1] + 1000)
    high = low + 300
    sample_inds.append((low, high))
# assume freq is 1min, prediction/forecast length = 50, context length = 250
# so the model looks at 250mins of data (250 samples) and predicts 50 mins; total len = 300
freq = '1min'
prediction_len = 50
# create train dataset
forecast_train = []
ts = pd.Timestamp('24th Aug 2009')
for start_ind, end_ind in sample_inds:
    # randomly increasing timestamp, irrelevant for example
    ts = ts + pd.Timedelta(f'{np.random.randint(5, 10)}D')
    forecast_train.append(
        {'start': ts, 'target': sensor_data.iloc[start_ind: end_ind - prediction_len, :].values.transpose(),
         'freq': freq})

estimator = DeepVAREstimator(target_dim=19,
                             prediction_length=50,
                             context_length=250,
                             freq="1min",
                             trainer=GluonTrainer(epochs=10))
print('start training')
forecast_train = ListDataset(forecast_train, freq=freq, one_dim_target=False)
predictor = estimator.train(forecast_train)
I get this error:
File "C:\Users\hashbangcoder\AppData\Roaming\Python\Python37\site-packages\gluonts\transform\_base.py", line 142, in __call__
f"Reached maximum number of idle transformation calls.\n"
Exception: Reached maximum number of idle transformation calls.
This means the transformation looped over GLUONTS_MAX_IDLE_TRANSFORMS=100 inputs without returning any output.
This occurred in the following transformation:
gluonts.transform.split.InstanceSplitter(dummy_value=0.0, forecast_start_field="forecast_start", future_length=50, instance_sampler=gluonts.transform.sampler.ExpectedNumInstanceSampler(axis=-1, min_past=251, min_future=50, num_instances=1.0, total_length=0, n=0), is_pad_field="is_pad", lead_time=0, output_NTC=True, past_length=251, start_field="start", target_field="target", time_series_fields=["time_feat", "observed_values"])
I hope you understood why I cannot concatenate all the sub-sequences together or interpolate values between sub-sequences, due to the irregular sampling of sensor_data.
So my question boils down to: is there any way I can combine multiple multi-variate time-series in a single ListDataset?
Also @StatMixedML, my forecasting horizon is 50 and my time-series are of length 250. Even with forecast horizon = 20, it's the same error. I suspect it's due to the way the Sampler works with multiple time-series in a ListDataset.
I'm thinking that if I do concatenate all my sub-sequences into a single 2D target and use a custom Sampler (subclassing InstanceSampler) that samples only at the indices corresponding to the starts of my sub-sequences (all of exactly the same length), it should work? Something like the rough sketch below is what I have in mind.
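(Untested; SubSequenceSampler and split_points are made-up names, and this assumes the InstanceSampler base class with the axis/min_past/min_future fields shown in the sampler repr of the traceback above.)
from typing import List

import numpy as np
from gluonts.transform.sampler import InstanceSampler


class SubSequenceSampler(InstanceSampler):
    # Hypothetical field: for each evenly-sampled chunk, the index at which
    # the forecast window should start (chunk start + context length), so
    # that past and future never straddle a sampling gap.
    split_points: List[int] = []

    def __call__(self, ts: np.ndarray) -> np.ndarray:
        # Keep only split points that leave enough history behind them and
        # enough future ahead of them, per min_past / min_future.
        a = self.min_past
        b = ts.shape[self.axis] - self.min_future
        return np.array([i for i in self.split_points if a <= i <= b], dtype=int)
The idea would be to compute split_points from sample_inds and pass the sampler via the estimator's train_sampler argument, assuming one is exposed.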
@hashbangCoder Are you using MultivariateGrouper and grouper_train for grouping the data? For an example of how to use it, see here.
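Roughly, the usual pattern looks like this (untested sketch, assuming univariate input series that should be stacked into a single multivariate target):
from gluonts.dataset.common import ListDataset
from gluonts.dataset.multivariate_grouper import MultivariateGrouper

# 19 univariate series covering the same time range
univariate_ds = ListDataset(
    [{"start": "2009-08-24", "target": [1.0] * 300} for _ in range(19)],
    freq="1min",
)

grouper_train = MultivariateGrouper(max_target_dim=19)
train_ds = grouper_train(univariate_ds)  # single entry with a (19, 300) target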
Hi @hashbangCoder,
Thank you for the snippet. The transformation fails because the InstanceSplitter cannot draw long enough samples from your data. Your data has 250 timestamps, and with context_length=250 and prediction_length=50 you are asking to draw samples of length 300. Thus, the transformation fails. You can set the parameter pick_incomplete=True in the DeepVAREstimator; then the samples will be padded. However, I would suggest reducing the context_length. 250 is quite high, and a model with a much lower context_length will probably do better and run a lot faster.
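For example, adapting the estimator construction from your snippet (untested; either change on its own should already let the splitter draw instances from 250-step series):
from gluonts.model.deepvar import DeepVAREstimator
from gluonts.mx.trainer import Trainer as GluonTrainer

estimator = DeepVAREstimator(
    target_dim=19,
    prediction_length=50,
    context_length=100,       # well below series length minus prediction_length
    freq="1min",
    pick_incomplete=True,     # pad windows that would otherwise be too short
    trainer=GluonTrainer(epochs=10),
)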
Hello @mbohlkeschneider, unfortunately my @hashbangCoder account has issues with its 2FA device and I'm unable to log in from there, so I'm using an older account.
I understand the issue is in ExpectedNumInstanceSampler, and I can lower my context_len and prediction_len and test it out.
# quoted from ExpectedNumInstanceSampler: the sampling bounds and the
# resulting window size (non-positive when the series is too short)
a, b = self._get_bounds(ts)
window_size = b - a + 1
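For reference, a rough sketch of what those bounds imply for the reproduction above, using the numbers from the sampler repr in the traceback (off-by-one details aside, and assuming a is driven by min_past and b by min_future):
series_len = 250                  # each target in the snippet above
min_past, min_future = 251, 50    # from the sampler repr in the traceback
a = min_past                      # earliest admissible split index (roughly)
b = series_len - min_future       # latest admissible split index (roughly)
window_size = b - a + 1
print(window_size)                # negative, so no admissible split index exists
Since no series ever yields a positive window, the sampler returns no instances and the InstanceSplitter eventually raises the idle-transforms exception.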
I do have a few follow-up questions on ExpectedNumInstanceSampler and history_length as used in DeepVAR (if you don't mind).
- As I understand it, context_len (i.e. what the RNN sees) is increased by self.lags_seq, which depends on the freq. I don't understand why this is needed over and above the user-specified context_len:
self.history_length = self.context_length + max(self.lags_seq)
- The ExpectedNumInstanceSampler slices input sequences for the RNN with seq_start_index > history_length. If I have freq=1H and context_len=24, then the sampler slices with indices between (24 + 168) and total_len - prediction_len:
self.train_sampler = (
    train_sampler
    if train_sampler is not None
    else ExpectedNumInstanceSampler(
        num_instances=1.0,
        min_past=0 if pick_incomplete else self.history_length,
        min_future=prediction_length,
    )
)
- Can we exclude the extra lags_seq added to context_len?
You can set lags_seq=1 to just use the previous value as a lag. However, I would advise reducing the context length and keeping the lags.
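For instance (untested sketch; lags_seq is assumed here to take a list of lag indices):
from gluonts.model.deepvar import DeepVAREstimator
from gluonts.mx.trainer import Trainer as GluonTrainer

estimator = DeepVAREstimator(
    target_dim=19,
    prediction_length=50,
    context_length=50,    # much shorter than the original 250
    lags_seq=[1],         # only the previous value as a lag (assumed list form)
    freq="1min",
    trainer=GluonTrainer(epochs=10),
)
# history_length = context_length + max(lags_seq) = 51, so a 250-step series
# comfortably covers history_length + prediction_length during splitting.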
Thanks a lot for your help. Will try this to see how well it works.
Hey! I am facing the same issue as well. Has it been resolved?