
Irregular time series support

Open kashif opened this issue 2 years ago • 19 comments

Issue #, if available:

Description of changes:

Adding support for irregular time series, where the datetimes are given by an INDEX key in the dataset

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Please tag this PR with at least one of these labels to make our release process faster: BREAKING, new feature, bug fix, other change, dev setup

kashif avatar May 12 '22 10:05 kashif

Just to make sure I understand correctly: "irregular" means that the values are not spaced apart by the same time offset?

jaheba avatar May 16 '22 07:05 jaheba

Right! Which then implies we need to specify the time index for each point of the target... I believe I have training working, but I'm stuck at the inference part... I would appreciate some help! Let me push the latest changes and an example to test...

kashif avatar May 16 '22 07:05 kashif

@jaheba I am using the following snippet to test:

import random

import pandas as pd

from gluonts.dataset.common import ListDataset
from gluonts.torch.model.deepar import DeepAREstimator
from gluonts.evaluation import make_evaluation_predictions, Evaluator

freq_str = "1T"
prediction_length = 10

# Sample an irregular subset of an otherwise regular minute-frequency range.
index = sorted(
    random.sample(
        list(
            pd.date_range(
                "2021-01-01 02:12:06",
                periods=400,
                freq=freq_str,
            )
        ),
        50 + prediction_length,
    )
)

train_ds = ListDataset(
    [
        {
            "start": index[0],
            "target": [float(i) for i in range(50)],
            "item_id": "0",
            # the new INDEX field: one datetime per target point
            "index": index[:50],
        },
    ],
    freq=freq_str,
)

test_ds = ListDataset(
    [
        {
            "start": index[0],
            "target": [float(i) for i in range(50 + prediction_length)],
            "item_id": "0",
            "index": index[: 50 + prediction_length],
        }
    ],
    freq=freq_str,
)

estimator = DeepAREstimator(
    freq=freq_str,
    prediction_length=prediction_length,
    scaling=False,
    trainer_kwargs={"max_epochs": 1},
)

predictor = estimator.train(train_ds)

forecast_it, ts_it = make_evaluation_predictions(
    dataset=test_ds,
    predictor=predictor,
)

forecasts = list(forecast_it)
tss = list(ts_it)

evaluator = Evaluator()
agg_metrics, ts_metrics = evaluator(
    iter(tss), iter(forecasts), num_series=len(test_ds)
)

kashif avatar May 16 '22 07:05 kashif

Can we add the snippet as a test?

jaheba avatar May 16 '22 08:05 jaheba

Sure... let me do that...

kashif avatar May 16 '22 08:05 kashif

Something I'm wondering: if we merge this PR, we break the assumption that time series are regular. Is that an assumption we rely on in other places? /cc @lostella

jaheba avatar May 16 '22 08:05 jaheba

As long as there is no INDEX field, the regular-time-interval assumption holds and currently everything proceeds as before... if there is an INDEX, then this assumption is broken, and as far as I can tell it is relied on in:

  • time features
  • lags
  • age feature
  • padding in the Splitters? <- not sure about this yet...
  • inference <- currently not working here

Apart from that, I do not believe there is anything else relying on this assumption... (as far as I can tell)
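
For example, the age feature bakes the assumption in via integer positions (roughly what AddAgeFeature computes, if I recall correctly); with an irregular index, the elapsed time would have to come from the timestamps themselves. A sketch, not the PR's code:

import numpy as np
import pandas as pd

# A small, irregularly spaced index.
index = pd.DatetimeIndex(
    ["2021-01-01 02:12", "2021-01-01 02:13", "2021-01-01 02:19", "2021-01-01 02:40"]
)

# Regular case: age is just the integer position, since consecutive points
# are exactly one freq step apart.
age_regular = np.log10(2.0 + np.arange(len(index)))

# Irregular case: position no longer measures elapsed time, so age has to
# be derived from the timestamps, e.g. minutes since the first observation.
minutes = (index - index[0]) / pd.Timedelta("1T")
age_irregular = np.log10(2.0 + minutes.to_numpy())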

kashif avatar May 16 '22 08:05 kashif

So the one big missing piece is the InstanceSplitter, which is currently setting the wrong forecast_start_field.
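
Concretely (a sketch, not the actual splitter code):

import random
import pandas as pd

freq_str = "1T"
past_length = 50

# An irregular per-series index, built like in the snippet above.
index = pd.DatetimeIndex(sorted(random.sample(
    list(pd.date_range("2021-01-01 02:12:06", periods=400, freq=freq_str)),
    60,
)))

# Regular case: the splitter derives the forecast start by offset arithmetic.
forecast_start_regular = pd.Period(index[0], freq=freq_str) + past_length

# Irregular case: that arithmetic lands on the wrong timestamp; the correct
# forecast start is simply the next entry of the index.
forecast_start_irregular = index[past_length]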

kashif avatar May 16 '22 09:05 kashif

@jaheba #1975 is the change to use FieldName. Once merged I'll fix up this PR wrt the conflicts.

So I believe, basically, this functionality is working... could you kindly see if there might be some edge cases? One I believe is the padding in the instance splitters...
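
To make the padding question concrete (a hypothetical sketch):

import numpy as np

past_length = 60
target = np.arange(50, dtype=float)

# The series is shorter than the required context, so the splitter
# left-pads the target (and marks the padded positions as unobserved).
pad = past_length - len(target)
padded_target = np.concatenate([np.zeros(pad), target])

# Regular case: the padded positions implicitly sit at start - k * freq.
# Irregular case: there is no freq to extrapolate the index backwards with,
# so it is unclear which timestamps the padded positions should carry.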

kashif avatar May 16 '22 14:05 kashif

Something I'm wondering: if we merge this PR, we break the assumption that time series are regular. Is that an assumption we rely on in other places?

The "uniform sampling" assumption is baked in a few of different places, at least on the surface. For example:

  • estimators are usually configured with a freq which they can use for internal logic;
  • predictors have a freq which is used to determine where in time the forecast should start, and what time points it should cover (see the sketch after this list);
  • the start attribute in data entries, which ends up being a pd.Timestamp with a certain frequency (although this is deprecated in pandas, see #1939); the start field could be redundant in case index is provided, and this is a potential source of very annoying bugs.
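
To illustrate the second point, a sketch (not the library's actual code path):

import pandas as pd

freq_str = "1T"
context_length, prediction_length = 50, 10

# Uniform sampling: every timestamp the forecast covers is recoverable
# from start and freq alone.
start = pd.Period("2021-01-01 02:12:06", freq=freq_str)
forecast_range = pd.period_range(
    start=start + context_length, periods=prediction_length, freq=freq_str
)

# With an irregular "index" field, the true forecast timestamps would be
# index[context_length : context_length + prediction_length] instead.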

From the model perspective, nothing prevents you from breaking the "uniform sampling" assumption, even in RNN-based models like DeepAR. Whether it will work well, that's another question.

I'll need to take a deeper look.

lostella avatar May 17 '22 07:05 lostella

@lostella I believe I have taken care of all three issues you kindly listed, and I am now able to train and do inference on irregular time series data.

The freq string is used for two purposes, and its use remains valid in the irregular case. There, the freq string denotes the resolution at which the time series target data was collected. In the estimators, as you know, freq is used to create time features, and that is still valid in the irregular case. It's only where the library creates date ranges from start and freq that I need to use the data's index instead.
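
For example, calendar features are per-timestamp, so something like this still works on an irregular index (a sketch using time_features_from_frequency_str; depending on the gluonts version, the features expect a DatetimeIndex or a PeriodIndex):

import pandas as pd

from gluonts.time_feature import time_features_from_frequency_str

freq_str = "1T"

# Every 7th minute of an otherwise regular range: irregularly spaced, but
# each timestamp still has well-defined calendar features.
index = pd.period_range("2021-01-01 02:12", periods=400, freq=freq_str)[::7]

features = [feat(index) for feat in time_features_from_frequency_str(freq_str)]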

The reason I need start is that it's a convenient place for me to store the datetimes' freq, which is used in a bunch of places... therefore, for now, I need both the start and index fields...

The only edge case I have in mind right now is the padding in the Splitters... if you could kindly have a closer look at that, I would appreciate it!

Thanks!

kashif avatar May 17 '22 07:05 kashif

@jaheba another potential issue I have is that if I run the code snippet above and then call forecasts[0] in Jupyter, I get the following error in the serializer:

forecasts[0]

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~/.env/pytorch/lib/python3.8/site-packages/IPython/core/formatters.py:707, in PlainTextFormatter.__call__(self, obj)
    700 stream = StringIO()
    701 printer = pretty.RepresentationPrinter(stream, self.verbose,
    702     self.max_width, self.newline,
    703     max_seq_length=self.max_seq_length,
    704     singleton_pprinters=self.singleton_printers,
    705     type_pprinters=self.type_printers,
    706     deferred_pprinters=self.deferred_printers)
--> 707 printer.pretty(obj)
    708 printer.flush()
    709 return stream.getvalue()

File ~/.env/pytorch/lib/python3.8/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
    407                         return meth(obj, self, cycle)
    408                 if cls is not object \
    409                         and callable(cls.__dict__.get('__repr__')):
--> 410                     return _repr_pprint(obj, self, cycle)
    412     return _default_pprint(obj, self, cycle)
    413 finally:

File ~/.env/pytorch/lib/python3.8/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
    776 """A pprint that just redirects to the normal repr function."""
    777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
    779 lines = output.splitlines()
    780 with p.group():

File ~/gluon-ts-PR/src/gluonts/core/component.py:310, in validated.<locals>.validator.<locals>.validated_repr(self)
    309 def validated_repr(self) -> str:
--> 310     return dump_code(self)

File ~/gluon-ts-PR/src/gluonts/core/serde/_repr.py:115, in dump_code(o)
     95 def dump_code(o: Any) -> str:
     96     """
     97     Serializes an object to a Python code string.
     98 
   (...)
    112         Inverse function.
    113     """
--> 115     return as_repr(encode(o))

File /usr/lib/python3.8/functools.py:875, in singledispatch.<locals>.wrapper(*args, **kw)
    871 if not args:
    872     raise TypeError(f'{funcname} requires at least '
    873                     '1 positional argument')
--> 875 return dispatch(args[0].__class__)(*args, **kw)

File ~/gluon-ts-PR/src/gluonts/core/serde/_base.py:220, in encode(v)
    212 if hasattr(v, "__getnewargs_ex__"):
    213     args, kwargs = v.__getnewargs_ex__()  # mypy: ignore
    215     return {
    216         "__kind__": Kind.Instance,
    217         "class": fqname_for(v.__class__),
    218         # args need to be a list, since we encode tuples explicitly
    219         "args": encode(list(args)),
--> 220         "kwargs": encode(kwargs),
    221     }
    223 try:
    224     # as fallback, we try to just take the path of the value
    225     fqname = fqname_for(v)

File /usr/lib/python3.8/functools.py:875, in singledispatch.<locals>.wrapper(*args, **kw)
    871 if not args:
    872     raise TypeError(f'{funcname} requires at least '
    873                     '1 positional argument')
--> 875 return dispatch(args[0].__class__)(*args, **kw)

File ~/gluon-ts-PR/src/gluonts/core/serde/_base.py:207, in encode(v)
    204     return list(map(encode, v))
    206 if isinstance(v, dict):
--> 207     return valmap(encode, v)
    209 if isinstance(v, type):
    210     return {"__kind__": Kind.Type, "class": fqname_for(v)}

File ~/.env/pytorch/lib/python3.8/site-packages/toolz/dicttoolz.py:83, in valmap(func, d, factory)
     72 """ Apply function to values of dictionary
     73 
     74 >>> bills = {"Alice": [20, 15, 30], "Bob": [10, 35]}
   (...)
     80     itemmap
     81 """
     82 rv = factory()
---> 83 rv.update(zip(d.keys(), map(func, d.values())))
     84 return rv

File /usr/lib/python3.8/functools.py:875, in singledispatch.<locals>.wrapper(*args, **kw)
    871 if not args:
    872     raise TypeError(f'{funcname} requires at least '
    873                     '1 positional argument')
--> 875 return dispatch(args[0].__class__)(*args, **kw)

File ~/gluon-ts-PR/src/gluonts/core/serde/_base.py:243, in encode(v)
    240 except AttributeError:
    241     pass
--> 243 raise RuntimeError(bad_type_msg.format(fqname_for(v.__class__)))

RuntimeError: Cannot serialize type pandas.core.indexes.datetimes.DatetimeIndex. See the documentation of the `encode` and
`validate` functions at

    http://gluon-ts.mxnet.io/api/gluonts/gluonts.html

and the Python documentation of the `__getnewargs_ex__` magic method at

    https://docs.python.org/3/library/pickle.html#object.__getnewargs_ex__

for more information how to make this type serializable.

kashif avatar May 17 '22 10:05 kashif

@jaheba OK, I added a DatetimeIndex encoder which I believe fixes the above issue...
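
For reference, a minimal sketch of what such an encoder could look like, given the singledispatch encode visible in the traceback above (the actual commit may differ):

from typing import Any

import pandas as pd

from gluonts.core.serde._base import Kind, encode


@encode.register(pd.DatetimeIndex)
def encode_pd_datetime_index(v: pd.DatetimeIndex) -> Any:
    # Store the timestamps as ISO strings, plus the freq (if any), so a
    # decoder can rebuild the index via pd.DatetimeIndex(values, freq=freq).
    return {
        "__kind__": Kind.Instance,
        "class": "pandas.DatetimeIndex",
        "args": encode([[str(ts) for ts in v]]),
        "kwargs": {"freq": encode(v.freqstr)},
    }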

kashif avatar May 19 '22 17:05 kashif

I will update this once PR #1980 is merged, as I can then base my changes on that.

kashif avatar May 30 '22 08:05 kashif

@jaheba and @lostella, this is ready for a review. If you could closely check the padding in the splitters, I would appreciate it! Thank you!

kashif avatar May 31 '22 12:05 kashif

@rsnirwan would you have some time to quickly see how I can also use the new pandas datasets with irregular time indices in this PR?

kashif avatar Jul 01 '22 07:07 kashif

@kashif @jaheba @lostella Is this resolved? Can we do irregular time series forecasting with DeepAR now? This is highly needed for some practical use cases.

AjinkyaBankar avatar Dec 01 '23 18:12 AjinkyaBankar

Hmm @AjinkyaBankar, kind of: things were refactored and this branch kind of got left behind... I would love to get it up to speed, but it might require a new PR.

kashif avatar Dec 01 '23 19:12 kashif

@kashif I am excited to use this as soon as possible. Let me know if I can help with this.

AjinkyaBankar avatar Dec 02 '23 00:12 AjinkyaBankar