GluonTS Bug With Uber TLC Daily Dataset

Description

The entries of the Uber TLC daily dataset have no item_id field. As a result, splitting the dataset with an OffsetSplitter fails with KeyError: 'item_id'.

To Reproduce

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.dataset.split import OffsetSplitter

uber_dataset = get_dataset("uber_tlc_daily")
splitter = OffsetSplitter(prediction_length=7, split_offset=-7)
split = splitter.split(uber_dataset.train)

Error message or code output

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_29164/1151371949.py in <cell line: 4>()
      2 
      3 splitter = OffsetSplitter(prediction_length=7, split_offset=-7)
----> 4 split = splitter.split(uber_dataset.train)

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/gluonts/dataset/split/splitter.py in split(self, items)
    212         split = TrainTestSplit()
    213 
--> 214         for item in map(TimeSeriesSlice.from_data_entry, items):
    215 
    216             train = self._train_slice(item)

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/gluonts/dataset/split/splitter.py in from_data_entry(cls, item, freq)
     98         return TimeSeriesSlice(
     99             target=pd.Series(item["target"], index=index),
--> 100             item=item[FieldName.ITEM_ID],
    101             feat_static_cat=feat_static_cat,
    102             feat_static_real=feat_static_real,

KeyError: 'item_id'

Luckily, I think there is an easy fix. The Uber data has a location ID, which I think can be used as the item ID. I propose the following change, starting at the line in the Uber TLC loader file where the test entry is built:

test_format_dict = {
    "start": start_time,
    "target": target,
    "feat_static_cat": feat_static_cat,
    "item_id": locationID, # new line
}
test_data.append(test_format_dict)

train_format_dict = {
    "start": start_time,
    "target": target[:-prediction_length],
    "feat_static_cat": feat_static_cat,
    "item_id": locationID, # new line
}
train_data.append(train_format_dict)
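Until the loader is patched, one possible workaround is to attach an ID to each entry before splitting. The following is a minimal sketch, assuming the dataset entries are plain dicts. The helper with_item_ids is hypothetical (it is not part of GluonTS), and using the position in the dataset as a fallback ID is an assumption, not the location ID proposed above.

```python
from typing import Any, Dict, Iterable, Iterator


def with_item_ids(entries: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield copies of the entries, adding an "item_id" where it is missing."""
    for index, entry in enumerate(entries):
        patched = dict(entry)  # shallow copy; the original entry is not modified
        # Assumption: the position in the dataset is an acceptable fallback ID.
        patched.setdefault("item_id", str(index))
        yield patched


# Toy entries standing in for uber_dataset.train (values are made up).
entries = [
    {"start": "2015-01-01", "target": [1.0, 2.0, 3.0], "feat_static_cat": [0]},
    {"start": "2015-01-01", "target": [4.0, 5.0, 6.0], "feat_static_cat": [1]},
]
patched_entries = list(with_item_ids(entries))
```

In principle, with_item_ids(uber_dataset.train) could then be passed to splitter.split in place of uber_dataset.train. This has not been verified against GluonTS 0.10.2.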

Environment

Operating system: Amazon Linux 2 (SageMaker)
Python version: 3.8.12
GluonTS version: 0.10.2
MXNet version: N/A (using PyTorch)

Aug 04 '22 23:08 mvanness354

@mvanness354 thank you for spotting this! The change you propose makes sense to me, assuming that locationIDs are indeed distinct across the dataset. Do you want to open a PR for this?
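The distinctness assumption can be checked before relying on the location ID. The following is a minimal sketch, assuming each entry is a plain dict that already carries an item_id. The helper duplicated_item_ids and the sample IDs are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def duplicated_item_ids(entries: Iterable[Dict[str, Any]]) -> List[Any]:
    """Return every item_id that occurs more than once, in first-seen order."""
    counts = Counter(entry["item_id"] for entry in entries)
    return [item_id for item_id, count in counts.items() if count > 1]


# Made-up IDs for illustration; "loc_a" appears twice.
sample = [{"item_id": "loc_a"}, {"item_id": "loc_b"}, {"item_id": "loc_a"}]
duplicates = duplicated_item_ids(sample)
```

An empty result means the IDs are distinct. The check would need to run on the train and test entries separately, since the proposed change writes the same location ID once into each.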

As a side note: the splitting functionality is changing slightly in the dev branch. You can find a tutorial here on how to use it (feedback on it is appreciated, cc @npnv who authored it). As part of the changes in dev, if I'm not wrong, the item_id should no longer be required for splitting a series (why would it be?), so you could also try that.

Aug 05 '22 06:08 lostella

@lostella Sorry for the delay, I've just made a PR for this issue. Please let me know if I've done anything wrong. PR: https://github.com/awslabs/gluon-ts/pull/2214

Aug 12 '22 21:08 mvanness354