
Bug With Uber TLC Daily Dataset

Open mvanness354 opened this issue 3 years ago • 1 comments

Description

I have encountered an issue with the Uber TLC dataset: its entries have no item_id field. This breaks OffsetSplitter, which looks up item_id on every entry and raises a KeyError.

To Reproduce

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.dataset.split import OffsetSplitter

uber_dataset = get_dataset("uber_tlc_daily")
splitter = OffsetSplitter(prediction_length=7, split_offset=-7)
split = splitter.split(uber_dataset.train)

Error message or code output

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_29164/1151371949.py in <cell line: 4>()
      2 
      3 splitter = OffsetSplitter(prediction_length=7, split_offset=-7)
----> 4 split = splitter.split(uber_dataset.train)

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/gluonts/dataset/split/splitter.py in split(self, items)
    212         split = TrainTestSplit()
    213 
--> 214         for item in map(TimeSeriesSlice.from_data_entry, items):
    215 
    216             train = self._train_slice(item)

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/gluonts/dataset/split/splitter.py in from_data_entry(cls, item, freq)
     98         return TimeSeriesSlice(
     99             target=pd.Series(item["target"], index=index),
--> 100             item=item[FieldName.ITEM_ID],
    101             feat_static_cat=feat_static_cat,
    102             feat_static_real=feat_static_real,

KeyError: 'item_id'

Luckily, I think there is an easy fix. The Uber data has a location ID, which I think can serve as the item ID. I propose the following change in the Uber TLC loader file:

test_format_dict = {
    "start": start_time,
    "target": target,
    "feat_static_cat": feat_static_cat,
    "item_id": locationID, # new line
}
test_data.append(test_format_dict)

train_format_dict = {
    "start": start_time,
    "target": target[:-prediction_length],
    "feat_static_cat": feat_static_cat,
    "item_id": locationID, # new line
}
train_data.append(train_format_dict)
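
Until the loader is patched, one possible workaround is to attach an ID to each entry before splitting. This is a minimal sketch, assuming the entries are plain dicts as the traceback suggests. `with_item_ids` is a hypothetical helper, not part of GluonTS, and it uses the entry's position as a stand-in ID rather than the real location ID.

```python
from typing import Any, Dict, Iterable, Iterator


def with_item_ids(entries: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield copies of the entries, adding an "item_id" where it is missing.

    Hypothetical helper (not part of GluonTS). The positional index is an
    assumed stand-in ID; it is not the Uber location ID.
    """
    for index, entry in enumerate(entries):
        patched = dict(entry)  # shallow copy, the original entry is untouched
        patched.setdefault("item_id", str(index))  # keep an existing ID if present
        yield patched
```

The patched entries could then be passed to the splitter in place of the raw dataset, for example `splitter.split(list(with_item_ids(uber_dataset.train)))`, although I have not verified this against every GluonTS version.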

Environment

  • Operating system: Amazon Linux 2 (SageMaker)
  • Python version: 3.8.12
  • GluonTS version: 0.10.2
  • MXNet version: N/A (using PyTorch)

mvanness354 avatar Aug 04 '22 23:08 mvanness354

@mvanness354 thank you for spotting this! The change you propose makes sense to me, assuming that locationIDs are indeed distinct across the dataset. Do you want to open a PR for this?
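
The distinctness assumption could be checked with something like the sketch below. `duplicate_item_ids` is a hypothetical name, and the code assumes every entry is a dict that already carries an "item_id" key.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def duplicate_item_ids(entries: Iterable[Dict[str, Any]]) -> List[Any]:
    """Return the item_id values that occur more than once.

    Hypothetical helper; assumes each entry has an "item_id" key.
    """
    counts = Counter(entry["item_id"] for entry in entries)
    return [item_id for item_id, n in counts.items() if n > 1]
```

An empty result would mean the IDs are distinct across the dataset.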

As a side note: the splitting functionality is changing slightly in the dev branch. You can find a tutorial here on how to use it (feedback on it is appreciated, cc @npnv who authored it). As part of the changes in dev, if I’m not wrong, the item_id should no longer be required for splitting a series (why would it be?), so you could also try that.
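
To illustrate why an ID should not be needed: an offset-based split only has to slice the target. The following is a toy sketch, not the GluonTS implementation. `offset_split` is a hypothetical name, and only negative offsets (counted from the end of the series) are handled.

```python
from typing import Any, Dict, Sequence, Tuple


def offset_split(entry: Dict[str, Any], offset: int) -> Tuple[Dict[str, Any], Sequence]:
    """Split one entry at a negative offset counted from the end of the target.

    Returns the training entry (target truncated) and the held-out tail.
    Toy illustration only; no "item_id" is read anywhere.
    """
    if offset >= 0:
        raise ValueError("this sketch only handles negative offsets")
    target = entry["target"]
    train_entry = {**entry, "target": target[:offset]}  # copy with a shorter target
    return train_entry, target[offset:]
```

For example, `offset_split({"start": "2015-01-01", "target": list(range(10))}, -7)` keeps the first three values for training and holds out the last seven.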

lostella avatar Aug 05 '22 06:08 lostella

@lostella Sorry for the delay, I've just made a PR for this issue. Please let me know if I've done anything wrong. PR: https://github.com/awslabs/gluon-ts/pull/2214

mvanness354 avatar Aug 12 '22 21:08 mvanness354