Bug With Uber TLC Daily Dataset
Description
I have encountered a problem with the Uber TLC daily dataset: its entries do not contain an item_id field. This causes a KeyError when using an OffsetSplitter.
To Reproduce
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.dataset.split import OffsetSplitter
uber_dataset = get_dataset("uber_tlc_daily")
splitter = OffsetSplitter(prediction_length=7, split_offset=-7)
split = splitter.split(uber_dataset.train)
Error message or code output
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/tmp/ipykernel_29164/1151371949.py in <cell line: 4>()
2
3 splitter = OffsetSplitter(prediction_length=7, split_offset=-7)
----> 4 split = splitter.split(uber_dataset.train)
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/gluonts/dataset/split/splitter.py in split(self, items)
212 split = TrainTestSplit()
213
--> 214 for item in map(TimeSeriesSlice.from_data_entry, items):
215
216 train = self._train_slice(item)
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/gluonts/dataset/split/splitter.py in from_data_entry(cls, item, freq)
98 return TimeSeriesSlice(
99 target=pd.Series(item["target"], index=index),
--> 100 item=item[FieldName.ITEM_ID],
101 feat_static_cat=feat_static_cat,
102 feat_static_real=feat_static_real,
KeyError: 'item_id'
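In the meantime, a possible workaround is to attach an item_id to each entry before splitting. This is a minimal sketch, assuming GluonTS 0.10.x, where OffsetSplitter.split only needs the item_id key to be present; the enumerate-based IDs are placeholders, not values taken from the dataset:

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.dataset.split import OffsetSplitter

uber_dataset = get_dataset("uber_tlc_daily")

# Copy each training entry and attach a placeholder item_id so that
# TimeSeriesSlice.from_data_entry can find the field it expects.
patched_train = [
    {**entry, "item_id": str(i)} for i, entry in enumerate(uber_dataset.train)
]

splitter = OffsetSplitter(prediction_length=7, split_offset=-7)
split = splitter.split(patched_train)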
Luckily, I think there is an easy fix. The Uber data has a location ID, which can be used directly as the item ID. I propose the following change in the Uber TLC loader file; the dictionaries below are where the change would go (the added lines are marked):
test_format_dict = {
"start": start_time,
"target": target,
"feat_static_cat": feat_static_cat,
"item_id": locationID, # new line
}
test_data.append(test_format_dict)
train_format_dict = {
"start": start_time,
"target": target[:-prediction_length],
"feat_static_cat": feat_static_cat,
"item_id": locationID, # new line
}
train_data.append(train_format_dict)
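Assuming location IDs are unique across series, this would give every entry a distinct item_id. A quick (hypothetical) sanity check after regenerating the dataset with the patched loader could look like this:

from gluonts.dataset.repository.datasets import get_dataset

# Force the dataset to be rebuilt with the patched loader, then verify that
# every training entry carries an item_id and that the IDs are distinct.
uber_dataset = get_dataset("uber_tlc_daily", regenerate=True)
item_ids = [entry["item_id"] for entry in uber_dataset.train]
assert len(item_ids) == len(set(item_ids)), "location IDs are not unique"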
Environment
- Operating system: Amazon Linux 2 (SageMaker)
- Python version: 3.8.12
- GluonTS version: 0.10.2
- MXNet version: N/A (using PyTorch)
@mvanness354 Thank you for spotting this! The change you propose makes sense to me, assuming that location IDs are indeed distinct across the dataset. Do you want to open a PR for this?
As a side note: the splitting functionality is changing slightly in the dev branch. You can find a tutorial here on how to use it (feedback on it is appreciated, cc @npnv, who authored it). As part of the changes in dev, if I'm not wrong, the item_id should no longer be required for splitting a series (why would it be?), so you could also try that.
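For reference, a sketch of how the splitting could look with the revised API from the dev branch, based on that tutorial (exact names and signatures may still differ in the final release):

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.dataset.split import split

uber_dataset = get_dataset("uber_tlc_daily")

# Split each series 7 steps before its end; no item_id field is needed here.
training_data, test_template = split(uber_dataset.train, offset=-7)
test_data = test_template.generate_instances(prediction_length=7)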
@lostella Sorry for the delay; I've just opened a PR for this issue. Please let me know if I've done anything wrong. PR: https://github.com/awslabs/gluon-ts/pull/2214