Chronos repository dataset API design
Before we start
This proposal is inspired by the implementation of GluonTS.
Why do we need a Repository Dataset API?
- It is inevitable that our demos/quickstarts/use-cases contain redundant code that transforms the raw data to meet the minimal requirement of `TSDataset` (a datetime column and at least one target column). We can remove this duplication and clean up those demos/quickstarts/use-cases.
- Downloading data through `.sh` scripts is not user-friendly and can cause problems (e.g., an incomplete-download issue has been reported), while a Repository Dataset API can handle such download failures.
- Benchmark process.
What can a Repository Dataset API do?
- Download the raw data and manage it in a default/user-specified path.
- Preprocess* the raw data and return a `TSDataset`.
- Cache the processed data**.
* We only carry out simple preprocessing (i.e., just enough to meet the minimal requirements of `TSDataset`).
** We will talk about caching later; it can be skipped in the first version.
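
As a rough sketch of the first point (download and manage), the flow could look like the helper below. This is purely illustrative: `_download_if_needed` and its URL/path handling are assumptions for this sketch, not part of the proposed API surface.

```python
import os
import urllib.request

def _download_if_needed(url, path="~/.chronos/dataset/", redownload=False):
    # Keep one managed copy of the raw file under the default path.
    path = os.path.expanduser(path)
    os.makedirs(path, exist_ok=True)
    local_file = os.path.join(path, os.path.basename(url))
    # Skip the download if a previous copy already exists, unless the
    # user explicitly asks for a fresh copy via redownload=True.
    if redownload or not os.path.exists(local_file):
        urllib.request.urlretrieve(url, local_file)
    return local_file
```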
What will our API be like?
```python
# chronos.data.public_dataset.py
def get_public_dataset(name,
                       path="~/.chronos/dataset/",
                       redownload=False):
    '''
    Get a public dataset.

    >>> from zoo.chronos.data import get_public_dataset
    >>> tsdata_network_traffic = get_public_dataset(name="network_traffic")

    :param name: str, public dataset name, e.g. "network_traffic".
    :param path: str, download path, defaults to "~/.chronos/dataset/".
    :param redownload: bool, whether to re-download the raw dataset file(s).
    '''
```