
Chronos repository dataset API design

TheaperDeng opened this issue on Jul 12, 2021

Background

This proposal is inspired by the implementation of GluonTS.

Why do we need a Repository Dataset API?

  • It is inevitable that our demo/quickstart/use-case code contains redundant logic to transform raw data into TSDataset's minimal requirement (a datetime col and at least one target col). A Repository Dataset API lets us simplify that code and clean up these demos/quickstarts/use-cases.
  • The current .sh download scripts are not user-friendly and can cause problems (e.g. incomplete downloads have been reported), while a Repository Dataset API can handle such download failures.
  • It gives the benchmark process a standard way to obtain datasets.

What can a Repository Dataset API do?

  • Download the raw data and manage it under a default or user-specified path (see the download sketch after the notes below).
  • Preprocess* the raw data and return a TSDataset.
  • Cache the processed data**.

* We only carry out simple preprocessing (i.e. just enough to meet the minimal requirement of TSDataset).

** We will discuss caching later; it can be skipped for the first version.
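
To make the download-and-manage behavior concrete, below is a minimal sketch of the download step using only the Python standard library. The helper name _download_if_needed and its signature are assumptions for illustration, not part of the proposed public API; writing to a temporary file first is simply one way to avoid the incomplete-download issue mentioned above.

# possible helper inside chronos.data.public_dataset.py (sketch, not the final design)
import os
import urllib.request

def _download_if_needed(url, path="~/.chronos/dataset/", redownload=False):
    '''Download a raw dataset file into `path` unless it is already cached there.'''
    path = os.path.expanduser(path)
    os.makedirs(path, exist_ok=True)
    file_path = os.path.join(path, os.path.basename(url))
    if redownload or not os.path.exists(file_path):
        # download to a temporary name first so an interrupted download
        # never leaves a truncated file behind
        tmp_path = file_path + ".part"
        urllib.request.urlretrieve(url, tmp_path)
        os.replace(tmp_path, file_path)
    return file_path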

What will our API be like?
# chronos.data.public_dataset.py

def get_public_dataset(name,
                       path="~/.chronos/dataset/",
                       redownload=False):
    '''
    Get a public dataset.
    
    >>> from zoo.chronos.data import get_public_dataset
    >>> tsdata_network_traffic = get_public_dataset(name="network_traffic")
    
    :param name: str, public dataset name, e.g. "network_traffic".
    :param path: str, download path, the value defaults to "~/.chronos/dataset/".
    :param redownload: bool, whether to re-download the raw dataset file(s).
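    :return: a TSDataset instance built from the downloaded data.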
    '''
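
For completeness, here is a minimal sketch of how the body of get_public_dataset could look, building on the _download_if_needed helper sketched above. The _PRESETS registry, the URL, and the column names are illustrative assumptions, and the sketch assumes the returned object is built with TSDataset.from_pandas.

# possible body for get_public_dataset (sketch only; URLs/columns are placeholders)
import pandas as pd
from zoo.chronos.data import TSDataset

_PRESETS = {
    "network_traffic": {
        "url": "https://example.com/network_traffic.csv",  # placeholder URL
        "dt_col": "StartTime",                              # assumed column names
        "target_col": ["AvgRate", "total"],
    },
}

def get_public_dataset(name, path="~/.chronos/dataset/", redownload=False):
    preset = _PRESETS[name]
    file_path = _download_if_needed(preset["url"], path, redownload)
    # simple preprocessing only: parse the datetime column so the frame
    # meets TSDataset's minimal requirement (a datetime col + target col(s))
    df = pd.read_csv(file_path, parse_dates=[preset["dt_col"]])
    return TSDataset.from_pandas(df,
                                 dt_col=preset["dt_col"],
                                 target_col=preset["target_col"])

Keeping the per-dataset metadata in a single registry would make adding a new public dataset a matter of one new entry plus, when needed, a dataset-specific preprocessing step.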
