
Chronos repository dataset API design

TheaperDeng opened this issue on Jul 12, 2021

Background

This proposal is inspired by the implementation of GluonTS.

Why do we need a Repository Dataset API?

  • It is inevitable that our demo/quickstart/use-case code contains redundant logic to transform raw data into TSDataset's minimal requirement (a datetime col and at least one target col). A Repository Dataset API lets us simplify that code and clean up these demos/quickstarts/use-cases.
  • The current .sh download scripts are not user-friendly and can cause problems (e.g. incomplete downloads have been reported), while a Repository Dataset API can handle such download failures.
  • It gives the benchmark process a standard way to obtain datasets.

What can a Repository Dataset API do?

  • Download the raw data and manage it under a default or user-specified path (see the download sketch after the notes below).
  • Preprocess* the raw data and return a TSDataset.
  • Cache the processed data**.

* We only carry out simple preprocessing (i.e. just enough to meet the minimal requirement of TSDataset).

** We will discuss caching later; it can be skipped for the first version.
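
To make the download-and-manage behavior concrete, below is a minimal sketch of the download step using only the Python standard library. The helper name _download_if_needed and its signature are assumptions for illustration, not part of the proposed public API; writing to a temporary file first is simply one way to avoid the incomplete-download issue mentioned above.

# possible helper inside chronos.data.public_dataset.py (sketch, not the final design)
import os
import urllib.request

def _download_if_needed(url, path="~/.chronos/dataset/", redownload=False):
    '''Download a raw dataset file into `path` unless it is already cached there.'''
    path = os.path.expanduser(path)
    os.makedirs(path, exist_ok=True)
    file_path = os.path.join(path, os.path.basename(url))
    if redownload or not os.path.exists(file_path):
        # download to a temporary name first so an interrupted download
        # never leaves a truncated file behind
        tmp_path = file_path + ".part"
        urllib.request.urlretrieve(url, tmp_path)
        os.replace(tmp_path, file_path)
    return file_path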

What will our API be like?
# chronos.data.public_dataset.py

def get_public_dataset(name,
                       path="~/.chronos/dataset/",
                       redownload=False):
    '''
    Get a public dataset.
    
    >>> from zoo.chronos.data import get_public_dataset
    >>> tsdata_network_traffic = get_public_dataset(name="network_traffic")
    
    :param name: str, public dataset name, e.g. "network_traffic".
    :param path: str, download path, the value defaults to "~/.chronos/dataset/".
    :param redownload: bool, whether to re-download the raw dataset file(s).
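    :return: a TSDataset instance built from the downloaded data.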
    '''
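
For completeness, here is a minimal sketch of how the body of get_public_dataset could look, building on the _download_if_needed helper sketched above. The _PRESETS registry, the URL, and the column names are illustrative assumptions, and the sketch assumes the returned object is built with TSDataset.from_pandas.

# possible body for get_public_dataset (sketch only; URLs/columns are placeholders)
import pandas as pd
from zoo.chronos.data import TSDataset

_PRESETS = {
    "network_traffic": {
        "url": "https://example.com/network_traffic.csv",  # placeholder URL
        "dt_col": "StartTime",                              # assumed column names
        "target_col": ["AvgRate", "total"],
    },
}

def get_public_dataset(name, path="~/.chronos/dataset/", redownload=False):
    preset = _PRESETS[name]
    file_path = _download_if_needed(preset["url"], path, redownload)
    # simple preprocessing only: parse the datetime column so the frame
    # meets TSDataset's minimal requirement (a datetime col + target col(s))
    df = pd.read_csv(file_path, parse_dates=[preset["dt_col"]])
    return TSDataset.from_pandas(df,
                                 dt_col=preset["dt_col"],
                                 target_col=preset["target_col"])

Keeping the per-dataset metadata in a single registry would make adding a new public dataset a matter of one new entry plus, when needed, a dataset-specific preprocessing step.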
