[EPIC] Gravitino Datasets library
What’s a Dataset?
Describe the proposal
Datasets is a library for easily accessing and sharing structured tabular data, as well as datasets for non-tabular audio, computer vision, and natural language processing (NLP) tasks.
For training a deep learning model, the dataset may be split into train and test subsets. In general, the training dataset is used in the training stage and the test dataset is used in the evaluation stage.
1.1 Dataset Object
There are two types of dataset objects: a regular Dataset and an IterableDataset. A Dataset provides fast random access to its rows, with memory mapping so that even large datasets can be loaded using a relatively small amount of device memory. For very large datasets that won’t fit on disk or in memory, an IterableDataset lets you access and use the dataset without waiting for it to download completely.
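As a minimal sketch of the two access models described above (the class names mirror the proposal, but these implementations are illustrative stand-ins, not the library's actual code):

```python
class Dataset:
    """Map-style dataset: fast random access to rows by index.

    A real implementation would memory-map on-disk data; a plain list
    stands in for the backing storage here.
    """

    def __init__(self, rows):
        self._rows = rows

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, i):
        return self._rows[i]


class IterableDataset:
    """Stream-style dataset: rows are produced lazily, with no random access.

    Takes a factory so each iteration starts a fresh stream, the way a
    streaming loader would reopen its source.
    """

    def __init__(self, row_factory):
        self._row_factory = row_factory

    def __iter__(self):
        return self._row_factory()


ds = Dataset([{"text": "a"}, {"text": "b"}, {"text": "c"}])
print(ds[1])  # random access: {'text': 'b'}

stream = IterableDataset(lambda: ({"text": t} for t in "abc"))
print([row["text"] for row in stream])  # sequential access: ['a', 'b', 'c']
```

The key trade-off: `Dataset` supports `len()` and indexing, while `IterableDataset` only supports iteration, which is what allows it to serve data that never fully materializes locally.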
A split dataset is represented as a dictionary, where the key is the split name and the value is the corresponding Dataset object.
1.2 Split
As described above, a dataset is typically split into sub-datasets used at different stages of model training, such as training, testing, and evaluation.
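The split mapping described above can be sketched as a plain dictionary from split name to dataset object (the list values here are stand-ins for actual Dataset instances):

```python
# split name -> dataset object; lists stand in for Dataset instances
split_dataset = {
    "train": [{"id": i} for i in range(8)],      # used in the training stage
    "test": [{"id": i} for i in range(8, 10)],   # used in the eval stage
}

# A training loop would pick the sub-dataset for its stage by name:
print(len(split_dataset["train"]), len(split_dataset["test"]))  # 8 2
```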
2. Create Dataset
Before supporting these features, Gravitino should support metadata management for model training and access control. The following feature design is based on these assumptions.
3. Load Dataset
Wherever a dataset is stored, the Gravitino Datasets library should help the user load it through Apache Gravitino. We therefore propose the following architecture for loading datasets in the Gravitino Datasets library:
3.1 Catalog
Loading a dataset from Gravitino should use the granted token. The Gravitino Datasets library fetches the metadata from Gravitino and generates the sub-dataset for the user.
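A sketch of this loading flow: authenticate with the granted token, fetch dataset metadata from Gravitino, then resolve the requested split. `GravitinoClient`, `DatasetMeta`, `load_dataset`, and the URIs are all illustrative names assumed for this sketch, not the actual library API.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMeta:
    """Metadata the library would fetch from Gravitino for one dataset."""
    name: str
    storage_uri: str
    splits: dict = field(default_factory=dict)  # split name -> relative path


class GravitinoClient:
    """Hypothetical client; a real one would call the Gravitino REST API."""

    def __init__(self, uri, token):
        self.uri = uri
        self.token = token  # granted token used for access control

    def get_dataset_meta(self, name):
        # Stubbed response standing in for a metadata lookup.
        return DatasetMeta(
            name=name,
            storage_uri=f"hdfs://warehouse/{name}",
            splits={"train": "train/", "test": "test/"},
        )


def load_dataset(client, name, split):
    """Resolve a split to its storage location via Gravitino metadata."""
    meta = client.get_dataset_meta(name)
    if split not in meta.splits:
        raise KeyError(f"unknown split: {split}")
    # A real implementation would read the files at this location
    # (e.g. via fsspec) and build a Dataset; here we return the path.
    return f"{meta.storage_uri}/{meta.splits[split]}"


client = GravitinoClient("http://gravitino:8090", token="granted-token")
path = load_dataset(client, "imdb_reviews", "train")
print(path)  # hdfs://warehouse/imdb_reviews/train/
```

The point of the design is that the user only names a dataset and a split; the token-scoped metadata lookup decides where the bytes actually live.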
Design Document
- https://docs.google.com/document/d/1_gMfkiwc4T56xtE0ZRpla_yD09hqf2MSHKAsWbK-eSc/edit
- https://docs.google.com/document/d/1NdHc52U6tW9acHNWOfGiCEr08XO6VlcHf1q-n8mD60w/edit
Task list
- [ ] https://github.com/apache/gravitino/issues/4233
This looks like a good try at scoping the AI features within Gravitino management. I'm not sure whether the features used in realtime/offline batch/one-time inference should be involved in this design, like a feature store does?
I lack certain background knowledge about this design, so feel free to point out if I'm wrong.
@zuston Sorry for the late reply. We store the dataset, not the features; this was discussed many times in offline meetings. We will start coding and provide a POC ASAP so that everyone can understand it better.
@jiwq Does fsspec consider using https://github.com/fsspec/opendalfs to implement various storage APIs? Add an opendal layer:
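For context on the suggestion above: the reason an OpenDAL-backed filesystem like opendalfs can slot in is that fsspec dispatches on the URI scheme, so a loader written against the fsspec interface is storage-agnostic. A stdlib-only sketch of that dispatch idea (the registry and function names here are illustrative, not fsspec's actual internals):

```python
from urllib.parse import urlparse

# scheme -> filesystem factory; an OpenDAL layer would register one more scheme
_registry = {}


def register_filesystem(scheme, factory):
    _registry[scheme] = factory


def open_uri(uri):
    """Dispatch a URI to whichever filesystem claims its scheme."""
    scheme = urlparse(uri).scheme or "file"
    if scheme not in _registry:
        raise ValueError(f"no filesystem registered for {scheme!r}")
    return _registry[scheme](uri)


# Example: register a toy in-memory "mem" filesystem
register_filesystem("mem", lambda uri: f"contents of {uri}")
print(open_uri("mem://bucket/data.csv"))  # contents of mem://bucket/data.csv
```

With this shape, the dataset-loading code never names a concrete storage backend; adding OpenDAL support would be one more registration rather than a change to the loader.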