[EPIC] Gravitino Datasets library
What’s a Dataset?
Describe the proposal
Datasets is a library for easily accessing and sharing structured tabular data, as well as datasets for non-tabular audio, computer vision, and natural language processing (NLP) tasks.
For training a deep learning model, the dataset may be split into train and test subsets. In general, the training dataset is used in the training stage and the test dataset is used in the evaluation stage.
1.1 Dataset Object
There are two types of dataset objects: a regular Dataset and an IterableDataset. A Dataset provides fast random access to its rows, with memory mapping so that even large datasets can be loaded using a relatively small amount of device memory. For very large datasets that won’t fit on disk or in memory, an IterableDataset lets you access and use the dataset without waiting for it to download completely.
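As a minimal sketch of the two access models described above (the class names mirror the proposal, but these implementations are illustrative stand-ins, not the library's actual code):

```python
class Dataset:
    """Map-style dataset: fast random access to rows by index.

    A real implementation would memory-map on-disk data; a plain list
    stands in for the backing storage here.
    """

    def __init__(self, rows):
        self._rows = rows

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, i):
        return self._rows[i]


class IterableDataset:
    """Stream-style dataset: rows are produced lazily, with no random access.

    Takes a factory so each iteration starts a fresh stream, the way a
    streaming loader would reopen its source.
    """

    def __init__(self, row_factory):
        self._row_factory = row_factory

    def __iter__(self):
        return self._row_factory()


ds = Dataset([{"text": "a"}, {"text": "b"}, {"text": "c"}])
print(ds[1])  # random access: {'text': 'b'}

stream = IterableDataset(lambda: ({"text": t} for t in "abc"))
print([row["text"] for row in stream])  # sequential access: ['a', 'b', 'c']
```

The key trade-off: `Dataset` supports `len()` and indexing, while `IterableDataset` only supports iteration, which is what allows it to serve data that never fully materializes locally.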
A split dataset is represented as a dictionary, where the key is the split name and the value is the corresponding Dataset object.
1.2 Split
As described above, a dataset is typically split into sub-datasets used at different stages of model training, such as training, testing, and evaluation.
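The split mapping described above can be sketched as a plain dictionary from split name to dataset object (the list values here are stand-ins for actual Dataset instances):

```python
# split name -> dataset object; lists stand in for Dataset instances
split_dataset = {
    "train": [{"id": i} for i in range(8)],      # used in the training stage
    "test": [{"id": i} for i in range(8, 10)],   # used in the eval stage
}

# A training loop would pick the sub-dataset for its stage by name:
print(len(split_dataset["train"]), len(split_dataset["test"]))  # 8 2
```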
2. Create Dataset
Before supporting these features, Gravitino should support metadata management for model training and access control. The following feature design is based on these assumptions.
3. Load Dataset
Wherever a dataset is stored, the Gravitino Datasets library should help the user load it through Apache Gravitino. We therefore propose the following architecture for loading datasets in the Gravitino Datasets library:
3.1 Catalog
Loading a dataset from Gravitino should use the granted token. The Gravitino Datasets library fetches the metadata from Gravitino and generates the sub-dataset for the user.
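A sketch of this loading flow: authenticate with the granted token, fetch dataset metadata from Gravitino, then resolve the requested split. `GravitinoClient`, `DatasetMeta`, `load_dataset`, and the URIs are all illustrative names assumed for this sketch, not the actual library API.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMeta:
    """Metadata the library would fetch from Gravitino for one dataset."""
    name: str
    storage_uri: str
    splits: dict = field(default_factory=dict)  # split name -> relative path


class GravitinoClient:
    """Hypothetical client; a real one would call the Gravitino REST API."""

    def __init__(self, uri, token):
        self.uri = uri
        self.token = token  # granted token used for access control

    def get_dataset_meta(self, name):
        # Stubbed response standing in for a metadata lookup.
        return DatasetMeta(
            name=name,
            storage_uri=f"hdfs://warehouse/{name}",
            splits={"train": "train/", "test": "test/"},
        )


def load_dataset(client, name, split):
    """Resolve a split to its storage location via Gravitino metadata."""
    meta = client.get_dataset_meta(name)
    if split not in meta.splits:
        raise KeyError(f"unknown split: {split}")
    # A real implementation would read the files at this location
    # (e.g. via fsspec) and build a Dataset; here we return the path.
    return f"{meta.storage_uri}/{meta.splits[split]}"


client = GravitinoClient("http://gravitino:8090", token="granted-token")
path = load_dataset(client, "imdb_reviews", "train")
print(path)  # hdfs://warehouse/imdb_reviews/train/
```

The point of the design is that the user only names a dataset and a split; the token-scoped metadata lookup decides where the bytes actually live.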
Design Document
- https://docs.google.com/document/d/1_gMfkiwc4T56xtE0ZRpla_yD09hqf2MSHKAsWbK-eSc/edit
- https://docs.google.com/document/d/1NdHc52U6tW9acHNWOfGiCEr08XO6VlcHf1q-n8mD60w/edit
Task list
- [ ] https://github.com/apache/gravitino/issues/4233
This looks like a good try at scoping the AI features within Gravitino management. I'm not sure whether the features used in realtime/offline batch/one-time inference should be involved in this design, like a feature store does?
I lack certain background knowledge about this design, so feel free to point out if I'm wrong.
@zuston Sorry for the late reply. We store the dataset, not the features; this was discussed many times in offline meetings. We will start coding and provide a POC ASAP so that everyone can understand it better.
@jiwq Does fsspec consider using https://github.com/fsspec/opendalfs to implement various storage APIs? Add an opendal layer:
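For context on the suggestion above: the reason an OpenDAL-backed filesystem like opendalfs can slot in is that fsspec dispatches on the URI scheme, so a loader written against the fsspec interface is storage-agnostic. A stdlib-only sketch of that dispatch idea (the registry and function names here are illustrative, not fsspec's actual internals):

```python
from urllib.parse import urlparse

# scheme -> filesystem factory; an OpenDAL layer would register one more scheme
_registry = {}


def register_filesystem(scheme, factory):
    _registry[scheme] = factory


def open_uri(uri):
    """Dispatch a URI to whichever filesystem claims its scheme."""
    scheme = urlparse(uri).scheme or "file"
    if scheme not in _registry:
        raise ValueError(f"no filesystem registered for {scheme!r}")
    return _registry[scheme](uri)


# Example: register a toy in-memory "mem" filesystem
register_filesystem("mem", lambda uri: f"contents of {uri}")
print(open_uri("mem://bucket/data.csv"))  # contents of mem://bucket/data.csv
```

With this shape, the dataset-loading code never names a concrete storage backend; adding OpenDAL support would be one more registration rather than a change to the loader.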