Help adding a new continual learning dataset to tfds
What I need help with / What I was wondering

Hello, I am trying to add a new continual learning dataset to tfds (suggested by researchers at Google AI); here's what I have so far. However, I have some questions and would appreciate any advice. Thanks in advance!
-
The dataset is made up of raw images of different sizes; how should I specify that? If I leave the shape as it is for now, `(None, None, 3)`, then when I try to load it I get

(Pdb) clear_data['train']['image'].shape
TensorShape([32752, 500, 500, 3])

but the dataset has 33000 images in total. What should I do?
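The situation above can be reproduced without tfds: images of different sizes cannot be stacked into one dense tensor, so only a uniformly-sized subset survives. A minimal NumPy sketch (image sizes below are made up, and `resize_nearest` is just a stand-in for `tf.image.resize`) of the usual workaround: keep `shape=(None, None, 3)` in the builder and resize each example at load time, before batching:

```python
import numpy as np

# Two stand-in "decoded images" of different sizes (made-up shapes,
# not real CLEAR files).
images = [np.zeros((500, 500, 3), dtype=np.uint8),
          np.zeros((375, 600, 3), dtype=np.uint8)]

# Stacking ragged shapes into one dense batch fails -- this is why a
# single dense tensor of all 33000 images cannot exist unless every
# image happens to be the same size.
try:
    np.stack(images)
except ValueError:
    pass

def resize_nearest(img, out_h, out_w):
    """Naive nearest-neighbour resize, standing in for tf.image.resize."""
    h, w, _ = img.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

# After per-example resizing, stacking works.
batch = np.stack([resize_nearest(img, 224, 224) for img in images])
print(batch.shape)  # (2, 224, 224, 3)
```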
-
The dataset also comes with pre-extracted features (['moco_b0', 'moco_imagenet', 'byol_imagenet', 'imagenet']). Should I include them in tfds as well? If so, how? They are already in tensor form and are loaded differently from the raw images.
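One option would be to store each pre-extracted feature as a fixed-shape `tfds.features.Tensor` entry next to the image in the `FeaturesDict`. A sketch of the alignment logic in plain NumPy (the feature dimensions below are placeholders for illustration, not the real sizes of CLEAR's released tensors):

```python
import numpy as np

# Placeholder dimensions -- assumptions for illustration, not the real
# sizes of CLEAR's released feature tensors.
FEATURE_DIMS = {'moco_b0': 1280, 'moco_imagenet': 2048,
                'byol_imagenet': 2048, 'imagenet': 2048}

def load_features(name, num_examples):
    """Stand-in for reading one released feature tensor from disk."""
    return np.zeros((num_examples, FEATURE_DIMS[name]), dtype=np.float32)

# Each example could then carry its per-backbone feature vector next to
# the raw image, e.g. as a tfds.features.Tensor(shape=(1280,),
# dtype=np.float32) entry in the FeaturesDict.
feats = load_features('moco_b0', 4)
print(feats.shape)  # (4, 1280)
```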
-
As a continual learning dataset, the data is already split into 10 continuous tasks (timestamps). Should I list each task as a 'split', or as an extra feature (the way I am doing right now)? By the way, the dataset doesn't have a predefined train/test split.
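For what it's worth, keeping 'timestamp' as a feature means each task is recovered with a filter at load time (the alternative being one named split per task). Sketched here on plain dicts mirroring the builder's output (the ids and per-task counts are made up for illustration):

```python
# Toy examples mimicking the builder's output: a 'timestamp' feature
# per example (ids and counts are made up for illustration).
examples = [{'image_id': '%d_bus_%d.jpg' % (t, i), 'timestamp': str(t)}
            for t in range(1, 11) for i in range(3)]

def task_subset(examples, t):
    """Select one continual-learning task by its timestamp feature."""
    return [ex for ex in examples if ex['timestamp'] == str(t)]

print(len(task_subset(examples, 1)))  # 3
```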
-
Evaluation of continual learning can differ from that of traditional datasets. How could I include the following two protocols with the dataset in tfds, if possible?
IID Protocol: Sample a test set from the current task, which requires splitting the data into a 7:3 train:test ratio.

Streaming Protocol: Use the data of the next task as the test set for the current task. This is arguably more realistic, since real-world model training and deployment usually take a considerable amount of time; by the time the model is applied, the task has already drifted.
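Both protocols reduce to simple index arithmetic once tasks are addressable. A minimal pure-Python sketch (deterministic order assumed; real use would shuffle before the 7:3 cut):

```python
def iid_split(task_examples, train_frac=0.7):
    """IID protocol: carve a 7:3 train/test split out of one task."""
    n_train = int(len(task_examples) * train_frac)
    return task_examples[:n_train], task_examples[n_train:]

def streaming_pairs(tasks):
    """Streaming protocol: test each task's model on the *next* task."""
    return [(tasks[t], tasks[t + 1]) for t in range(len(tasks) - 1)]

train, test = iid_split(list(range(10)))
print(len(train), len(test))  # 7 3

pairs = streaming_pairs([['a'], ['b'], ['c']])
print(pairs)  # [(['a'], ['b']), (['b'], ['c'])]
```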
What I've tried so far
"""clear dataset.""" import os import tensorflow as tf import tensorflow_datasets as tfds _DESCRIPTION = """ CLEAR is th a continual image classification benchmark dataset with a natural temporal evolution of visual concepts in the real world that spans a decade (2004-2014). It contains 33000 image data of 10 different timestamp, 11 class(10 illustrative classes and an 11th background class.), and 300 images per timestamp/class. Please refer to homepage for extracted feature & proposed evaluation. """ _CITATION = """ @inproceedings{lin2021clear, title={The CLEAR Benchmark: Continual LEArning on Real-World Imagery}, author={Lin, Zhiqiu and Shi, Jia and Pathak, Deepak and Ramanan, Deva}, booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track}, year={2021}} """ _DATA_URL = 'https://drive.google.com/uc?export=download&id=1m9dAJtMynq1ayjx-R5vRNvN6tSuFgmll' _CLASS_NAME=['BACKGROUND','baseball','bus','camera','cosplay','dress','hockey','laptop','racing', 'soccer','sweater'] _NUM_TIME=10 class Clear(tfds.core.GeneratorBasedBuilder): """DatasetBuilder for clear dataset.""" VERSION = tfds.core.Version('1.0.0') RELEASE_NOTES = { '1.0.0': 'Initial release', } def _info(self) -> tfds.core.DatasetInfo: """Returns the dataset metadata.""" return tfds.core.DatasetInfo( builder=self, description=_DESCRIPTION, features=tfds.features.FeaturesDict({ # These are the features of your dataset like images, labels ... 
'image': tfds.features.Image(shape=(None, None, 3)), 'label': tfds.features.ClassLabel(names=_CLASS_NAME), 'timestamp': tfds.features.ClassLabel( names=[str(time) for time in range(1,_NUM_TIME+1)]), }), supervised_keys=('image', 'label'), homepage='https://clear-benchmark.github.io/', citation=_CITATION, ) def _split_generators(self, dl_manager: tfds.download.DownloadManager): """Returns SplitGenerators.""" # Downloads the data and defines the splits extracted_path = dl_manager.download_and_extract(_DATA_URL) # print(extracted_path) return { tfds.Split.TRAIN: self._generate_examples( path=extracted_path / 'labeled_images', )} def _generate_examples(self, path): """Yields examples.""" # labeled_images/timestamp/class_name/file_path time_stamp =[str(time) for time in range(1,_NUM_TIME+1)] for time in time_stamp: time_path = os.path.join(path, time) for class_ in tf.io.gfile.listdir(time_path): class_path = os.path.join(path, time,class_) for file_name in tf.io.gfile.listdir(class_path): image = os.path.join(class_path, file_name) label = str(class_) timestamp= str(time) image_id = '%s_%s_%s' % (str(time), str(class_),file_name) yield image_id, {'image': image, 'label': label,'timestamp':timestamp}
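The directory walk in `_generate_examples` can be sanity-checked without tfds by recreating the expected `labeled_images/timestamp/class_name/file` layout on disk and confirming the generated keys are unique (tfds requires example keys to be unique within a split). The file and class names below are made up:

```python
import os
import tempfile

# Recreate a tiny copy of the expected layout:
# <root>/<timestamp>/<class_name>/<file_name>
root = tempfile.mkdtemp()
for t in ('1', '2'):
    for cls in ('bus', 'laptop'):
        d = os.path.join(root, t, cls)
        os.makedirs(d)
        open(os.path.join(d, 'img0.jpg'), 'w').close()

# Walk it the same way _generate_examples does and collect the keys.
keys = []
for t in sorted(os.listdir(root)):
    for cls in sorted(os.listdir(os.path.join(root, t))):
        for fname in sorted(os.listdir(os.path.join(root, t, cls))):
            keys.append('%s_%s_%s' % (t, cls, fname))

print(sorted(keys))
# ['1_bus_img0.jpg', '1_laptop_img0.jpg', '2_bus_img0.jpg', '2_laptop_img0.jpg']
```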