FastAI.jl

Dataset recipes

Open lorenzoh opened this issue 4 years ago • 14 comments

With #151, FastAI.jl is getting high-level interfaces for searching datasets (finddatasets) and loading datasets into task-specific data containers (loaddataset). There is also a new DatasetRecipe that encapsulates configuration for loading a data container and the block information from a path. These recipes can be registered with a dataset so that they can be found using the above high-level functions.

The fastai dataset collection comes with quite a lot of datasets, so only a few have recipes yet. This issue tracks the progress on adding recipes to all the datasets. Contributions of recipe types and recipe configs for datasets are welcome.

See src/datasets/recipes.jl for example recipe implementations and src/datasets/fastairegistry for how recipes are registered. listdatasources() gives you a list of all dataset sources and datasetpath(name) downloads them and returns the download folder.

Progress

For datasets that can be used for multiple tasks, the tasks are listed beneath the dataset. Otherwise, a checked dataset means that at least one recipe is already implemented.

  • [x] CUB_200_2011
  • [ ] bedroom (not sure how the folders are laid out)
  • [x] caltech_101
  • [x] cifar10
  • [x] cifar100
  • [ ] food-101
  • [x] imagenette-160
  • [x] imagenette-320
  • [x] imagenette
  • [x] imagenette2-160
  • [x] imagenette2-320
  • [x] imagenette2
  • [x] imagewang-160
  • [x] imagewang-320
  • [x] imagewang
  • [x] imagewoof-160
  • [x] imagewoof-320
  • [x] imagewoof
  • [x] imagewoof2-160
  • [x] imagewoof2-320
  • [x] imagewoof2
  • [x] mnist_png
  • [x] mnist_var_size_tiny
  • [ ] oxford-102-flowers
  • [ ] oxford-iiit-pet
  • [ ] stanford-cars
  • [ ] ag_news_csv
  • [ ] amazon_review_full_csv
  • [ ] amazon_review_polarity_csv
  • [ ] dbpedia_csv
  • [ ] giga-fren
  • [ ] imdb
  • [ ] sogou_news_csv
  • [ ] wikitext-103
  • [ ] wikitext-2
  • [ ] yahoo_answers_csv
  • [ ] yelp_review_full_csv
  • [ ] yelp_review_polarity_csv
  • [ ] biwi_head_pose
  • [x] camvid
  • [ ] pascal-voc
  • [ ] pascal_2007
    • [x] multi-label image classification ((Image{2}, LabelMulti))
    • [ ] object detection
  • [ ] pascal_2012
  • [ ] siim_small
  • [ ] skin-lesion
  • [ ] tcga-small
  • [x] adult_sample
  • [ ] biwi_sample
  • [x] camvid_tiny
  • [ ] dogscats
  • [ ] human_numbers
  • [ ] imdb_sample
  • [x] mnist_sample
  • [x] mnist_tiny
  • [ ] movie_lens_sample
  • [ ] planet_sample
  • [ ] planet_tiny
  • [ ] coco_sample
  • [ ] coco-train2017
  • [ ] coco-val2017
  • [ ] coco-test2017
  • [ ] coco-unlabeled2017
  • [ ] coco-image_info_test2017
  • [ ] coco-image_info_unlabeled2017
  • [ ] coco-annotations_trainval2017
  • [ ] coco-stuff_annotations_trainval2017
  • [ ] coco-panoptic_annotations_trainval2017

lorenzoh avatar Aug 10 '21 22:08 lorenzoh

Do you think we can pull some or all of this into MLDatasets.jl? Obviously some parts like the block API won't be applicable, but it would be nice to expose the registry functionality, for example.

Edit: ref. https://github.com/JuliaML/MLDatasets.jl/issues/73 as well.

ToucheSir avatar Aug 10 '21 22:08 ToucheSir

It might be worth also looking at DataSets.jl announced at JuliaCon.

darsnack avatar Aug 10 '21 23:08 darsnack

Do you think we can pull some or all of this into MLDatasets.jl? Obviously some parts like the block API won't be applicable, but it would be nice to expose the registry functionality, for example.

At some point, all the dataset functionality should be merged down to MLDatasets.jl and MLDataPattern.jl.

The registry itself is pretty barebones; if you take away the functionality related to blocks, then you could replace it with a Dict{String, Vector{DatasetRecipe}} that maps each dataset name to a list of recipes.
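To illustrate the idea, here is a minimal sketch of such a block-free registry. The names AbstractRecipe, FolderRecipe, register! and recipes_for are hypothetical stand-ins made up for this example; they are not the FastAI.jl API, and the real DatasetRecipe also carries block information.

```julia
# Hypothetical stand-in for a recipe type (not FastAI.jl's DatasetRecipe).
abstract type AbstractRecipe end

# Hypothetical concrete recipe that only stores a description.
struct FolderRecipe <: AbstractRecipe
    description::String
end

# The registry: dataset name => recipes that know how to load it.
const REGISTRY = Dict{String, Vector{AbstractRecipe}}()

# Register a recipe under a dataset name, creating the entry if needed.
function register!(registry, name::AbstractString, recipe::AbstractRecipe)
    push!(get!(registry, name, AbstractRecipe[]), recipe)
    return registry
end

# Look up all recipes for a dataset; unknown names give an empty list.
recipes_for(registry, name::AbstractString) = get(registry, name, AbstractRecipe[])

register!(REGISTRY, "imagenette2-160", FolderRecipe("image classification"))
register!(REGISTRY, "pascal_2007", FolderRecipe("multi-label image classification"))
```

Looking up a dataset with recipes_for(REGISTRY, "pascal_2007") then returns the recipes registered for it, and a dataset without recipes simply yields an empty vector.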

lorenzoh avatar Aug 11 '21 08:08 lorenzoh

At some point we'll have to think about iterable datasets, and at that point some rearchitecting around DataSets.jl could be useful. It should also not be too hard to add iterable support to DataLoaders.jl.
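For context, a rough sketch of the difference between an indexable data container and an iterable dataset, written against plain Base interfaces. VectorDataset and LineDataset are hypothetical types for illustration only; the actual data container interface used by FastAI.jl and DataLoaders.jl may differ.

```julia
# Indexable container: random access, so it can be shuffled and split.
struct VectorDataset{T}
    items::Vector{T}
end
Base.getindex(ds::VectorDataset, i::Int) = ds.items[i]
Base.length(ds::VectorDataset) = length(ds.items)

# Iterable dataset: observations arrive one at a time, length unknown up front.
struct LineDataset
    path::String
end
Base.IteratorSize(::Type{LineDataset}) = Base.SizeUnknown()
Base.eltype(::Type{LineDataset}) = String

# The iteration state is the open file handle; it is closed at end of file.
function Base.iterate(ds::LineDataset, io::IO = open(ds.path))
    if eof(io)
        close(io)
        return nothing
    end
    return readline(io), io
end

# Demo: write three lines to a temporary file and stream them back.
path = tempname()
write(path, "a\nb\nc\n")
streamed = collect(LineDataset(path))
indexed = VectorDataset(streamed)
```

Note that this sketch only closes the file when iteration runs to the end; a real implementation would need to handle early termination as well.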

For now I want to provide a useful core of offline datasets here in FastAI.jl with this simple approach. Rearchitecting should probably flow into the efforts in MLDatasets.jl (or perhaps a DLDatasets.jl if everything will be deprecated anyway?). I'll give a larger reply in https://github.com/JuliaML/MLDatasets.jl/issues/73 later

In any case, any recipe logic associated with the fastai datasets here should be easily relocatable later. 👍

lorenzoh avatar Aug 11 '21 08:08 lorenzoh

Some are being added in #163

lorenzoh avatar Aug 27 '21 17:08 lorenzoh

Hey, I'd like to work on this issue. Since it is labeled good first issue, I believe I can help. Can you please specify what still has to be done? I see the list above hasn't been updated.

Chandu-4444 avatar Jan 20 '22 11:01 Chandu-4444

Hey! The list above is up to date. The easiest thing to get started with should be adding recipes for the CSV datasets and registering some TableDatasetRecipes.

lorenzoh avatar Jan 20 '22 18:01 lorenzoh

Next I want to add recipes for dbpedia_csv and ag_news_csv. Both are in CSV format, but the labels are stored in a separate file, and the CSV files only contain the indices of these labels. In that case, I think it is better to replace the label indices with the actual labels in the recipe code itself and then wrap it with TableClassificationRecipe. Are there any ideas on how to do this?

Chandu-4444 avatar Jan 24 '22 13:01 Chandu-4444

Might need a new recipe type that wraps TableRecipe, but can't say without looking at the folder structure

lorenzoh avatar Jan 26 '22 11:01 lorenzoh

fastai-dbpedia_csv/
└── dbpedia_csv
    ├── classes.txt
    ├── readme.txt
    ├── test.csv
    └── train.csv

This is the folder structure for both datasets (dbpedia_csv, ag_news_csv).

Chandu-4444 avatar Jan 26 '22 18:01 Chandu-4444

Might need a new recipe type that wraps TableRecipe, but can't say without looking at the folder structure

Is it necessary to make a new recipe for datasets that have folder structures similar to the one above? Or is it possible to tweak the existing ones to get the job done?

Chandu-4444 avatar Feb 26 '22 10:02 Chandu-4444

I think in this case it may be possible to create a new recipe that wraps TableRecipe (which loads the table) and then reads in the labels and converts label indices to label strings. I don't have the bandwidth to look into this in more detail currently, though.
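As a rough sketch of the label-conversion step only, in plain Julia and without any FastAI.jl types: readclasses and labelrow are hypothetical helper names, the demo files are made up, and the code assumes that classes.txt holds one class name per line and that the first CSV column is a 1-based index into it. The parsing is deliberately naive (a real recipe would load the table with a proper CSV reader).

```julia
# Read one class name per line; line i is the name for label index i.
readclasses(path) = readlines(path)

# Replace the leading 1-based label index of a CSV row with its class name.
# Naive parsing: assumes the label is the first, possibly quoted, field.
function labelrow(line::AbstractString, classes::Vector{String})
    index, rest = split(line, ','; limit = 2)
    label = classes[parse(Int, strip(index, '"'))]
    return (label = label, rest = String(rest))
end

# Demo on a temporary folder with made-up contents mimicking the layout.
dir = mktempdir()
write(joinpath(dir, "classes.txt"), "Company\nArtist\n")
write(joinpath(dir, "train.csv"),
      "2,\"Some title\",\"Some text\"\n1,\"Other title\",\"More text\"\n")

classes = readclasses(joinpath(dir, "classes.txt"))
rows = [labelrow(line, classes) for line in eachline(joinpath(dir, "train.csv"))]
```

A wrapping recipe could apply the same index-to-name mapping to the label column after the inner table recipe has loaded the data.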

lorenzoh avatar Mar 01 '22 13:03 lorenzoh

I think in this case it may be possible to create a new recipe that wraps TableRecipe (which loads the table) and then reads in the labels and converts label indices to label strings.

I'll work on this.

Chandu-4444 avatar Mar 01 '22 15:03 Chandu-4444

After the community meet, I explored FastAI.jl, MLUtils, and a couple of other libraries and tried to understand the codebase. I would love to get started with adding a dataset. Can you please specify which one of the above would be a good one to start with? Also, I believe the list above isn't updated.

arcAman07 avatar Mar 02 '22 21:03 arcAman07