
Can you please add the Stanford dog dataset?

Open dgrnd4 opened this issue 2 years ago • 13 comments

Adding a Dataset

  • Name: Stanford dog dataset
  • Description: The dataset covers 120 classes for a total of 20,580 images. You can find the dataset here: http://vision.stanford.edu/aditya86/ImageNetDogs/
  • Paper: http://vision.stanford.edu/aditya86/ImageNetDogs/
  • Data: link to the Github repository or current dataset location
  • Motivation: The dataset has been built using images and annotations from ImageNet for the task of fine-grained image categorization, so it is useful for fine-grained classification work.

Instructions to add a new dataset can be found here.

dgrnd4 avatar Jun 15 '22 15:06 dgrnd4

would you like to give it a try, @dgrnd4? (maybe with the help of the dataset author?)

julien-c avatar Jun 15 '22 15:06 julien-c

@julien-c I am sorry, but I have no idea how it works: can I add the dataset by myself, following the "instructions to add a new dataset"? Can I add a dataset even if it's not mine? (It's public at the link I put in the post.)

dgrnd4 avatar Jun 15 '22 15:06 dgrnd4

Hi! The ADD NEW DATASET instructions are indeed the best place to start. It's also perfectly fine to add a dataset if it's public, even if it's not yours. Let me know if you need some additional pointers.

mariosasko avatar Jun 16 '22 10:06 mariosasko

If no one is working on this, I could take this up!

khushmeeet avatar Jul 03 '22 07:07 khushmeeet

@khushmeeet this is the link where I already added the dataset. If you can, I would ask you to do this:

  1. The dataset is currently all in the TRAINING SET: can you please divide it into training, test, and validation sets? If possible, for each class take 80% for the training set, 10% for test, and 10% for validation.
  2. The images have different sizes; can you please resize all of them to 224,224,3? Also check the last dimension "3", because some images have 4 channels!

Thank you!!

dgrnd4 avatar Jul 03 '22 08:07 dgrnd4

Hi @khushmeeet! Thanks for the interest. You can self-assign the issue by commenting #self-assign on it.

Also, I think we can skip @dgrnd4's steps as we try to avoid any custom processing on top of raw data. One can later copy the script and override _post_process in it to perform such processing on the generated dataset.
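
Roughly like this, for example (an untested sketch on my end: the 224x224 size is taken from the comment above, and the override assumes the _post_process(self, dataset, resources_paths) hook from the builder base class):

import datasets

class StanfordDogs(datasets.GeneratorBasedBuilder):
    # ... the copied _info / _split_generators / _generate_examples stay as they are ...

    def _post_process(self, dataset, resources_paths):
        # resize every image to 224x224 RGB after the dataset has been generated
        def resize(example):
            example["image"] = example["image"].convert("RGB").resize((224, 224))
            return example
        return dataset.map(resize)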

mariosasko avatar Jul 04 '22 11:07 mariosasko

Thanks @mariosasko

@dgrnd4 As the dataset is already on the Hub and preprocessing is not recommended, I am not sure if there is any other task to do. However, I can't seem to find the relevant .py files for this dataset in the GitHub repo.

khushmeeet avatar Jul 04 '22 20:07 khushmeeet

@khushmeeet @mariosasko The point is that the images must be processed and must all have the same size before they can be used for things like training.

dgrnd4 avatar Jul 06 '22 09:07 dgrnd4

@dgrnd4 Yes, but this can be done after loading (map to resize images and train_test_split to create extra splits)
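
Something along these lines, for example (an untested sketch: the dataset id is a placeholder, I'm assuming the loaded dataset has 'image'/'label' columns, and the 80/10/10 split and 224x224 size come from the earlier comment):

from datasets import load_dataset, DatasetDict

dataset = load_dataset("<stanford-dogs-dataset-id>", split="train")  # placeholder id

# map returns a *new* dataset with every image resized to 224x224 RGB
def resize(example):
    example["image"] = example["image"].convert("RGB").resize((224, 224))
    return example

dataset = dataset.map(resize)

# 80% train / 20% held out, then split the held-out part into 10% validation / 10% test
train_rest = dataset.train_test_split(test_size=0.2, seed=42)
val_test = train_rest["test"].train_test_split(test_size=0.5, seed=42)

dataset = DatasetDict({
    "train": train_rest["train"],
    "validation": val_test["train"],
    "test": val_test["test"],
})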

@khushmeeet The linked version is implemented as a no-code dataset and is generated directly from the ZIP archive, but our "GitHub" datasets (these are datasets without a user/org namespace on the Hub) need a generation script, and you can find one here. datasets started as a fork of TFDS, so we share similar script structure, which makes it trivial to adapt it.
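
The rough shape of such a script, just to illustrate the structure (the URL, breed-name list, and label-parsing logic below are placeholders, not the actual script):

import os
import datasets

_URL = "http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar"  # assumed archive location
_BREED_NAMES = ["Chihuahua"]  # placeholder: the full list of 120 breed names goes here

class StanfordDogs(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            description="Stanford Dogs: 120 dog breeds, 20,580 images from ImageNet.",
            features=datasets.Features(
                {"image": datasets.Image(), "label": datasets.ClassLabel(names=_BREED_NAMES)}
            ),
        )

    def _split_generators(self, dl_manager):
        archive = dl_manager.download(_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"files": dl_manager.iter_archive(archive)},
            )
        ]

    def _generate_examples(self, files):
        for idx, (path, f) in enumerate(files):
            # folder names in the archive encode the breed, e.g. "n02085620-Chihuahua"
            label = os.path.basename(os.path.dirname(path)).split("-", 1)[-1]
            yield idx, {"image": {"path": path, "bytes": f.read()}, "label": label}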

mariosasko avatar Jul 06 '22 11:07 mariosasko

@mariosasko The point is that if I use something like this: x_train, x_test = train_test_split(dataset, test_size=0.1)

to get Train 90% and Test 10%, and then to get the Validation Set (10% of the full dataset):

train_ratio = 0.80
validation_ratio = 0.10
test_ratio = 0.10

x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio)) 

The point is that the structure of the data is:

DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 20580
    })
})

So how to extract images and labels?
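
For reference, the two columns can be read off a split directly (note that indexing the "image" column decodes every image into memory at once):

images = dataset["train"]["image"]   # list of PIL images
labels = dataset["train"]["label"]   # list of integer class ids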

EDIT --> Split of the dataset into Train-Test-Validation:

import datasets
from datasets.dataset_dict import DatasetDict
from datasets import Dataset

percentage_divison_test = int(len(dataset['train'])/100 *10)       # 10%  --> 2058
percentage_divison_validation = int(len(dataset['train'])/100 *20) # 20%  --> 4116 (test + validation together)

dataset_ = datasets.DatasetDict({"train": Dataset.from_dict({  # 0 , 20580-4116 (first 80%)

                                  'image':  dataset['train'][0 : len(dataset['train']) - percentage_divison_validation ]['image'],
                                  'labels': dataset['train'][0 : len(dataset['train']) - percentage_divison_validation ]['label'] }),
                                 
                                 "test": Dataset.from_dict({  #20580-4116 (validation) ,20580-2058 (test)
                                  'image':  dataset['train'][len(dataset['train']) - percentage_divison_validation : len(dataset['train']) - percentage_divison_test]['image'], 
                                  'labels': dataset['train'][len(dataset['train']) - percentage_divison_validation : len(dataset['train']) - percentage_divison_test]['label'] }), 
                                 
                                  "validation": Dataset.from_dict({ # 20580-2058 (test)
                                  'image':  dataset['train'][len(dataset['train']) - percentage_divison_test : len(dataset['train'])]['image'], 
                                  'labels': dataset['train'][len(dataset['train']) - percentage_divison_test : len(dataset['train'])]['label'] }), 
                                })
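
Note that slicing by position like this keeps the dataset's original ordering, which (coming from the per-breed folders in the archive) likely groups images by breed, so the tail 20% used for test and validation may only cover a few classes. train_test_split shuffles by default and, if I'm not mistaken about the parameter name, can also stratify on the label column:

splits = dataset["train"].train_test_split(test_size=0.2, seed=42, stratify_by_column="label")
val_test = splits["test"].train_test_split(test_size=0.5, seed=42, stratify_by_column="label")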

dgrnd4 avatar Jul 06 '22 18:07 dgrnd4

@mariosasko in order to resize images I'm trying this method:

size_to_resize = (224, 224)

for i in range(0, len(dataset['train'])):

  ex = dataset['train'][i]
  image = ex['image']
  image = image.convert("RGB")                  # <PIL.Image.Image image mode=RGB size=500x333>
  image_resized = image.resize(size_to_resize)  # <PIL.Image.Image image mode=RGB size=224x224>

  dataset['train'][i]['image'] = image_resized  # this assignment has no effect (see below)

Because the DatasetDict is backed by Arrow tables, which are immutable, the assignment in the last line of code doesn't work! Do you have any idea how to get a valid result?

dgrnd4 avatar Jul 07 '22 13:07 dgrnd4

#self-assign

khushmeeet avatar Jul 09 '22 00:07 khushmeeet

I have raised a PR for adding the stanford-dog dataset. I have not added any data preprocessing code; only the dataset generation script is there. Let me know of any changes required, or anything to add to the README.

khushmeeet avatar Jul 09 '22 04:07 khushmeeet

Is this issue still open? I am new to open source and would like to take this one as my start.

zutarich avatar Oct 09 '23 06:10 zutarich

@zutarich This issue should have been closed since the dataset in question is available on the Hub here.

mariosasko avatar Oct 18 '23 18:10 mariosasko