Can you please add the Stanford dog dataset?
Adding a Dataset
- Name: Stanford dog dataset
- Description: The dataset contains 120 classes for a total of 20,580 images. You can find the dataset here: http://vision.stanford.edu/aditya86/ImageNetDogs/
- Paper: http://vision.stanford.edu/aditya86/ImageNetDogs/
- Data: link to the Github repository or current dataset location
- Motivation: The dataset was built using images and annotations from ImageNet for the task of fine-grained image categorization. It is useful for fine-grained classification purposes.
Instructions to add a new dataset can be found here.
Would you like to give it a try, @dgrnd4? (Maybe with the help of the dataset author?)
@julien-c I am sorry, but I have no idea how it works: can I add the dataset by myself, following the "instructions to add a new dataset"? Can I add a dataset even if it's not mine? (It's public at the link I included in the post.)
Hi! The ADD NEW DATASET instructions are indeed the best place to start. It's also perfectly fine to add a dataset if it's public, even if it's not yours. Let me know if you need some additional pointers.
If no one is working on this, I could take this up!
@khushmeeet this is the link where I already added the dataset. If you can, I would ask you to do the following:
- The dataset is entirely in the TRAINING SET: can you please divide it into training, test, and validation sets? If you can, for each class take 80% for the training set and 10% each for the test and validation sets.
- The images have different sizes; can you please resize all the images to 224,224,3? Also check the last dimension, "3", because some images have a fourth channel!
Thank you!!
Hi @khushmeeet! Thanks for the interest. You can self-assign the issue by commenting #self-assign on it.
Also, I think we can skip @dgrnd4's steps, as we try to avoid any custom processing on top of raw data. One can later copy the script and override _post_process in it to perform such processing on the generated dataset.
Thanks @mariosasko
@dgrnd4 As the dataset is already on the Hub, and preprocessing is not recommended, I am not sure there is any other task left to do. However, I can't seem to find the relevant .py files for this dataset in the GitHub repo.
@khushmeeet @mariosasko The point is that the images must be processed and must all have the same size so that they can be used for tasks such as training.
@dgrnd4 Yes, but this can be done after loading (map to resize images and train_test_split to create extra splits).
@khushmeeet The linked version is implemented as a no-code dataset and is generated directly from the ZIP archive, but our "GitHub" datasets (datasets without a user/org namespace on the Hub) need a generation script, and you can find one here. datasets started as a fork of TFDS, so we share a similar script structure, which makes the script trivial to adapt.
@mariosasko The point is that if I use something like this:

from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(dataset, test_size=0.1)

to get 90% train and 10% test, and then to get the validation set (10% of the whole 100%):

train_ratio = 0.80
validation_ratio = 0.10
test_ratio = 0.10
# first split off the combined test + validation share (20%)
x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)
# then split that 20% in half: 10% validation, 10% test
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio / (test_ratio + validation_ratio))
The point is that the structure of the data is:
DatasetDict({
train: Dataset({
features: ['image', 'label'],
num_rows: 20580
})
})
So how do I extract the images and labels?
EDIT --> Split of the dataset in Train-Test-Validation:

import datasets
from datasets import Dataset

n = len(dataset['train'])
offset_test = int(n / 100 * 10)        # 10% --> 2058 (size of the test set)
offset_validation = int(n / 100 * 20)  # 20% --> 4116 (test + validation together)

dataset_ = datasets.DatasetDict({
    "train": Dataset.from_dict({       # first 80%: rows 0 to 20580-4116
        'image': dataset['train'][0 : n - offset_validation]['image'],
        'labels': dataset['train'][0 : n - offset_validation]['label']}),
    "test": Dataset.from_dict({        # next 10%: rows 20580-4116 to 20580-2058
        'image': dataset['train'][n - offset_validation : n - offset_test]['image'],
        'labels': dataset['train'][n - offset_validation : n - offset_test]['label']}),
    "validation": Dataset.from_dict({  # last 10%: rows 20580-2058 to 20580
        'image': dataset['train'][n - offset_test : n]['image'],
        'labels': dataset['train'][n - offset_test : n]['label']}),
})
@mariosasko in order to resize images I'm trying this method:

size_to_resize = (224, 224)
for i in range(len(dataset['train'])):
    ex = dataset['train'][i]
    image = ex['image']
    image = image.convert("RGB")                  # <PIL.Image.Image image mode=RGB size=500x333>
    image_resized = image.resize(size_to_resize)  # <PIL.Image.Image image mode=RGB size=224x224>
    dataset['train'][i]['image'] = image_resized  # this assignment has no effect!

Because a DatasetDict is backed by immutable Arrow tables, the assignment in the last line doesn't work! Do you have any idea how to get a valid result?
#self-assign
I have raised a PR for adding the stanford-dog dataset. I have not added any data preprocessing code; only the dataset generation script is there. Let me know of any changes required, or anything to add to the README.
Is this issue still open? I am new to open source and would like to take this one as my start.
@zutarich This issue should have been closed since the dataset in question is available on the Hub here.