img2dataset
How to download SBUcaptions and Visual Genome (VG) dataset in webdataset format
For vision-and-language pretraining, cc3m, mscoco, SBU Captions and VG are very relevant datasets. I haven't been able to download SBU Captions and VG. Here are my questions:
- How do I download the metadata for SBU Captions and VG?
- How do I download these datasets in webdataset format?
Could you also please provide a tutorial, or just some hints, for downloading them in webdataset format using img2dataset? Thank you in advance.
https://www.cs.rice.edu/~vo9/sbucaptions/ SBU Captions is provided as URLs + captions in JSON. You can use that as input to img2dataset.
https://visualgenome.org/api/v0/api_home.html Visual Genome is not distributed as image URLs, so you can simply download the images and make a tar with them. That's all a webdataset is: tar archives of the sample files.
For SBU Captions, here is the code I used to try to download it in webdataset format:
wget https://www.cs.rice.edu/~vo9/sbucaptions/sbu-captions-all.tar.gz
tar -xvzf sbu-captions-all.tar.gz
sed -i '1s/^/caption\turl\n/' sbu-captions-all.json
img2dataset --url_list sbu-captions-all.json --input_format "json" --url_col "URL" --caption_col "TEXT" --output_format webdataset --output_folder sbucaptions --processes_count 16 --thread_count 64 --image_size 256
Here is the error I have gotten
For Visual Genome, I am a bit confused about how to make a tar out of image and caption pairs, and about where to get those captions. Are there any instructions on that?
This is the way to download SBU Captions. @rom1504, please also add this to your example file.
wget https://www.cs.rice.edu/~vo9/sbucaptions/sbu-captions-all.tar.gz
tar -xvzf sbu-captions-all.tar.gz
img2dataset --url_list sbu-captions-all.json --input_format "json" --url_col "image_urls" --caption_col "captions" --output_format webdataset --output_folder sbucaptions --processes_count 16 --thread_count 64 --image_size 256
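The values passed to `--url_col` and `--caption_col` have to match keys that actually exist in the JSON file, so it can help to check the key names before starting a long download. A rough sketch of one way to do that from the shell is below. The file `sample.json` is a small stand-in written only for illustration, and its layout (two parallel lists named `image_urls` and `captions`) is an assumption based on the command above, not a verified description of the real file. The `grep` pattern is a heuristic that looks for quoted names followed by a colon, not a real JSON parser.

```shell
#!/bin/sh
set -eu

# Stand-in file for illustration only. The layout (two parallel lists named
# image_urls and captions) is an assumption taken from the command above.
cat > sample.json <<'EOF'
{"image_urls": ["http://example.com/a.jpg", "http://example.com/b.jpg"], "captions": ["first caption", "second caption"]}
EOF

# Heuristic: list every quoted name that is directly followed by a colon,
# strip the quotes and colon, and print each distinct name once.
grep -o '"[A-Za-z_]*" *:' sample.json | tr -d '": ' | sort -u
```

To try this on the real download, replace `sample.json` with `sbu-captions-all.json` and drop the `cat` step.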
If you can open a PR to add an example in the dataset folder, that would be great.
Added a PR as discussed. @rom1504, could you please let me know how to download VG images and captions as a webdataset?
SBU Captions is done.
VG images can be converted using tar directly.
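As a minimal sketch of what "using tar directly" could look like: in the webdataset convention, files that share a basename (for example `1.jpg` and `1.txt`) form one sample and should sit next to each other in the archive. The directory name `vg_demo`, the shard name `vg-000000.tar`, and the per-image `.txt` caption files are all hypothetical, and placeholder bytes stand in for real images so the example runs on its own. How captions are derived from the Visual Genome annotations is not covered here.

```shell
#!/bin/sh
set -eu

# Hypothetical layout: one image plus a same-named caption file per sample.
# Placeholder bytes stand in for real Visual Genome images.
src=vg_demo
mkdir -p "$src"
for id in 1 2 3; do
  printf 'fake-jpeg-%s' "$id" > "$src/$id.jpg"
  printf 'caption for image %s' "$id" > "$src/$id.txt"
done

# Sort the file list so each sample's .jpg and .txt are adjacent in the
# archive, then let tar read the list of names from stdin (-T -).
(cd "$src" && ls | sort | tar -cf ../vg-000000.tar -T -)

# List the shard contents to check the ordering.
tar -tf vg-000000.tar
```

For a full dataset, the same idea would be repeated over batches of files to produce several shards rather than one large tar.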