
How to download SBUcaptions and Visual Genome (VG) dataset in webdataset format

Open sanyalsunny111 opened this issue 2 years ago • 4 comments

For vision-and-language pretraining, CC3M, MS COCO, SBU Captions and Visual Genome (VG) are very relevant datasets. I haven't been able to download SBU Captions or VG. Here are my questions.

  1. How do I download the metadata for SBU Captions and VG?
  2. How do I download these datasets in webdataset format?

Could you also please provide me with a tutorial or just some hints to download it in webdataset format using img2dataset? Thank you in advance.

sanyalsunny111 avatar Sep 13 '22 16:09 sanyalsunny111

https://www.cs.rice.edu/~vo9/sbucaptions/ SBU Captions is provided as URLs + captions in JSON. You can use that as the input to img2dataset.

rom1504 avatar Sep 13 '22 23:09 rom1504

https://visualgenome.org/api/v0/api_home.html Visual Genome is not distributed as image URLs, so you can simply download the images and put them in a tar. That's all the webdataset format is.

rom1504 avatar Sep 13 '22 23:09 rom1504
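A minimal sketch of the "make a tar" step, under stated assumptions: the images have already been downloaded into one flat directory as `<id>.jpg`, and any captions have been written next to them as `<id>.txt`. Both the layout and the `pack_shard` helper are hypothetical, not something img2dataset or Visual Genome provides. The webdataset format expects the files belonging to one sample to share a basename and to sit next to each other in the archive, which is why the file list is sorted before packing.

```shell
# pack_shard SRC_DIR OUT_TAR
# Hypothetical helper: packs every .jpg in SRC_DIR, plus any same-named .txt
# caption file, into a single tar shard.
pack_shard() {
  src_dir=$1
  out_tar=$2
  list_file=$(mktemp)
  # Byte-order sort keeps "123.jpg" and "123.txt" adjacent in the archive.
  (cd "$src_dir" && ls | grep -E '\.(jpg|txt)$' | LC_ALL=C sort) > "$list_file"
  # Archive to stdout from inside SRC_DIR so members carry no directory prefix.
  (cd "$src_dir" && tar -cf - -T "$list_file") > "$out_tar"
  rm -f "$list_file"
}

# Example call (paths are placeholders):
#   pack_shard ./vg_images ./vg-000000.tar
```

For a large image set it is usual to split the files across several shards rather than one big tar; the same helper can be called once per sub-directory.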

For SBUcaptions here is my code to download it in webdataset format:

wget https://www.cs.rice.edu/~vo9/sbucaptions/sbu-captions-all.tar.gz

tar -xvzf sbu-captions-all.tar.gz

sed -i '1s/^/caption\turl\n/' sbu-captions-all.json

img2dataset --url_list sbu-captions-all.json --input_format "json" --url_col "URL" --caption_col "TEXT" --output_format webdataset --output_folder sbucaptions --processes_count 16 --thread_count 64 --image_size 256

Here is the error I got: [image]

sanyalsunny111 avatar Sep 14 '22 01:09 sanyalsunny111
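The screenshot of the error is not preserved, so its cause can only be guessed at. The command that is reported as working later in the thread drops the `sed` step and uses different `--url_col` and `--caption_col` values, so checking which keys the JSON file actually contains is a reasonable first step. The sketch below is a crude pattern match, not a JSON parser: it assumes the file is a single JSON object whose top-level keys map to arrays, and the `list_array_keys` name is made up for illustration.

```shell
# list_array_keys JSON_FILE
# Hypothetical helper: prints every key that is directly followed by an array,
# i.e. text of the form "key": [ ... ]. Useful for picking column names.
list_array_keys() {
  grep -o '"[A-Za-z0-9_]*"[[:space:]]*:[[:space:]]*\[' "$1" \
    | cut -d'"' -f2 \
    | sort -u
}

# Example call (run after extracting the archive):
#   list_array_keys sbu-captions-all.json
```

The printed names are the candidates for `--url_col` and `--caption_col`.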

For Visual Genome, I am a bit confused about how to make a tar of an image and its caption, and where to get those captions. Any instructions on that?

sanyalsunny111 avatar Sep 14 '22 01:09 sanyalsunny111

This is the way to download SBU captions. @rom1504, please also add this to your example file.

wget https://www.cs.rice.edu/~vo9/sbucaptions/sbu-captions-all.tar.gz

tar -xvzf sbu-captions-all.tar.gz

img2dataset --url_list sbu-captions-all.json --input_format "json" --url_col "image_urls" --caption_col "captions" --output_format webdataset --output_folder sbucaptions --processes_count 16 --thread_count 64 --image_size 256

sanyalsunny111 avatar Nov 24 '22 02:11 sanyalsunny111
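Once the download finishes, a quick sanity check is to look inside one of the output shards. This is a sketch under assumptions: the shard path in the example (`sbucaptions/00000.tar`) and the `.jpg` extension of the stored images are expected defaults rather than something confirmed in this thread, and `count_samples` is a hypothetical helper.

```shell
# count_samples SHARD_TAR
# Hypothetical helper: counts the .jpg members of a tar shard, which should
# equal the number of image/caption samples stored in it.
count_samples() {
  tar -tf "$1" | grep -c '\.jpg$'
}

# Example calls (the shard name is an assumption):
#   tar -tf sbucaptions/00000.tar | head
#   count_samples sbucaptions/00000.tar
```

If captions were picked up correctly, the listing should show a `.txt` file next to each image with the same basename.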

if you can open a PR to add an example in the dataset folder, it would be great

rom1504 avatar Nov 24 '22 03:11 rom1504

Added a PR as discussed. @rom1504 Could you please let me know how to download VG images and captions as webdataset?

sanyalsunny111 avatar Nov 25 '22 02:11 sanyalsunny111

sbu-captions is done

VG images can be converted using tar directly

rom1504 avatar Dec 21 '22 04:12 rom1504