DALLE-datasets

add a table with dataset sizes

Open rom1504 opened this issue 4 years ago • 3 comments

Having a table of datasets with some information about size and time to download would be useful. https://docs.google.com/document/d/1KCAB-OTHphcCh-4oITIL8r7ih-HuslMKX1Rls_P03CY/edit could serve as complementary information.

rom1504 Jul 07 '21 21:07

I will add information here as I download things. Starting with CC3M: I intend to download it, then produce some CLIP embeddings (using https://github.com/rom1504/clip-retrieval/) and a list of CLIP-filtered files.

Once it's clear enough, I will open a PR to the README.

rom1504 Jul 07 '21 22:07

I downloaded cc3m and cc12m (improving their download script a bit in the process):

  • cc3m took 20h and resulted in 100GB of resized images, 5.6M of them, resized so that one dimension is 320 and the other is larger
  • cc12m took 10h and resulted in 300GB of resized images, 10.7M of them, at size 256
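
To make the two figures above easier to compare, here is a rough sketch that derives average image size and download throughput from them and prints a small table. It only uses the numbers reported in this comment; treating 1 GB as 10^9 bytes is an assumption, and the `summarize` and `markdown_table` helpers are hypothetical names, not part of the repo.

```python
# Figures reported in this thread; 1 GB is assumed to be 10**9 bytes.
DATASETS = [
    # (name, download hours, total size in GB, image count)
    ("cc3m", 20, 100, 5_600_000),
    ("cc12m", 10, 300, 10_700_000),
]


def summarize(name, hours, size_gb, n_images):
    """Derive average image size (KB) and download throughput (images/s)."""
    avg_kb = size_gb * 1e9 / n_images / 1e3
    images_per_s = n_images / (hours * 3600)
    return {
        "dataset": name,
        "hours": hours,
        "size_gb": size_gb,
        "images": n_images,
        "avg_kb": avg_kb,
        "images_per_s": images_per_s,
    }


def markdown_table(rows):
    """Format the summaries as a plain pipe-separated table."""
    lines = [
        "| dataset | time (h) | size (GB) | images | avg KB/image | images/s |",
        "|---|---|---|---|---|---|",
    ]
    for r in rows:
        lines.append(
            f"| {r['dataset']} | {r['hours']} | {r['size_gb']} | {r['images']:,} "
            f"| {r['avg_kb']:.1f} | {r['images_per_s']:.0f} |"
        )
    return "\n".join(lines)


print(markdown_table([summarize(*d) for d in DATASETS]))
```

The throughput column is only as accurate as the rounded hours and sizes quoted above, so it should be read as an order-of-magnitude comparison between the two scripts.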

cc3m can obviously take way less time if using the improved script of cc12m. I confirmed in the process that handling millions of files is painful, so I will make it possible to download directly as a collection of tars (== webdataset format).
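
As an illustration of the "collection of tars" idea, here is a minimal sketch that packs samples into numbered tar shards using only the Python standard library. It is not the script used for the downloads above. The `write_shards` name, the `(key, image_bytes, caption)` sample layout, and the shard size are all assumptions made for the example; the only convention borrowed from webdataset is that files sharing a basename inside a tar belong to the same sample.

```python
import io
import tarfile
from pathlib import Path


def write_shards(samples, out_dir, shard_size=1000):
    """Pack (key, image_bytes, caption) samples into numbered tar shards.

    Each sample becomes two tar members sharing a basename,
    `<key>.jpg` and `<key>.txt`, so a reader can group them back together.
    Returns the list of shard paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    tar = None
    for i, (key, image_bytes, caption) in enumerate(samples):
        # Start a new shard every `shard_size` samples.
        if i % shard_size == 0:
            if tar is not None:
                tar.close()
            path = out_dir / f"{i // shard_size:05d}.tar"
            tar = tarfile.open(path, "w")
            shard_paths.append(path)
        members = ((".jpg", image_bytes), (".txt", caption.encode("utf-8")))
        for suffix, payload in members:
            info = tarfile.TarInfo(name=key + suffix)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    if tar is not None:
        tar.close()
    return shard_paths
```

With a shard size of 1000, 10.7M images would end up in roughly 10,700 tar files instead of millions of loose files, which is the point of the format.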

rom1504 Jul 15 '21 08:07

@rom1504 the doc is no longer available. I want to download the data; can you please help me? I can only find the download_open_images.txt file in the repo. How do I download the data using a text file?

kartikpodugu May 18 '23 01:05