DALLE-datasets

download directly as wds

Open rom1504 opened this issue 3 years ago • 8 comments

This would be much better than having 12M files. I will probably try this, since the number of files is a problem for me with cc12m (writing the 12M captions only takes 10 min), so scaling to a larger number of files simply won't work in the current state.

I might write a generic downloader in the process

rom1504, Jul 13 '21 16:07

https://github.com/robvanvolt/DALLE-datasets/blob/main/utilities/wds_create_shards.py should help

I'm now even more convinced that this is needed after running the cc12m downloader. Linux file systems are bad at handling more than a million files (it can take minutes to delete the files or even list them). cc12m would be 12M files of about 20 kB each, so 240 GB in total; in 256 MB chunks that's only 938 files, which is much more manageable.
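To make the arithmetic above concrete, here is a minimal sketch, not the code in wds_create_shards.py. It recomputes the shard count in decimal units (20 kB, 256 MB) and packs image/caption pairs into plain tar shards using only the standard library. The names `shard_count` and `write_shards`, the `.jpg`/`.txt` extensions, and the zero-padded shard names are illustrative assumptions. The one convention it relies on is that files sharing a basename inside a tar are read as one sample by webdataset.

```python
import io
import os
import tarfile


def shard_count(num_files: int, avg_file_bytes: int, shard_bytes: int) -> int:
    """Number of shards needed to hold num_files files of avg_file_bytes each."""
    total = num_files * avg_file_bytes
    return -(-total // shard_bytes)  # integer ceiling division


def write_shards(samples, out_dir: str, shard_bytes: int) -> list:
    """Pack (key, image_bytes, caption) tuples into tar shards.

    Hypothetical helper: a new shard is started once the payload written
    to the current one reaches shard_bytes. Both files of a sample share
    the same basename so they stay together in one shard.
    """
    paths = []
    tar = None
    written = 0
    for key, image_bytes, caption in samples:
        if tar is None or written >= shard_bytes:
            if tar is not None:
                tar.close()
            path = os.path.join(out_dir, f"{len(paths):05d}.tar")
            tar = tarfile.open(path, "w")
            paths.append(path)
            written = 0
        for ext, payload in (("jpg", image_bytes), ("txt", caption.encode("utf-8"))):
            info = tarfile.TarInfo(name=f"{key}.{ext}")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
            written += len(payload)
    if tar is not None:
        tar.close()
    return paths


# 12M files of 20 kB each, packed into 256 MB shards
print(shard_count(12_000_000, 20_000, 256_000_000))  # 938
```

Writing a few hundred large tar files instead of millions of small ones is what keeps listing and deleting the dataset fast.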

rom1504, Jul 14 '21 15:07

Agree 100%! I actually meant to use the utilities one day to build a direct downloader. I mostly used them separately, so I forgot about the idea of merging these functions, especially given the "crazy" number-of-files-per-folder limit ;) :)

robvanvolt, Jul 14 '21 17:07

Hi @robvanvolt, I ended up building this tool for downloading Crawling at home: https://github.com/rom1504/img2dataset. It can download and resize 100M images in 20h, and it saves them directly as webdataset. It could download cc12m in 2.4h and cc3m in 40 minutes.

What would you think if I did a PR to replace the existing scripts so that they use that tool directly?
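The cc12m and cc3m estimates follow from the 100M-in-20h figure if throughput is roughly constant. A quick sanity check, assuming the nominal dataset sizes of 12M and 3M pairs (the function names are made up for this sketch and are not part of img2dataset):

```python
def images_per_second(num_images: int, hours: float) -> float:
    """Average throughput implied by downloading num_images in the given hours."""
    return num_images / (hours * 3600)


def hours_to_download(num_images: int, rate: float) -> float:
    """Time in hours to fetch num_images at a constant rate (images per second)."""
    return num_images / rate / 3600


rate = images_per_second(100_000_000, 20)        # roughly 1,389 images/s
cc12m_hours = hours_to_download(12_000_000, rate)  # about 2.4 h
cc3m_minutes = hours_to_download(3_000_000, rate) * 60  # about 36 min

print(f"{rate:.0f} images/s, cc12m: {cc12m_hours:.1f} h, cc3m: {cc3m_minutes:.0f} min")
```

The cc3m result of about 36 minutes is consistent with the rounded "40 minutes" quoted above.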

rom1504, Aug 23 '21 23:08

Definitely, awesome work! Can you do a PR so I can merge it? 👍 Also, the download times are amazing! :))

robvanvolt, Aug 24 '21 11:08

Yeah, I will do it soon.

rom1504, Aug 24 '21 12:08

Finally got around to doing this at https://github.com/rom1504/img2dataset/tree/main/examples. I'm not sure how/if I should include that here, as it would mostly delete the existing scripts. What do you think, @robvanvolt?

rom1504, Nov 13 '21 20:11

Really nice!

I will most likely port some features from this repo (not all uploaded to GitHub, like the SVG support) to img2dataset, and use DALLE-datasets for wds examples, wds annotations, dataset sanity checks and other "useful" utilities. img2dataset is already such a powerful tool for downloading directly into wds that this seems to be the most appropriate way to go! :) But I will sleep on it for a few more days x)

robvanvolt, Nov 17 '21 21:11

sounds good!

rom1504, Nov 17 '21 21:11