Implement WebDataset

Open afiaka87 opened this issue 3 years ago • 4 comments

Edit: @robvanvolt is right

WebDataset is perfect for us. Any dataset already in the format expected by our current TextImageDataset can easily be converted to a WebDataset by splitting it into ~512 MiB subdirectories and tarring up each one. Each tar is then a WebDataset "shard" that can be loaded efficiently. WebDataset addresses shards by URL, so there is little distinction between a list of URLs pointing to archives and a list of local paths to the same archives. This helps with distributed training and with massive datasets that can't all fit on disk at once, e.g. Previews (6 TiB), or at least not for many people.
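For anyone who wants to try the conversion described above, here is a rough sketch using only the Python standard library. It assumes a flat folder where each sample is a set of files sharing a stem (e.g. `0001.jpg` and `0001.txt`). The function name `write_shards`, the `shard-000000.tar` naming scheme, and the `max_shard_bytes` parameter are my own placeholders, not anything from DALLE-pytorch or WebDataset. Instead of creating subdirectories first, it writes the tar files directly.

```python
import tarfile
from pathlib import Path


def write_shards(src_dir, out_dir, max_shard_bytes=512 * 1024 * 1024):
    """Pack paired files (same stem, e.g. 0001.jpg + 0001.txt) into tar shards.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Group files by stem so an image and its caption always share a shard.
    # Assumes file names contain a single dot (stem + one extension).
    samples = {}
    for path in sorted(src.iterdir()):
        if path.is_file():
            samples.setdefault(path.stem, []).append(path)

    shard_paths, tar, shard_bytes = [], None, 0
    for stem in sorted(samples):
        files = samples[stem]
        sample_bytes = sum(f.stat().st_size for f in files)
        # Start a new shard when adding this sample would exceed the budget.
        too_big = shard_bytes > 0 and shard_bytes + sample_bytes > max_shard_bytes
        if tar is None or too_big:
            if tar is not None:
                tar.close()
            shard_path = out / f"shard-{len(shard_paths):06d}.tar"
            tar = tarfile.open(shard_path, "w")
            shard_paths.append(shard_path)
            shard_bytes = 0
        # Files of one sample are written back to back, which is the
        # layout a tar-based sample reader needs.
        for f in files:
            tar.add(f, arcname=f.name)
        shard_bytes += sample_bytes
    if tar is not None:
        tar.close()
    return shard_paths
```

The size budget counts raw file bytes only and ignores tar header overhead, so actual shards come out slightly larger than `max_shard_bytes`. The resulting shard list (or the same files served over HTTP) should then be usable as the input to a WebDataset loader.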

afiaka87 avatar Apr 25 '21 08:04 afiaka87

Maybe we should convert our datasets to tar.gz and work with WebDataset loaders? This might be a faster approach than optimizing the reading of text and image folders, wouldn't it? But cool anyway to get some extra speed out of the current approach! (Y)

robvanvolt avatar Apr 26 '21 18:04 robvanvolt

I agree. WebDataset is perfect for our use case.

afiaka87 avatar Apr 29 '21 22:04 afiaka87

Added early beta support for WebDatasets: https://github.com/lucidrains/DALLE-pytorch/pull/280

robvanvolt avatar Jun 01 '21 21:06 robvanvolt

Added full support for WebDatasets: https://github.com/lucidrains/DALLE-pytorch/pull/280

Does anyone want to try out the new feature or review the changes?

robvanvolt avatar Jun 11 '21 18:06 robvanvolt