webdataset
A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.
Hi, is there a way to set the seed for the shuffle function, for reproducibility? Thanks!
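For readers with the same question, here is a minimal sketch of what a seeded shuffle buffer can look like in plain Python. It is not webdataset's own implementation or API; the function name `seeded_shuffle` and its parameters `bufsize` and `seed` are hypothetical and chosen only for illustration. The idea is that all randomness comes from a private `random.Random(seed)` instance, so the same seed and input order give the same output order.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def seeded_shuffle(samples: Iterable[T], bufsize: int, seed: int) -> Iterator[T]:
    """Yield samples in a pseudo-random order that depends only on `seed`.

    Hypothetical helper for illustration; not part of webdataset.
    """
    rng = random.Random(seed)  # private RNG, independent of global random state
    buf: List[T] = []
    for sample in samples:
        buf.append(sample)
        if len(buf) >= bufsize:
            # Swap a randomly chosen element to the end and emit it.
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    # Drain whatever is left in the buffer, also in seeded order.
    rng.shuffle(buf)
    yield from buf


first_run = list(seeded_shuffle(range(20), bufsize=5, seed=0))
second_run = list(seeded_shuffle(range(20), bufsize=5, seed=0))
```

With multiple DataLoader workers, each worker would presumably need its own seed derived from a base seed and the worker id, otherwise every worker shuffles identically.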
I actually wrote a gist implementing this: https://gist.github.com/harpone/3b6003c22295a50cbd3d2cfc566dc115 Uses the torch-xla distributed MpDeviceLoader with shard splitting across accelerators and workers, with checks that all the minibatches are indeed unique. Just...
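The shard-splitting idea mentioned above can be sketched without any accelerator code. This is not the code from the linked gist and does not use torch-xla; the function `split_shards`, the shard names, and the rank and worker counts are all assumptions made for illustration. It assigns each shard to exactly one (rank, worker) pair by striding, then checks that no shard is assigned twice.

```python
from typing import List, Sequence


def split_shards(
    shards: Sequence[str],
    rank: int,
    world_size: int,
    worker_id: int,
    num_workers: int,
) -> List[str]:
    """Return the shards owned by one worker on one accelerator.

    Hypothetical helper: first stride by rank, then by worker within the rank.
    """
    per_rank = shards[rank::world_size]
    return list(per_rank[worker_id::num_workers])


# Made-up shard names and topology, for the uniqueness check only.
all_shards = [f"shard-{i:04d}.tar" for i in range(16)]
world_size, num_workers = 2, 4

assigned: List[str] = []
for rank in range(world_size):
    for worker_id in range(num_workers):
        assigned.extend(
            split_shards(all_shards, rank, world_size, worker_id, num_workers)
        )

# Every shard is read exactly once across all ranks and workers.
no_duplicates = len(assigned) == len(set(assigned))
full_coverage = sorted(assigned) == sorted(all_shards)
```

Striding keeps the per-worker shard counts within one of each other, but it only guarantees unique shards, not equal numbers of samples per worker.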
First of all, thanks for the amazing lib! I hope it makes it into PyTorch core soon. I see that the imagenet example performs no casting to long in order...
When using webdataset with pytorch-lightning, I discovered that if I pass dataloaders to pytorch-lightning as instances of MultiDataset, training will stall on epoch 0. Once I changed the dataloaders to...
ShardWriter does not seem to work with a gcloud URL. However, setting stream to self.fname at this line seems to solve the problem: https://github.com/webdataset/webdataset/blob/main/webdataset/writer.py#L406 Is there a reason the file...
1. Add more extensions supported by Pillow 2. Fix the test error in test_gopen
I'm trying to use webdataset in CI, but it fails when webdataset caching is enabled. To reproduce, use the following Dockerfile: ```docker FROM ubuntu:20.04 ENV LANG=C.UTF-8 RUN apt-get update &&...
Hi, I'm using webdataset with S3 and multiple shards. I'm using automatic sharding to avoid downloading the data more than once. It's not clear from the docs if webdataset downloads...
``` Original Traceback (most recent call last): File "/home/dome/.local/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop data = fetcher.fetch(index) File "/home/dome/.local/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch data.append(next(self.dataset_iter)) File "/home/dome/.local/lib/python3.7/site-packages/webdataset/pipeline.py", line 68, in iterator for...
- Problem: I am trying to convert the [ImageNet-C](https://drive.google.com/drive/folders/1HDVw6CmX3HiG0ODFtI75iIfBDxSiSz2K) dataset (an image classification dataset) into the webdataset tar file format. The original dataset includes 4 tar files that store image samples. The size of...