webdataset
A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.
Hey! Thanks for your great work on this library. I encountered very weird training behaviour with wds. I generated my tar files all in one folder as follows: `training_faa0dfb1-9775-4279-8146-f251997958c4.tar`...
Hi, I'm confused about the new readme "complete pipeline" example. Why does it do `dataset.batched(16)`, then `wds.WebLoader(..., batch_size=8)`, then `.unbatched()`, then `.batched(12)`? It says, "batch in the dataset and then...
Clarification on epoch based 'at least once' pipeline for distributed, workers > 1, batched scenario
I've been working with 0.2.x recently, updating a project that was using 0.1 and adding support to another. I have a rough template for a custom pipeline, but there are...
Hi, I see that the documentation for multinode still shows that we need to set `nodesplitter`, but that's no longer an argument for `WebDataset`. Is there any reason why this...
Currently it is impossible to re-create bit-exact Webdatasets, as each file in the Tar archive has a different mtime. This has slightly annoying implications for file caching and versioning, as...
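To make the mtime point concrete, here is a minimal sketch of writing a tar archive with fixed metadata so that two runs produce identical bytes. It uses only Python's standard `tarfile` module, not webdataset's own writer; the function name `write_deterministic_tar`, the choice of `GNU_FORMAT`, and the zeroed mtime/owner fields are assumptions made for illustration, not the library's behaviour.

```python
import io
import tarfile


def write_deterministic_tar(samples):
    """Serialize {member_name: bytes} into tar bytes with fixed metadata.

    Hypothetical helper for illustration only; not part of webdataset.
    """
    buffer = io.BytesIO()
    # Pin the archive format so the output does not depend on the Python default.
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.GNU_FORMAT) as archive:
        # Sort member names so insertion order does not affect the bytes.
        for name in sorted(samples):
            payload = samples[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            info.mtime = 0              # fixed timestamp instead of "now"
            info.uid = info.gid = 0     # no dependence on the current user
            info.uname = info.gname = ""
            info.mode = 0o644
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


if __name__ == "__main__":
    import hashlib

    data = {"000000.txt": b"first sample", "000001.txt": b"second sample"}
    first = write_deterministic_tar(data)
    second = write_deterministic_tar(data)
    # Both digests match because no per-run metadata ends up in the archive.
    print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
```

With the timestamp and ownership pinned, a content hash of the shard can serve as a cache or version key. Whether a fixed mtime is acceptable depends on whether anything downstream reads the timestamps.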
ISSUE: Build fails. CAUSE: Missing line in requirements.txt (see https://github.com/webdataset/webdataset/blob/main/webdataset/shardlists.py#L14). FIX: Add the missing requirement.
Please have a look at https://github.com/webdataset/webdataset/blob/682b30ee484d719a954554654d2d6baa213f9371/webdataset/compat.py#L96-L108 When the input `urls` is a string like `data-{000..123}.tar`, it seems that wds just appends both nodesplitter and workersplitter twice, with the result that the yielded data is...
I've been trying to duplicate the compose implementation given in the documentation, but copying the source gives me the following error. > import matplotlib.pyplot as plt > import torch.utils.data >...
Hi, thanks for this great project. I have a few questions on using WebDataset with TensorFlow. I referred to this [github repository](https://github.com/webdataset/webdataset-tensorflow/blob/main/resnet-multi.py) to set up the data and model trainer...