Roughly, the way PyTorch handles spawn is this: the workers run in subprocesses where torch.distributed is not initialized; in addition, the Dataset instances are restored from pickled data on...
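To make that concrete, here is a minimal sketch (not from the thread; `open_stream` and the sharding logic are stand-ins) of the lazy-initialization pattern this implies: keep only picklable state in `__init__`, and create anything process-specific inside the worker after unpickling.

```python
from torch.utils.data import IterableDataset, get_worker_info

def open_stream(urls, worker):
    # Stand-in for real per-process setup (opening pipes, process groups, ...).
    wid = worker.id if worker is not None else 0
    step = worker.num_workers if worker is not None else 1
    yield from urls[wid::step]

class LazyInitDataset(IterableDataset):
    def __init__(self, urls):
        self.urls = urls    # keep only picklable state; this object gets pickled
        self.stream = None  # process-specific state is created after unpickling

    def __iter__(self):
        # Executes inside the spawned worker, after the Dataset is restored
        # from pickled data; this is the safe place for per-process setup.
        info = get_worker_info()  # None in the main process, set in workers
        if self.stream is None:
            self.stream = open_stream(self.urls, info)
        yield from self.stream
```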
In WebDataset, data is primarily shuffled at the shard level, and you now get the equivalent of PyTorch's samplers: the `shardlist=` argument. Your `shardlist` class can sample the shards...
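As an illustration (the class and parameter names here are my own), a shard list that samples a few shards per epoch might look like this; it assumes, as `wds.PytorchShardList` does, that a shard list is just an IterableDataset yielding `dict(url=...)` records:

```python
import random
from torch.utils.data import IterableDataset

class SampledShardList(IterableDataset):
    """Hypothetical shard sampler: draws a random subset of shards per epoch."""

    def __init__(self, urls, nsample=4, seed=0):
        self.urls = list(urls)
        self.nsample = nsample
        self.seed = seed
        self.epoch = 0  # note: each DataLoader worker holds its own copy

    def __iter__(self):
        rng = random.Random(self.seed + self.epoch)
        self.epoch += 1
        for url in rng.sample(self.urls, self.nsample):
            yield dict(url=url)
```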
Yes, sorry. You can now simply write:
```python
shardlist = wds.PytorchShardList("imagenet-{000000..000015}.tgz", epoch_shuffle=True)
dataset = wds.WebDataset(shardlist, ...)
...
loader = wds.WebLoader(dataset, num_workers=4, batch_size=20)
```
That is, you can either pass in URLs...
There are a bunch of different tradeoffs you need to consider for optimal performance. First, you can speed up loading by doing the `select(...)` before the `decode` and any data...
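For instance, a sketch of that ordering, assuming the fluent `select(...)` stage (the `cls` field name and the predicate are made up for this example):

```python
import webdataset as wds

def keep(sample):
    # Runs on the raw sample dict (bytes), before any decoding happens.
    return sample.get("cls", b"") != b""

dataset = (
    wds.WebDataset("imagenet-{000000..000015}.tgz")
    .select(keep)            # cheap: filters undecoded samples
    .decode("pil")           # expensive: decoding only runs on kept samples
    .to_tuple("jpg", "cls")
)
```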
Yes, unfortunately, there are lots of subtle interactions in the way the DataLoader handles worker processes and internal state right now (torchdata will clean this up). I'll have to add...
I'm not sure why gsutil would fail. Does it fail if you just run `gsutil cat ... > /dev/null`? Does it fail if you use zero workers? You can use...
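One way to narrow it down (a sketch; the bucket path is made up) is to read the same `pipe:` URL with `num_workers=0`, so any gsutil failure surfaces directly in the main process:

```python
import webdataset as wds
from torch.utils.data import DataLoader

url = "pipe:gsutil cat gs://some-bucket/imagenet-{000000..000015}.tgz"  # hypothetical bucket
dataset = wds.WebDataset(url)
loader = DataLoader(dataset, num_workers=0, batch_size=None)

for i, sample in enumerate(loader):
    if i >= 100:  # a few samples are enough to exercise the pipe
        break
```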
If you are getting errors in a subprogram and want to ignore them and simultaneously log the errors, just use something like this (the log file path is just illustrative):
```python
url = "pipe:s3cmd cat ... 2>> /tmp/errors.log || true"
```
I'm not sure where the problems with gsutil are coming from; we haven't seen those problems with subprocesses, and I can't reproduce this. I have several possible solutions:
- use...
Yes, my suggestion for caching was just to enable pre-copying; that is, if the cache size is smaller than the dataset, shards will just get downloaded, used, and then quickly...
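In code, that might look like this (a sketch only; the `cache_dir`/`cache_size` keywords are assumed from webdataset's caching support, and the bucket path is made up):

```python
import webdataset as wds

dataset = wds.WebDataset(
    "pipe:gsutil cat gs://some-bucket/shard-{000000..000099}.tgz",
    cache_dir="/tmp/wds-cache",  # shards are copied here before being read
    cache_size=10 * 2**30,       # with a 10 GB cap, older shards get evicted
)
```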
You don't have to use a streaming transfer; you can also use something like: `gsutil cp $url /tmp/$$ && cat /tmp/$$; rm -f /tmp/$$` You can either put that directly...
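For example (a sketch; the bucket path is made up), the copy-then-cat command can go directly into a `pipe:` URL:

```python
import webdataset as wds

# $$ expands to each worker shell's PID, so concurrent workers
# write to distinct temporary files.
url = (
    "pipe:gsutil cp gs://some-bucket/shard-{000000..000015}.tgz /tmp/$$ "
    "&& cat /tmp/$$; rm -f /tmp/$$"
)
dataset = wds.WebDataset(url)
```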