Roughly, the way PyTorch handles spawn is this: the workers run in subprocesses where torch.distributed is not initialized; in addition, the Dataset instances are restored from pickled data on...
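To make that concrete, here is a minimal sketch (not from the thread; `open_stream` and the sharding logic are stand-ins) of the lazy-initialization pattern this implies: keep only picklable state in `__init__`, and create anything process-specific inside the worker after unpickling.

```python
from torch.utils.data import IterableDataset, get_worker_info

def open_stream(urls, worker):
    # Stand-in for real per-process setup (opening pipes, process groups, ...).
    wid = worker.id if worker is not None else 0
    step = worker.num_workers if worker is not None else 1
    yield from urls[wid::step]

class LazyInitDataset(IterableDataset):
    def __init__(self, urls):
        self.urls = urls    # keep only picklable state; this object gets pickled
        self.stream = None  # process-specific state is created after unpickling

    def __iter__(self):
        # Executes inside the spawned worker, after the Dataset is restored
        # from pickled data; this is the safe place for per-process setup.
        info = get_worker_info()  # None in the main process, set in workers
        if self.stream is None:
            self.stream = open_stream(self.urls, info)
        yield from self.stream
```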
In WebDataset, data is primarily shuffled at the shard level, and you now get the equivalent of PyTorch's samplers: the `shardlist=` argument. Your `shardlist` class can sample the shards...
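As an illustration (the class and parameter names here are my own), a shard list that samples a few shards per epoch might look like this; it assumes, as `wds.PytorchShardList` does, that a shard list is just an IterableDataset yielding `dict(url=...)` records:

```python
import random
from torch.utils.data import IterableDataset

class SampledShardList(IterableDataset):
    """Hypothetical shard sampler: draws a random subset of shards per epoch."""

    def __init__(self, urls, nsample=4, seed=0):
        self.urls = list(urls)
        self.nsample = nsample
        self.seed = seed
        self.epoch = 0  # note: each DataLoader worker holds its own copy

    def __iter__(self):
        rng = random.Random(self.seed + self.epoch)
        self.epoch += 1
        for url in rng.sample(self.urls, self.nsample):
            yield dict(url=url)
```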
Yes, sorry. You can now simply write:
```python
shardlist = wds.PytorchShardList("imagenet-{000000..000015}.tgz", epoch_shuffle=True)
dataset = wds.WebDataset(shardlist, ...)
...
loader = wds.WebLoader(dataset, num_workers=4, batch_size=20)
```
That is, you can either pass in URLs...
There are a bunch of different tradeoffs you need to consider for optimal performance. First, you can speed up loading by doing the `select(...)` before the `decode` and any data...
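For instance, a sketch of that ordering, assuming the fluent `select(...)` stage (the `cls` field name and the predicate are made up for this example):

```python
import webdataset as wds

def keep(sample):
    # Runs on the raw sample dict (bytes), before any decoding happens.
    return sample.get("cls", b"") != b""

dataset = (
    wds.WebDataset("imagenet-{000000..000015}.tgz")
    .select(keep)            # cheap: filters undecoded samples
    .decode("pil")           # expensive: decoding only runs on kept samples
    .to_tuple("jpg", "cls")
)
```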
Yes, unfortunately, there are lots of subtle interactions in the way the DataLoader handles worker processes and internal state right now (torchdata will clean this up). I'll have to add...
I'm not sure why gsutil would fail. Does it fail if you just run `gsutil cat ... > /dev/null`? Does it fail if you use zero workers? You can use...
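One way to narrow it down (a sketch; the bucket path is made up) is to read the same `pipe:` URL with `num_workers=0`, so any gsutil failure surfaces directly in the main process:

```python
import webdataset as wds
from torch.utils.data import DataLoader

url = "pipe:gsutil cat gs://some-bucket/imagenet-{000000..000015}.tgz"  # hypothetical bucket
dataset = wds.WebDataset(url)
loader = DataLoader(dataset, num_workers=0, batch_size=None)

for i, sample in enumerate(loader):
    if i >= 100:  # a few samples are enough to exercise the pipe
        break
```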
If you are getting errors in a subprogram and want to ignore them and simultaneously log the errors, just use something like this (the log file path is just illustrative):
```python
url = "pipe:s3cmd cat ... 2>> /tmp/errors.log || true"
```
I'm not sure where the problems with gsutil are coming from; we haven't seen those problems with subprocesses, and I can't reproduce this. I have several possible solutions:
- use...
Yes, my suggestion for caching was just to enable pre-copying; that is, if the cache size is smaller than the dataset, shards will just get downloaded, used, and then quickly...
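In code, that might look like this (a sketch only; the `cache_dir`/`cache_size` keywords are assumed from webdataset's caching support, and the bucket path is made up):

```python
import webdataset as wds

dataset = wds.WebDataset(
    "pipe:gsutil cat gs://some-bucket/shard-{000000..000099}.tgz",
    cache_dir="/tmp/wds-cache",  # shards are copied here before being read
    cache_size=10 * 2**30,       # with a 10 GB cap, older shards get evicted
)
```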
You don't have to use a streaming transfer; you can also use something like: `gsutil cp $url /tmp/$$ && cat /tmp/$$; rm -f /tmp/$$` You can either put that directly...
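For example (a sketch; the bucket path is made up), the copy-then-cat command can go directly into a `pipe:` URL:

```python
import webdataset as wds

# $$ expands to each worker shell's PID, so concurrent workers
# write to distinct temporary files.
url = (
    "pipe:gsutil cp gs://some-bucket/shard-{000000..000015}.tgz /tmp/$$ "
    "&& cat /tmp/$$; rm -f /tmp/$$"
)
dataset = wds.WebDataset(url)
```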