
Multinode documentation

MrRobot2211 opened this issue 3 years ago · 2 comments

Hi, I see that the multinode documentation still shows that we need to set nodesplitter, but that's no longer an argument for WebDataset.

Is there any reason why this argument was dropped? I am trying to make this work with XLA.

MrRobot2211 (Jan 16 '22)

I recommend using the v2 branch. Among other things, node and worker splitting are explicit in v2. There is a backwards-compatible wrapper, so the switch should be easy.

import webdataset as wds
from webdataset import autodecode

dataset = wds.DataPipeline(
    wds.SimpleShardList("source-{000000..000999}.tar"),
    wds.split_by_node,        # split the shard list across nodes/ranks
    wds.split_by_worker,      # then across DataLoader workers within each rank
    wds.non_empty,
    wds.tarfile_samples,
    wds.shuffle(100),
    wds.decode(autodecode.ImageHandler("rgb")),
    wds.to_tuple("png", "cls"),
)
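
A minimal sketch of how such a pipeline is typically consumed, assuming a standard PyTorch DataLoader (batching could also be done inside the pipeline, e.g. with wds.batched):

from torch.utils.data import DataLoader

# batch_size=None: the loader yields individual samples; each of the 4 workers
# on each rank iterates over its own disjoint subset of shards.
loader = DataLoader(dataset, batch_size=None, num_workers=4)
for image, cls in loader:
    ...  # training step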

tmbdev (Jan 18 '22)

I have a follow-up question about this. Currently, the README suggests that the following pipeline will automatically split by node and worker:

dataset = wds.DataPipeline(
    wds.ResampledShards("source-{000000..000999}.tar"),
    wds.non_empty,
    wds.tarfile_samples,
    wds.shuffle(100),
    wds.decode(autodecode.ImageHandler("rgb")),
    wds.to_tuple("png", "cls"),    
)

Is this equivalent to the method above? Thanks!

fattorib (Mar 29 '22)

They are not equivalent. Doing the explicit split means that during distributed data-parallel training, each rank gets its own unique share of the dataset, with no overlap with the other ranks. ResampledShards, on the other hand, gives each rank an infinite draw of random shards (with replacement, meaning the same rank can draw the same shard more than once), independent of the other ranks, so there can be some overlap.
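
To make the difference concrete, here is an illustrative sketch (not the library's actual implementation) of the two strategies over a toy shard list:

import random

shards = [f"source-{i:06d}.tar" for i in range(1000)]
world_size, rank = 4, 1  # example values; normally obtained from torch.distributed

# Explicit splitting: each rank gets a disjoint slice, so there is no overlap across ranks.
split_shards = shards[rank::world_size]

# ResampledShards: each rank draws shards at random, with replacement,
# independently of the other ranks, so repeats and cross-rank overlap can occur.
resampled_shards = [random.choice(shards) for _ in range(len(split_shards))]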

jrcavani (Sep 28 '23)

Note that resampling after splitting results in slightly uneven sample probabilities: for example, if the shards do not divide evenly across ranks, samples on the ranks that hold fewer shards get drawn somewhat more often.

The ResampledShards implementation works great for large-scale training with fast object stores, which is the case on high-end compute clusters and when training in the cloud.

I have added a separate implementation of WebDataset called "wids" that is more suitable for the kind of multinode training people are used to. In particular, it provides an indexed dataset and uses a locality-aware distributed sampler. Look in ./examples to see how that works.
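
A rough sketch of what using wids might look like; the class names below (ShardListDataset, DistributedChunkedSampler) are assumptions based on the examples, so treat ./examples as authoritative:

import wids
from torch.utils.data import DataLoader

# "shards.json" is a hypothetical shard index listing shard URLs and sample counts.
dataset = wids.ShardListDataset("shards.json")
sample = dataset[1234]  # indexed access, unlike the streaming pipelines above

# Locality-aware distributed sampling: each rank works through a few shards at a
# time instead of random-accessing all of them.
sampler = wids.DistributedChunkedSampler(dataset, chunksize=1000, shuffle=True)
loader = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=4)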

tmbdev (Jan 04 '24)