returnn
Dataset: implement global `dataset_distribution` option
Closes #1634
I think before merging this should get a dedicated test around the sharding.
Well then this should be a draft for now?
@albertz Do you think this needs a test around the config processing?
I'm not exactly sure what you mean by that.
But e.g. there could be a test with PyTorch DataLoader num_workers=2 which checks that all data from the dataset was properly covered.
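Roughly, such a coverage test could look like this (a minimal sketch only, not the actual RETURNN test code; the dataset and test names are made up). An IterableDataset shards by worker id, and with num_workers=2 we check that every seq is yielded exactly once:

```python
import torch.utils.data as torch_data


class _ShardedRangeDataset(torch_data.IterableDataset):
    """Yields only the seq indices of its own shard, derived from the DataLoader worker info."""

    def __init__(self, total_num_seqs: int):
        super().__init__()
        self.total_num_seqs = total_num_seqs

    def __iter__(self):
        worker_info = torch_data.get_worker_info()
        num_shards = worker_info.num_workers if worker_info else 1
        shard_index = worker_info.id if worker_info else 0
        return iter(range(shard_index, self.total_num_seqs, num_shards))


def test_dataloader_num_workers_covers_all_data():
    total_num_seqs = 11  # odd on purpose, so the two shards have unequal size
    loader = torch_data.DataLoader(
        _ShardedRangeDataset(total_num_seqs), num_workers=2, batch_size=None
    )
    seqs = list(loader)
    # Complete and without duplicates, even though the workers finish at different times.
    assert sorted(seqs) == list(range(total_num_seqs))
```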
@albertz Can you give another review for this PR?
To clarify: this also fixes #1678?
Yes.
Sorry for introducing the small conflict, but my change should fix #1678 and #1737 already, and shouldn't really cause any issues to merge with the PR here.
Btw, also see #1738. Not sure if this is relevant here.
Can you summarize what this PR does? I will also try to write some summaries here myself, but please edit your main description of the PR to cover that as well.
(Summary) Added feature: when torch.utils.data.DataLoader is used with num_workers>1, this will set the sharding accordingly. (This is independent of the newly introduced global dataset_distribution option.)
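For illustration, the mapping this summary describes could be sketched like this (illustrative only, not the actual RETURNN code; the helper name is made up). Each DataLoader worker derives its shard from torch.utils.data.get_worker_info():

```python
from torch.utils.data import get_worker_info


def _get_worker_shard():
    """Returns (shard_index, num_shards) for the current DataLoader worker."""
    worker_info = get_worker_info()
    if worker_info is None:  # num_workers=0: the main process loads the data itself
        return 0, 1
    return worker_info.id, worker_info.num_workers
```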
Btw, some questions regarding this:
Just to confirm: this is independent of the newly introduced global dataset_distribution option?
What happens when this is used together with distributed training? Will it set num_shards = distrib_world_size * dataloader_num_workers then? (See the sketch after these questions.)
Is the order of seqs you get from the DataLoader deterministic?
Will it always be complete? E.g. if one worker yields more seqs than the other (e.g. total num seqs is 11, and 2 workers), will the DataLoader only finish once all workers have finished?
Shouldn't _get_random_seed_for_epoch also consider num_shards/shard_index? Or only in the case of dataset_distribution=="shard"?
Or not because random_seed_offset already covers this part? (But I find it a bit inconsistent that epoch/partition_epoch is handled here but shard_index/num_shards elsewhere...)
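Regarding the distributed-training question above, a hypothetical sketch of how the two levels could be flattened into a single sharding (whether the PR actually does it this way is exactly what is being asked; the helper name is made up):

```python
import torch.distributed as dist
from torch.utils.data import get_worker_info


def _get_global_shard():
    """Returns (shard_index, num_shards) across all ranks and DataLoader workers."""
    if dist.is_available() and dist.is_initialized():
        rank, world_size = dist.get_rank(), dist.get_world_size()
    else:
        rank, world_size = 0, 1
    worker_info = get_worker_info()
    worker_id, num_workers = (worker_info.id, worker_info.num_workers) if worker_info else (0, 1)
    # num_shards = distrib_world_size * dataloader_num_workers, as asked above
    return rank * num_workers + worker_id, world_size * num_workers
```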
(Summary) New global config option dataset_distribution, which can be either set to "random_seed_offset" (default) or "shard". This is for distributed training. "shard" will enable sharding for the dataset, so on N GPUs, processing one full epoch will only go through the data once, unlike with "random_seed_offset", where one full epoch sees all the data N times (each worker with different random seed).
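As a usage sketch (assuming the usual RETURNN config conventions; the torch_distributed settings shown here are just an example, not mandated by this PR):

```python
# RETURNN config excerpt (sketch)
backend = "torch"
torch_distributed = {}  # enable distributed multi-GPU training
dataset_distribution = "shard"  # default is "random_seed_offset"
```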