
Dataset: implement global `dataset_distribution` option

Open NeoLegends opened this issue 10 months ago • 4 comments

Closes #1634

NeoLegends avatar Jan 16 '25 13:01 NeoLegends

> I think before merging this should get a dedicated test around the sharding.

Well then this should be a draft for now?

albertz avatar Jan 19 '25 19:01 albertz

@albertz Do you think this needs a test around the config processing?

NeoLegends avatar Feb 06 '25 10:02 NeoLegends

> @albertz Do you think this needs a test around the config processing?

I'm not exactly sure what you mean by that.

But e.g. there could be a test with PyTorch DataLoader num_workers=2 which checks that all data from the dataset was properly covered.

albertz avatar Feb 06 '25 11:02 albertz
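A coverage test along those lines could look like the following sketch. This is an illustration, not RETURNN's actual test code: `ShardedDataset` is a hypothetical stand-in for a dataset that shards by DataLoader worker, and the test checks that two workers together yield every sequence exactly once.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info


class ShardedDataset(IterableDataset):
    """Illustrative dataset: each worker yields every num_workers-th item."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        info = get_worker_info()
        shard_index = info.id if info else 0
        num_shards = info.num_workers if info else 1
        # Disjoint slices: worker i takes items i, i+n, i+2n, ...
        return iter(self.data[shard_index::num_shards])


def test_coverage():
    data = list(range(11))  # odd count, so the two workers get unequal shares
    loader = DataLoader(ShardedDataset(data), num_workers=2, batch_size=None)
    seen = sorted(int(x) for x in loader)
    # All data must be covered exactly once, with no duplicates or gaps.
    assert seen == data, f"missing or duplicated seqs: {seen}"
```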

@albertz Can you give another review for this PR?

NeoLegends avatar Apr 23 '25 12:04 NeoLegends

To clarify: this also fixes #1678?

albertz avatar Jul 17 '25 08:07 albertz

Yes.

NeoLegends avatar Jul 17 '25 09:07 NeoLegends

Sorry for introducing the small conflict, but my change should fix #1678 and #1737 already, and shouldn't really cause any issues to merge with the PR here.

albertz avatar Jul 17 '25 10:07 albertz

Btw, also see #1738. Not sure if this is relevant here.

albertz avatar Jul 17 '25 10:07 albertz

Can you summarize what this PR does? I will also try to write some summaries here myself, but please edit your main description of the PR to cover that as well.

albertz avatar Jul 17 '25 16:07 albertz

(Summary) Added feature: when torch.utils.data.DataLoader is used with num_workers>1, this will set the sharding accordingly. (This is independent of the newly introduced global dataset_distribution option.)

Btw, some questions regarding this:

Just to confirm: this is independent of the newly introduced global dataset_distribution option?

What happens when this is used together with distributed training? Will it set num_shards = distrib_world_size * dataloader_num_workers then?

Is the order of seqs you get from the DataLoader deterministic?

Will it always be complete? E.g. if one worker returns more seqs than the other (e.g. total num seqs is 11, and 2 workers), will the DataLoader only finish once all the workers have finished?

albertz avatar Jul 17 '25 16:07 albertz
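The second question above (num_shards = distrib_world_size * dataloader_num_workers) can be sketched as follows. This is an assumption about how the combination could work, not necessarily what the PR implements: each (rank, worker) pair maps to a unique shard out of the product.

```python
def effective_shard(distrib_rank, distrib_world_size, worker_id, num_workers):
    """Combine the distributed rank and the DataLoader worker id
    into one global (shard_index, num_shards) pair."""
    num_shards = distrib_world_size * num_workers
    shard_index = distrib_rank * num_workers + worker_id
    return shard_index, num_shards


# E.g. 2 GPUs x 3 DataLoader workers -> 6 shards, indices 0..5,
# each assigned to exactly one (rank, worker) pair.
shards = sorted(
    effective_shard(rank, 2, worker, 3)[0]
    for rank in range(2)
    for worker in range(3)
)
assert shards == list(range(6))
```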

Shouldn't _get_random_seed_for_epoch also consider num_shards/shard_index? Or only in the case of dataset_distribution=="shard"?

Or not because random_seed_offset already covers this part? (But I find it a bit inconsistent that epoch/partition_epoch is handled here but shard_index/num_shards elsewhere...)

albertz avatar Jul 17 '25 18:07 albertz
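Purely to illustrate the two designs being contrasted in the question above (neither function is RETURNN's actual implementation): the seed could mix in the shard index, so every shard shuffles differently, or it could stay shard-independent, so all shards shuffle identically and only take disjoint slices at iteration time.

```python
def seed_with_shard(base_seed, epoch, num_shards, shard_index):
    # Every shard shuffles with a different seed.
    return base_seed + epoch * num_shards + shard_index


def seed_without_shard(base_seed, epoch):
    # All shards shuffle identically; sharding happens
    # only when each shard takes its slice of the seq order.
    return base_seed + epoch
```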

(Summary) New global config option dataset_distribution, which can be set to either "random_seed_offset" (default) or "shard". This is for distributed training. "shard" will enable sharding for the dataset, so on N GPUs, processing one full epoch will only go through the data once, unlike with "random_seed_offset", where one full epoch sees all the data N times (each worker with a different random seed).

albertz avatar Jul 17 '25 19:07 albertz
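Based on the summary above, usage in a RETURNN config would presumably look like this (the option name is from this PR; the surrounding config layout is illustrative):

```python
# Default: every GPU sees the full data each epoch,
# each with a different random seed.
dataset_distribution = "random_seed_offset"

# New behavior: data is sharded across the N GPUs, so one full
# epoch goes through the data exactly once in total.
# dataset_distribution = "shard"
```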