DistributeFilesDataset _num_shards issue
With the latest RETURNN, when I use DistributeFilesDataset, I get this error:
File "/nas/models/asr/am/multilingual/16kHz/2024-11-08--jxu-best-rq-pretrain/work/i6_core/tools/git/CloneGitRepositoryJob.LD5f1wKK7LPo/output/returnn/returnn/datasets/basic.py", line 227, in Dataset._create_from_reduce
line: ds = cls(**kwargs)
locals:
ds = <not found>
cls = <local> <class 'returnn.datasets.distrib_files.DistributeFilesDataset'>
kwargs = <local> {'files': ['/ssd/jxu/nas/data/speech/FR_FR/16kHz/EPPS/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-25/EPPS-batch.1.v1.hdf.15', '/ssd/jxu/nas/data/speech/EN_US/16kHz/NEWS.HQ/corpus/batch.2.NPR.v3/hdf-raw_wav.16kHz.split-261/NEWS.HQ-batch.2.NP
R.v3.hdf.7', '/ssd/jxu/nas/data/speech/IT_IT/16kHz/IT.parli..., len = 25
File "/nas/models/asr/am/multilingual/16kHz/2024-11-08--jxu-best-rq-pretrain/work/i6_core/tools/git/CloneGitRepositoryJob.LD5f1wKK7LPo/output/returnn/returnn/datasets/distrib_files.py", line 171, in DistributeFilesDataset.__init__
line: assert self._num_shards == 1 and self._shard_index == 0, ( # ensure defaults are set
f"{self}: Cannot use both dataset-sharding via properties _num_shards and _shard index "
f"and {self.__class__.__name__}'s own sharding implementation based on the trainings rank and size."
)
locals:
self = <local> <DistributeFilesDataset 'train' epoch=None>
self._num_shards = <local> 8
self._shard_index = <local> 6
DistributeFilesDataset inherits from CachedDataset2, which in turn inherits from Dataset, so _num_shards should be set to 1 in __init__. I am not sure how self._num_shards gets changed to the number of GPUs in my case.
(cc @NeoLegends, @michelwi)
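For debugging, one way to track down where _num_shards is mutated would be to trap attribute assignment and print the stack. This is a hypothetical aid, not anything in RETURNN:

```python
import traceback


class TraceShardAttrs:
    """Hypothetical mixin: prints a stack trace whenever _num_shards or
    _shard_index is assigned, to find out who sets them."""

    def __setattr__(self, name, value):
        if name in ("_num_shards", "_shard_index"):
            print(f"setting {name} = {value!r} from:")
            traceback.print_stack()
        super().__setattr__(name, value)


# usage sketch: put the mixin in front of the dataset class under inspection,
# e.g.: class TracedDataset(TraceShardAttrs, DistributeFilesDataset): ...
```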
As you told me before, you used a RETURNN version from 2024-07, where it was working fine.
What is the dataset config? What is the training config (distributed setting)?
The dataset config is like this:
```python
train = {
    "buffer_size": 100,
    "class": "DistributeFilesDataset",
    "distrib_shard_files": True,
    "get_sub_epoch_dataset": get_sub_epoch_dataset,
    "partition_epoch": 200,
    "seq_ordering": "random",
}
```
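For reference, get_sub_epoch_dataset maps each shard of files to a sub-dataset dict. A minimal sketch of what it could look like here (the HDFDataset wrapping is an assumption about this setup, based on the HDF paths in the traceback):

```python
def get_sub_epoch_dataset(files):
    # assumption: the distributed files are HDF files (as in the paths above),
    # so each sub-epoch shard is wrapped in a plain HDFDataset
    return {
        "class": "HDFDataset",
        "files": files,
        "use_cache_manager": True,  # assumption; depends on the cluster setup
    }
```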
The issue was likely introduced in https://github.com/rwth-i6/returnn/pull/1630, which added _num_shards and _shard_index as __init__ parameters to Dataset. Parameters with a leading underscore are pickled by Dataset.__reduce__, which means the values were carried over into subprocesses. There, the assertion in DistributeFilesDataset.__init__ is triggered, since __init__ does not know that it is being called as part of unpickling. I believe these properties simply shouldn't be pickled for DistributeFilesDataset. #1676 is going to be a fix, but I can also ship a smaller fix (#1685) before the larger #1676 lands.
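To illustrate the mechanism, here is a minimal, self-contained sketch of the general pattern (not RETURNN's actual code):

```python
import pickle


def _create_from_reduce(cls, kwargs):
    # simplified stand-in for Dataset._create_from_reduce
    return cls(**kwargs)


class Base:
    # stand-in for Dataset: sharding params default to "no sharding"
    def __init__(self, _num_shards=1, _shard_index=0):
        self._num_shards = _num_shards
        self._shard_index = _shard_index

    def __reduce__(self):
        # like Dataset.__reduce__: leading-underscore init params are read
        # back from the instance and re-passed to __init__ on unpickling
        kwargs = {"_num_shards": self._num_shards, "_shard_index": self._shard_index}
        return _create_from_reduce, (type(self), kwargs)


class Sharded(Base):
    # stand-in for DistributeFilesDataset: it shards by itself (via the
    # distributed-training rank/size), so it insists on the base defaults
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        assert self._num_shards == 1 and self._shard_index == 0


ds = Sharded()
ds._num_shards, ds._shard_index = 8, 6  # later set by the distributed setup
pickle.loads(pickle.dumps(ds))  # AssertionError, as in the traceback above
```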
I just pushed a simple fix for this. Can you check whether it works now?