
DistributeFilesDataset _num_shards issue

Open Judyxujj opened this issue 11 months ago • 4 comments

With the latest RETURNN, when I use DistributeFilesDataset, I get this error:

File "/nas/models/asr/am/multilingual/16kHz/2024-11-08--jxu-best-rq-pretrain/work/i6_core/tools/git/CloneGitRepositoryJob.LD5f1wKK7LPo/output/returnn/returnn/datasets/basic.py", line 227, in Dataset._create_from_reduce                     
  line: ds = cls(**kwargs)                                                                                                                     
  locals:                                                                                                                              
   ds = <not found>                                                                                                                        
   cls = <local> <class 'returnn.datasets.distrib_files.DistributeFilesDataset'>                                                                                          
   kwargs = <local> {'files': ['/ssd/jxu/nas/data/speech/FR_FR/16kHz/EPPS/corpus/batch.1.v1/hdf-raw_wav.16kHz.split-25/EPPS-batch.1.v1.hdf.15', '/ssd/jxu/nas/data/speech/EN_US/16kHz/NEWS.HQ/corpus/batch.2.NPR.v3/hdf-raw_wav.16kHz.split-261/NEWS.HQ-batch.2.NPR.v3.hdf.7', '/ssd/jxu/nas/data/speech/IT_IT/16kHz/IT.parli..., len = 25
 File "/nas/models/asr/am/multilingual/16kHz/2024-11-08--jxu-best-rq-pretrain/work/i6_core/tools/git/CloneGitRepositoryJob.LD5f1wKK7LPo/output/returnn/returnn/datasets/distrib_files.py", line 171, in DistributeFilesDataset.__init__               
  line: assert self._num_shards == 1 and self._shard_index == 0, ( # ensure defaults are set                                                                                    
       f"{self}: Cannot use both dataset-sharding via properties _num_shards and _shard index "                                                                                
       f"and {self.__class__.__name__}'s own sharding implementation based on the trainings rank and size."                                                                          
     )                                                                                                                              
  locals:                                                                                                                              
   self = <local> <DistributeFilesDataset 'train' epoch=None>                                                                                                   
   self._num_shards = <local> 8                                                                                                                  
   self._shard_index = <local> 6 

DistributeFilesDataset inherits from CachedDataset2, which in turn inherits from Dataset, so _num_shards should be set to 1 in the __init__ function. I am not sure how self._num_shards ended up being the number of GPUs in my case.
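
For context, a minimal sketch of the expectation described above (simplified, not the actual RETURNN source; only the assert mirrors the one from the traceback):

    # Simplified sketch: the base Dataset sets the shard properties to their defaults,
    # so a freshly constructed DistributeFilesDataset should see _num_shards == 1.

    class Dataset:
        def __init__(self, _num_shards: int = 1, _shard_index: int = 0):
            self._num_shards = _num_shards    # default 1: "no generic sharding"
            self._shard_index = _shard_index  # default 0


    class CachedDataset2(Dataset):
        pass


    class DistributeFilesDataset(CachedDataset2):
        def __init__(self, files, **kwargs):
            super().__init__(**kwargs)
            self.files = files
            # Holds with default construction; in the failing run it does not,
            # because _num_shards=8 / _shard_index=6 arrive via **kwargs.
            assert self._num_shards == 1 and self._shard_index == 0


    ds = DistributeFilesDataset(files=["a.hdf", "b.hdf"])  # passes with defaults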

(cc @NeoLegends, @michelwi)

Judyxujj avatar Jan 20 '25 10:01 Judyxujj

As you told me before, you used a RETURNN version from 2024-07, where it was working fine.

albertz avatar Jan 20 '25 10:01 albertz

What is the dataset config? What is the training config (distributed setting)?

albertz avatar Jan 20 '25 10:01 albertz

> What is the dataset config? What is the training config (distributed setting)?

The dataset config looks like this:

train = {                                                                                                                          
    "buffer_size": 100,                                                                                        
    "class": "DistributeFilesDataset",                                                                                                   
    "distrib_shard_files": True,  
    "get_sub_epoch_dataset": get_sub_epoch_dataset,                                                                             
    "partition_epoch": 200,                                                                                   
    "seq_ordering": "random"
}
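
For reference, a sketch of what a get_sub_epoch_dataset callback typically looks like (illustrative only; the actual callback used in this config is not shown here, and HDFDataset is just an example sub-dataset). DistributeFilesDataset calls it with the subset of files for one sub-epoch and expects a dataset dict back:

    from typing import Any, Dict, List


    def get_sub_epoch_dataset(files_subepoch: List[str]) -> Dict[str, Any]:
        # Build the dataset that covers exactly this sub-epoch's files.
        return {
            "class": "HDFDataset",
            "files": files_subepoch,
            "partition_epoch": 1,
            "seq_ordering": "random",
        }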

Judyxujj avatar Jan 20 '25 13:01 Judyxujj

The issue was likely introduced in https://github.com/rwth-i6/returnn/pull/1630, which added _num_shards and _shard_index as __init__ parameters to Dataset. Parameters with a leading underscore are pickled by Dataset.__reduce__, which means their current values are carried over into the subprocesses. There the assertion in DistributeFilesDataset.__init__ is triggered, because __init__ does not know it is being called as part of unpickling. I believe these properties simply shouldn't be pickled for DistributeFilesDataset. #1676 is going to be a fix, but I can also ship a smaller fix (#1685) before the larger #1676 lands.
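
A sketch of the pickling round-trip described above (simplified; Dataset._create_from_reduce and the ds = cls(**kwargs) call are taken from the traceback, the rest is illustrative and not the actual RETURNN source):

    import pickle


    class Dataset:
        def __init__(self, _num_shards: int = 1, _shard_index: int = 0):
            self._num_shards = _num_shards
            self._shard_index = _shard_index

        def __reduce__(self):
            # Underscore-prefixed init parameters end up in the pickled kwargs,
            # so their *current* values survive into the unpickled copy.
            kwargs = {"_num_shards": self._num_shards, "_shard_index": self._shard_index}
            return Dataset._create_from_reduce, (type(self), kwargs)

        @staticmethod
        def _create_from_reduce(cls, kwargs):
            ds = cls(**kwargs)  # __init__ now sees e.g. _num_shards=8, _shard_index=6
            return ds


    # In the main process the distributed setup has set the shard properties...
    main_ds = Dataset(_num_shards=8, _shard_index=6)
    # ...and the subprocess receives them again via unpickling, so an __init__-time
    # "defaults must be unchanged" check like DistributeFilesDataset's would fire.
    copy_ds = pickle.loads(pickle.dumps(main_ds))
    assert copy_ds._num_shards == 8 and copy_ds._shard_index == 6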

NeoLegends avatar Feb 04 '25 13:02 NeoLegends

I just pushed a simple fix for this. Can you check whether it works now?

albertz avatar Jul 17 '25 09:07 albertz