
DeepSpeed training DataLoader does not have a sampler

Open lzl-mt opened this issue 9 months ago • 0 comments

System Info

torch 2.0.1, torchaudio 2.0.2, torchvision 0.15.2

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

🐛 Describe the bug

When training with DeepSpeed, the total number of steps per epoch is N times higher than with DDP under the same configuration (N being the number of cards). Printing the DataLoader configuration shows that no sampler is set:

DDP: {'sampler': <torch.utils.data.distributed.DistributedSampler object at 0x7fc99032c640>, 'batch_size': 6, 'drop_last': True, 'collate_fn': <bound method SpeechDatasetJsonl.collator of <speech_dataset.py.SpeechDatasetJsonl object at 0x7fc275f34130>>}

Deepspeed: {'batch_size': 6, 'drop_last': True, 'collate_fn': <bound method SpeechDatasetJsonl.collator of <speech_dataset.py.SpeechDatasetJsonl object at 0x7fbee2e324c0>>}

This likely means every card reads exactly the same data.
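For reference, a minimal sketch of how a DistributedSampler could be attached when distributed training is initialized. The `build_dataloader` helper and its arguments (`dataset`, `batch_size`, `collate_fn`) are placeholders standing in for SpeechDatasetJsonl and its collator, not SLAM-LLM's actual dataloader code:

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def build_dataloader(dataset, batch_size, collate_fn):
    sampler = None
    if dist.is_available() and dist.is_initialized():
        # Without this sampler every rank iterates over the full dataset,
        # so steps per epoch grow by world_size and all GPUs see identical batches.
        sampler = DistributedSampler(
            dataset,
            num_replicas=dist.get_world_size(),
            rank=dist.get_rank(),
            shuffle=True,
            drop_last=True,
        )
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        shuffle=(sampler is None),  # sampler and shuffle are mutually exclusive
        drop_last=True,
        collate_fn=collate_fn,
    )
```

With a DistributedSampler in place, the training loop would also need to call `sampler.set_epoch(epoch)` at the start of each epoch so the shuffling differs across epochs.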

Error logs

Same as above.

Expected behavior

Thank you for your outstanding work. I hope this problem can be fixed, and that you could share the time required for DeepSpeed and DDP to train one epoch with the default Librispeech configuration. Thanks a lot! :D

lzl-mt avatar May 28 '24 13:05 lzl-mt