SLAM-LLM
DeepSpeed training dataloader does not have a sampler
System Info
- torch 2.0.1
- torchaudio 2.0.2
- torchvision 0.15.2
Information
- [ ] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
When training with DeepSpeed, the total number of steps per epoch is N times larger than with DDP training under the same configuration (where N is the number of GPUs). Printing the dataloader configuration shows that the DeepSpeed dataloader has no sampler.

DDP:

```
{'sampler': <torch.utils.data.distributed.DistributedSampler object at 0x7fc99032c640>, 'batch_size': 6, 'drop_last': True, 'collate_fn': <bound method SpeechDatasetJsonl.collator of <speech_dataset.py.SpeechDatasetJsonl object at 0x7fc275f34130>>}
```

DeepSpeed:

```
{'batch_size': 6, 'drop_last': True, 'collate_fn': <bound method SpeechDatasetJsonl.collator of <speech_dataset.py.SpeechDatasetJsonl object at 0x7fbee2e324c0>>}
```

Without a DistributedSampler, every rank likely reads exactly the same data.
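For reference, here is a minimal sketch of what attaching a DistributedSampler on the DeepSpeed path could look like, mirroring what the DDP code path already does. I have not checked exactly where SLAM-LLM builds its dataloader, so `build_train_dataloader` is a hypothetical helper; the `batch_size`, `drop_last`, and `collate_fn` values just mirror the configs printed above.

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_train_dataloader(dataset, batch_size, collate_fn):
    # Hypothetical helper: shard the dataset across ranks so each GPU
    # sees a distinct slice, as the DDP path does via DistributedSampler.
    sampler = DistributedSampler(
        dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
        shuffle=True,
    )
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,  # without this, every rank iterates the full dataset
        drop_last=True,
        collate_fn=collate_fn,
    )
```

With a sampler in place, the training loop would also need `sampler.set_epoch(epoch)` at the start of each epoch so the shuffle order differs between epochs.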
Error logs
See the description above; no error is thrown, the step count per epoch is simply N times larger than expected.
Expected behavior
Thank you for your outstanding work. I hope this problem can be fixed. Could you also share the time required for DeepSpeed and DDP to train one epoch with the default LibriSpeech configuration? Thanks a lot! :D