SLAM-LLM
DeepSpeed training dataloader does not have a sampler
System Info
- torch 2.0.1
- torchaudio 2.0.2
- torchvision 0.15.2
Information
- [ ] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
When training with DeepSpeed, the total number of steps per epoch is N times larger than with DDP training under the same configuration (where N is the number of GPUs). Printing the dataloader configuration shows that the DeepSpeed dataloader has no sampler.

DDP:

```
{'sampler': <torch.utils.data.distributed.DistributedSampler object at 0x7fc99032c640>, 'batch_size': 6, 'drop_last': True, 'collate_fn': <bound method SpeechDatasetJsonl.collator of <speech_dataset.py.SpeechDatasetJsonl object at 0x7fc275f34130>>}
```

DeepSpeed:

```
{'batch_size': 6, 'drop_last': True, 'collate_fn': <bound method SpeechDatasetJsonl.collator of <speech_dataset.py.SpeechDatasetJsonl object at 0x7fbee2e324c0>>}
```

Without a DistributedSampler, every rank likely reads exactly the same data.
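For reference, here is a minimal sketch of what attaching a DistributedSampler on the DeepSpeed path could look like, mirroring what the DDP code path already does. I have not checked exactly where SLAM-LLM builds its dataloader, so `build_train_dataloader` is a hypothetical helper; the `batch_size`, `drop_last`, and `collate_fn` values just mirror the configs printed above.

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_train_dataloader(dataset, batch_size, collate_fn):
    # Hypothetical helper: shard the dataset across ranks so each GPU
    # sees a distinct slice, as the DDP path does via DistributedSampler.
    sampler = DistributedSampler(
        dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
        shuffle=True,
    )
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,  # without this, every rank iterates the full dataset
        drop_last=True,
        collate_fn=collate_fn,
    )
```

With a sampler in place, the training loop would also need `sampler.set_epoch(epoch)` at the start of each epoch so the shuffle order differs between epochs.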
Error logs
See the description above; no error is thrown, the step count per epoch is simply N times larger than expected.
Expected behavior
Thank you for your outstanding work. I hope this problem can be fixed. Could you also share the time required for DeepSpeed and DDP to train one epoch with the default LibriSpeech configuration? Thanks a lot! :D