MagicDance

Bug in multi-GPU training

Open • Pixie8888 opened this issue 1 year ago • 1 comment

Dear Author,

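Thank you for sharing the source code! However, I find that when I launch training with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --master_port 10000 --nproc_per_node 8 train_tiktok.py, every GPU reads the same data in every iteration, so the 8 processes repeat the same batches instead of splitting the dataset between them. I think you should add a distributed sampler for DDP (for example torch.utils.data.distributed.DistributedSampler) to the data loader in your code.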
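To illustrate the kind of change I mean, here is a minimal sketch. It is not taken from train_tiktok.py: the helper names build_loader and indices_seen and the toy dataset are hypothetical, and rank and world size are passed explicitly so the snippet runs on CPU without a process group. In a real torchrun job they would normally come from torch.distributed after init_process_group, and DistributedSampler reads them by default when they are not given.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def build_loader(dataset, rank, world_size, batch_size=2, seed=0):
    """Hypothetical helper: return a loader that yields only this rank's shard."""
    sampler = DistributedSampler(
        dataset,
        num_replicas=world_size,  # total number of DDP processes
        rank=rank,                # index of the current process
        shuffle=True,
        seed=seed,                # must be identical on every rank
    )
    # Do not also pass shuffle=True here; the sampler owns the shuffling.
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return loader, sampler


def indices_seen(dataset, rank, world_size, epoch):
    """Collect the sample values one rank reads during one epoch."""
    loader, sampler = build_loader(dataset, rank, world_size)
    # Call set_epoch at the start of each epoch so the shuffle order changes
    # between epochs while staying consistent across ranks.
    sampler.set_epoch(epoch)
    return [int(x) for (batch,) in loader for x in batch]


# Toy dataset whose items equal their own index (assumption for the demo).
toy = TensorDataset(torch.arange(16))
world_size = 4
shards = [indices_seen(toy, r, world_size, epoch=0) for r in range(world_size)]
```

With a sampler like this, each process reads a disjoint quarter of the toy dataset, whereas a plain DataLoader built identically in every process would give all ranks the same samples.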

Pixie8888 • Nov 23 '24 18:11