
DistributedDataParallel training speed

ElegantLin opened this issue 5 years ago • 5 comments

Hi, I am training on ImageNet, but I see no big difference in training time whether I use 2, 4, or 8 GPUs. I only changed the GPU IDs and the batch size so that GPU memory was fully used. The speed did not increase even though I used more GPUs. Is there something wrong with what I did?
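For reference, this is roughly how I am checking per-GPU compute throughput on synthetic data (a rough sketch of my own, not code from the example; the ResNet-50 model, batch size of 256, and iteration counts are just what I happened to pick). If this kind of synthetic throughput scales with GPU count but end-to-end training does not, the bottleneck is presumably data loading or the interconnect rather than the GPUs:

    # Rough single-GPU throughput check on synthetic data (not part of main.py);
    # batch size and iteration counts are arbitrary choices.
    import time
    import torch
    import torchvision.models as models

    device = torch.device("cuda:0")
    model = models.resnet50().to(device)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    batch_size = 256
    images = torch.randn(batch_size, 3, 224, 224, device=device)
    targets = torch.randint(0, 1000, (batch_size,), device=device)

    def step():
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()

    for _ in range(5):   # warm-up iterations
        step()
    torch.cuda.synchronize()

    iters = 50
    start = time.time()
    for _ in range(iters):
        step()
    torch.cuda.synchronize()
    elapsed = time.time() - start
    print(f"synthetic throughput: {iters * batch_size / elapsed:.1f} img/s")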

Thanks a lot.

ElegantLin avatar Aug 10 '19 16:08 ElegantLin

I also trained on the ImageNet dataset using ResNet-50 in PyTorch, and it took 2.85 days for 90 epochs on 1 GPU (GTX 1080 Ti) but 3.25 days on 4 GPUs (GTX 1080 Ti). I also followed the instructions:

    $ python main.py -a resnet50 --dist-url 'tcp://127.0.0.1:FREEPORT' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 [imagenet-folder with train and val folders]
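For what it's worth, my understanding is that --multiprocessing-distributed spawns one process per GPU and wraps the model in DistributedDataParallel over the NCCL backend. A stripped-down sketch of that setup (paraphrased, not the actual main.py code; the port number and per-GPU batch size below are placeholders I picked):

    # Stripped-down sketch of the one-process-per-GPU DDP setup that
    # --multiprocessing-distributed implies (paraphrased, not the exact
    # main.py code; port number and batch size are placeholders).
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torchvision.models as models

    def worker(gpu, ngpus, per_gpu_batch):
        dist.init_process_group(
            backend="nccl",
            init_method="tcp://127.0.0.1:23456",  # any free port
            world_size=ngpus,
            rank=gpu,
        )
        torch.cuda.set_device(gpu)
        model = models.resnet50().cuda(gpu)
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
        # Dataset, DistributedSampler, DataLoader and the training loop go here;
        # each rank processes per_gpu_batch images per step, so the global batch
        # is per_gpu_batch * ngpus.
        dist.destroy_process_group()

    if __name__ == "__main__":
        ngpus = torch.cuda.device_count()
        mp.spawn(worker, nprocs=ngpus, args=(ngpus, 256 // ngpus))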

atunick avatar Nov 19 '19 17:11 atunick

I have the same question. I ran the imagenet code but found no acceleration when using multiple GPUs.

ypengc7512 avatar Dec 11 '19 11:12 ypengc7512

I solved the problem by improving network bandwidth.

ElegantLin avatar Dec 17 '19 09:12 ElegantLin

I ran a benchmark to test the performance of a Linux workstation. I found that adding GPUs to the task (training ImageNet) did increase the number of images processed per second: 2 GPUs processed about 1.6x as many images per second, and 4 GPUs about 3.5x as many. Note that this benchmark measured the average number of images processed per second over only the first 100 iterations of training. I am still trying to work out why my earlier results showed that the time to complete 90 epochs did not decrease when more GPUs were assigned to the task.
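One way I am checking whether the first-100-iteration numbers hold up over a longer run is to time data loading and GPU compute separately, in the spirit of the data_time / batch_time meters in the example's main.py. A rough sketch (the function, loop body, and iteration cap are my own simplifications, not the example's code):

    # Rough sketch: separate time spent waiting on the DataLoader from time
    # spent in forward/backward/step. If the data-wait time dominates once the
    # prefetch queue drains, more GPUs will not shorten the 90-epoch run.
    import time
    import torch

    def timed_iters(loader, model, criterion, optimizer, device, max_iters=500):
        model.train()
        data_time = compute_time = 0.0
        n_images = 0
        end = time.time()
        for i, (images, targets) in enumerate(loader):
            t0 = time.time()
            data_time += t0 - end            # waiting on the DataLoader
            images = images.to(device, non_blocking=True)
            targets = targets.to(device, non_blocking=True)
            optimizer.zero_grad()
            criterion(model(images), targets).backward()
            optimizer.step()
            torch.cuda.synchronize()         # so timing reflects real GPU work
            end = time.time()
            compute_time += end - t0         # forward/backward/step
            n_images += images.size(0)
            if i + 1 >= max_iters:
                break
        print(f"data wait {data_time:.1f}s, compute {compute_time:.1f}s, "
              f"{n_images / (data_time + compute_time):.0f} img/s")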

atunick avatar Dec 17 '19 16:12 atunick

Try adding "-j 16" to the command. For the ImageNet distributed training example, the default number of dataloader workers is 4; this controls the number of processes used to feed data to the GPUs. With multiple GPUs, that default becomes a bottleneck. On my system with 8 GPUs and 80 CPU cores, increasing the number of dataloader workers to 16 with "-j 16" doubled the ResNet-50 training speed, from 800 img/s to 1600 img/s. In general, you can gradually increase the worker count and stop when you see no further speedup; a larger worker count requires more CPU cores.
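If you want to pick the value empirically, one option is to time the DataLoader on its own for a few worker counts (the -j flag maps to num_workers). A quick sketch; the ImageNet path, batch size, and worker counts below are placeholders:

    # Hypothetical scan of DataLoader worker counts; the GPU can never train
    # faster than the loader alone can yield batches, so pick the smallest
    # num_workers that saturates this rate. Path and batch size are placeholders.
    import time
    import torch
    import torchvision.datasets as datasets
    import torchvision.transforms as transforms

    traindir = "/path/to/imagenet/train"   # placeholder ImageNet train folder
    dataset = datasets.ImageFolder(
        traindir,
        transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
        ]),
    )

    batch_size, n_batches = 256, 50
    for workers in (4, 8, 16, 32):
        loader = torch.utils.data.DataLoader(
            dataset, batch_size=batch_size, shuffle=True,
            num_workers=workers, pin_memory=True)
        it = iter(loader)
        next(it)                            # warm up the worker processes
        start = time.time()
        for _ in range(n_batches):
            next(it)
        rate = n_batches * batch_size / (time.time() - start)
        print(f"num_workers={workers}: {rate:.0f} img/s from the loader alone")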

sevenquarkoniums avatar Oct 11 '22 11:10 sevenquarkoniums