examples
DistributedDataParallel training speed
Hi, I am training on ImageNet, but there is no big difference in training time when I use 2, 4, or 8 GPUs. I only changed the GPU IDs
and the batch size to make sure GPU memory was fully used. The speed did not increase even though I used more GPUs. Is there something wrong with what I did?
Thanks a lot.
I also trained ResNet50 on the ImageNet dataset in PyTorch and it took 2.85 days for 90 epochs on 1x GPU (GTX 1080 Ti) but 3.25 days on 4x GPUs (GTX 1080 Ti). I followed the instructions: $ python main.py -a resnet50 --dist-url 'tcp://127.0.0.1:FREEPORT' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 [imagenet-folder with train and val folders]
I have the same question. I ran the ImageNet example code but saw no acceleration when using multiple GPUs.
I solved the problem by improving network bandwidth.
I ran a benchmark to test the performance of a Linux workstation. I found that adding GPUs to the task (training on ImageNet) did increase the number of images processed per second: 2 GPUs processed about 1.6x more images per second, and 4 GPUs about 3.5x more. Note that this benchmark measured the average number of images processed per second over the first 100 iterations of training. I am still trying to work out why my earlier results showed that the time to complete 90 epochs did not decrease when more GPUs were assigned to the task.
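A throughput benchmark like the one described (average images/sec over the first N training iterations) can be sketched as below. The names `measure_throughput` and `step_fn` are illustrative, not from the benchmark actually used; the lambda stands in for one forward/backward pass over a batch.

```python
import time

def measure_throughput(step_fn, batch_size, warmup=10, iters=100):
    # Average images/sec over `iters` training iterations, skipping a few
    # warmup steps so one-time setup costs (CUDA init, caching) don't skew it.
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return iters * batch_size / elapsed

# Stand-in for one training step; a real benchmark would run the model here.
throughput = measure_throughput(lambda: time.sleep(0.001), batch_size=256)
print(f"{throughput:.0f} img/s")
```

Comparing this number across 1, 2, and 4 GPU runs gives the scaling factors quoted above without waiting for a full 90-epoch run.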
Try adding "-j 16" to the command. For the ImageNet distributed training example, the default number of dataloader workers is 4; this controls the number of processes used to feed data to the GPUs. With multiple GPUs, that default becomes a bottleneck. On my system with 8 GPUs and 80 CPU cores, increasing the number of dataloader workers to 16 with "-j 16" doubled the training speed of ResNet50, going from 800 img/s to 1600 img/s. Typically, you can gradually increase the worker count and stop when you see no further speedup; a larger worker count requires more CPU cores.
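The idea behind "-j"/num_workers can be sketched with the standard library alone: several workers load and decode samples concurrently so the consumer (the GPU) is not starved waiting on I/O. This is a simplified sketch, not PyTorch's actual DataLoader internals, and `load_sample` is a hypothetical stand-in for disk read plus JPEG decode.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_sample(idx):
    # Simulate I/O-bound work (disk read + image decode) done by one worker.
    time.sleep(0.01)
    return idx * 2  # stand-in for a decoded image

def load_batch(indices, num_workers):
    # Load all samples using `num_workers` concurrent workers; return the
    # batch and the wall-clock time it took to assemble it.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        batch = list(pool.map(load_sample, indices))
    return batch, time.perf_counter() - start

indices = range(64)
batch1, t1 = load_batch(indices, num_workers=1)
batch4, t4 = load_batch(indices, num_workers=4)
print(f"1 worker: {t1:.2f}s, 4 workers: {t4:.2f}s")
```

With 4 workers the same batch is assembled roughly 4x faster, which is why raising the worker count helps until either the CPU cores or the storage bandwidth saturate.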