pytorch-CycleGAN-and-pix2pix

Multi GPU speed ?

Open gabgren opened this issue 3 years ago • 8 comments

Hi!

I was under the assumption that using multiple GPUs to train pix2pix would result in faster training, but that is not what I am experiencing. In fact I get slower speeds; the best I can do is keep the s/it roughly the same as with 1 GPU.

For testing, I was using batch_size 8 for a single GPU and batch_size 64 for 8 GPUs. Tests were done on 8x A6000 and 8x 3090. I have also tried setting norm to both instance and batch, with no effect.

What am I doing wrong, or misunderstanding? Am I right to expect faster training with more GPUs, or is it that by using multiple gpu_ids I can train at a higher resolution instead?

Thanks !

gabgren avatar Jun 10 '22 19:06 gabgren

Could you check whether the GPU utilization is at 100%? The slowdown could be because the data loader does not feed training images fast enough. Another possibility is that progress in terms of the total number of images used for training is actually faster with more GPUs, but since each iteration now processes a larger batch, the number of iterations per second won't look any different.
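
For reference, since the progress bar reports seconds per iteration rather than images per second, a quick way to compare runs with different batch sizes is to fold the batch size in. This is just a back-of-the-envelope helper; the timing numbers below are hypothetical, plug in your own measurements:

```python
# Convert the progress bar's s/it into images/second so that runs with
# different batch sizes can be compared fairly. Example values are hypothetical.
def images_per_second(batch_size: int, seconds_per_iteration: float) -> float:
    return batch_size / seconds_per_iteration

# Example: 1 GPU at batch_size 8 vs. 8 GPUs at batch_size 64, both at 0.5 s/it.
single_gpu = images_per_second(batch_size=8, seconds_per_iteration=0.5)
multi_gpu = images_per_second(batch_size=64, seconds_per_iteration=0.5)
print(f"1 GPU : {single_gpu:.1f} img/s")   # 16.0 img/s
print(f"8 GPUs: {multi_gpu:.1f} img/s")    # 128.0 img/s
```

If the images/second figure scales with the GPU count even while s/it stays flat, the extra GPUs are in fact helping.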

taesungp avatar Jun 14 '22 19:06 taesungp

Looks like it's your first theory: it takes a long time to feed the 8 GPUs. The actual processing seems to be faster, but it stalls between iterations. See this comparison of GPU utilization for 1x A6000 vs 8x A6000: [screenshots: 1 GPU vs 8 GPUs utilization]

How can I speed this up?

gabgren avatar Jun 29 '22 15:06 gabgren

It might be a data-loading issue. You may want to use an SSD or another fast file system.
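
If I recall correctly, the data loader's worker count is exposed through the --num_threads option in this repo (worth double-checking in base_options.py for your version). Underneath it is just PyTorch's DataLoader, so the general knobs look like the sketch below. This is a generic example, not the repo's actual dataset class; RandomImageDataset is a placeholder:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class RandomImageDataset(Dataset):
    """Placeholder dataset that yields random 256x256 RGB tensors."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return torch.rand(3, 256, 256)

# The knobs that usually matter when several GPUs are starved for data:
loader = DataLoader(
    RandomImageDataset(),
    batch_size=64,
    shuffle=True,
    num_workers=8,            # decode/load images in parallel worker processes
    pin_memory=True,          # page-locked memory speeds up CPU->GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)

for batch in loader:
    pass  # training step would go here
```

With a large multi-GPU batch, raising the worker count (and keeping the images on fast local storage) is usually the first thing to try.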

junyanz avatar Jul 04 '22 17:07 junyanz

I have 4 GPUs and want to use all 4 of them at the same time to accelerate training. How can I modify the code? At present it only trains on one GPU and the training speed is very slow. Thank you!

malinjie-hub avatar Aug 11 '22 07:08 malinjie-hub

@gabgren I have 4 GPUs and want to use all 4 of them at the same time to accelerate training. How can I modify the code? At present it only trains on one GPU and the training speed is very slow; --gpu_ids 0,1,2,3 does not work. Thank you!

icelandno1 avatar Aug 29 '22 09:08 icelandno1

What is your batch_size? By mentioning "does not work", are you referring to (1) the model is only trained on one GPU, or (2) the model is trained on multiple GPUs, but the training speed is not as fast as you expect?

junyanz avatar Aug 30 '22 20:08 junyanz

@junyanz batch_size is 4. Even after passing --gpu_ids 0,1,2,3, the model is only trained on one GPU.

icelandno1 avatar Aug 31 '22 02:08 icelandno1

This could be because of the limitations of nn.DataParallel, which we use here and which was a common approach when we published the repo. It does suffer from suboptimal GPU utilization because the data loading is inefficient. A better approach would be torch.nn.parallel.DistributedDataParallel (see the PyTorch documentation). We don't plan to support this for now, but if someone could create a PR, I'd appreciate it.
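
For anyone who wants to experiment before such a PR exists, a minimal single-node DistributedDataParallel setup looks roughly like the sketch below. This is generic PyTorch, not this repo's code; the model, dataset, and hyperparameters are placeholders you would swap for the pix2pix/CycleGAN ones:

```python
# Minimal single-node DDP sketch (generic PyTorch, not this repo's code).
# Launch with: torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, Dataset, DistributedSampler


class RandomImageDataset(Dataset):
    """Placeholder dataset yielding random 256x256 RGB tensors."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return torch.rand(3, 256, 256)


def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; the real generator/discriminator would go here.
    model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = RandomImageDataset()
    sampler = DistributedSampler(dataset)      # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=8, sampler=sampler,
                        num_workers=4, pin_memory=True)

    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    for epoch in range(2):
        sampler.set_epoch(epoch)               # reshuffle the shards each epoch
        for images in loader:
            images = images.cuda(local_rank, non_blocking=True)
            loss = model(images).abs().mean()  # dummy loss for the sketch
            optimizer.zero_grad()
            loss.backward()                    # DDP all-reduces gradients across ranks
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Each process drives one GPU, so batch_size here is per-GPU, and each process's data loader only has to feed a single device, which avoids the single-process data-loading bottleneck that nn.DataParallel runs into.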

taesungp avatar Sep 06 '22 20:09 taesungp