PyTorch-StudioGAN

Using 4 GPUs for training takes the same time as using just 1

MiguelCosta94 opened this issue 1 year ago · 1 comment

I'm training a BigGAN with differentiable augmentation (DiffAug) and the LeCam regularizer on a custom dataset. My setup has 4 NVIDIA RTX 3070 GPUs and runs Ubuntu 20.04. Training on all 4 GPUs with Distributed Data Parallel takes roughly the same wall-clock time as training on a single GPU. Am I doing something wrong?

For training on a single GPU, I'm using the following command:

```
CUDA_VISIBLE_DEVICES=0 python3 src/main.py -t -hdf5 -l -std_stat -std_max 64 -std_step 64 \
    -metrics fid is prdc -ref "train" -cfg src/configs/VWW/BigGAN-DiffAug-LeCam.yaml \
    -data ../Datasets/vw_coco2014_96_GAN -save SAVE_PATH_VWW -mpc \
    --post_resizer "friendly" --eval_backbone "InceptionV3_tf"
```

For training on the 4 GPUs, I'm using the following commands:

```
export MASTER_ADDR=localhost
export MASTER_PORT=1234
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 src/main.py -t -DDP -tn 1 -cn 0 -std_stat -std_max 64 -std_step 64 \
    -metrics fid is prdc -ref "train" -cfg src/configs/VWW/BigGAN-DiffAug-LeCam.yaml \
    -data ../Datasets/vw_coco2014_96_GAN -save SAVE_PATH_VWW -mpc \
    --post_resizer "friendly" --eval_backbone "InceptionV3_tf"
```
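In case it helps with debugging, here is a minimal sanity-check sketch (not part of StudioGAN; it assumes a DDP process group has already been initialized by the launcher) to confirm that all four GPUs are visible and that one process per GPU is actually running:

```python
# Minimal sanity check, assuming a DDP process group has already been
# initialized (e.g. by the training launcher). Illustrative only.
import torch
import torch.distributed as dist

print("Visible CUDA devices:", torch.cuda.device_count())  # expect 4

if dist.is_available() and dist.is_initialized():
    # With one process per GPU, world_size should also be 4.
    print("DDP world size:", dist.get_world_size())
    print("This process rank:", dist.get_rank())
```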

MiguelCosta94 avatar Dec 05 '23 19:12 MiguelCosta94

Could you please check the batch size used for training?

If you are using 1 GPU with a batch size of 256, it is advisable to switch to 4 GPUs with a batch size of 64 per GPU in order to accelerate training. Do not keep a batch size of 256 on each GPU if your goal is faster training: each GPU then does the same per-step work as before, so the wall-clock time per step stays roughly the same.
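To illustrate the idea, here is a minimal sketch (not StudioGAN's actual data pipeline; the dataset and names below are made up for the example) of how a global batch of 256 maps to a per-GPU batch of 64 under DDP with 4 processes:

```python
# Illustrative only: shows how a global batch of 256 is split into a
# per-GPU batch of 64 when 4 DDP processes are running.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

GLOBAL_BATCH = 256
world_size = dist.get_world_size() if dist.is_initialized() else 1
per_gpu_batch = GLOBAL_BATCH // world_size  # 64 when world_size == 4

dataset = TensorDataset(torch.randn(10_000, 3, 96, 96))  # dummy data
sampler = DistributedSampler(dataset) if dist.is_initialized() else None
loader = DataLoader(dataset, batch_size=per_gpu_batch, sampler=sampler)

# Each of the 4 processes now loads 64 samples per step, so the effective
# (global) batch per optimizer step is still 256, but the per-step work is
# split across GPUs, which is where the speedup comes from.
```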

mingukkang avatar Jan 31 '24 07:01 mingukkang