multigpu_torchrun.py does not show a speedup when training on multiple GPUs
I ran the example from multigpu_torchrun.py, replacing the model with a simple CNN and training it on the MNIST dataset. However, when I increase the number of GPUs on a single node, the training time goes up instead of down.
I have also tried doubling the batch size whenever I double the number of GPUs, but I still see no improvement: the total training time increases instead of decreasing.
Otherwise my code is identical to the one in the repository (a sketch of my modification is included below). I would appreciate any help identifying the issue. Thanks in advance.
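To make the setup concrete, here is a minimal sketch of the kind of change I made, following the structure of multigpu_torchrun.py. It is not my exact code: the SimpleCNN architecture, the hyperparameters, and the "data"/"snapshot.pt" paths are placeholders I chose for illustration.

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
from torchvision import datasets, transforms


class SimpleCNN(nn.Module):
    # Placeholder model; any MNIST-sized CNN fits here.
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.fc = nn.Linear(64 * 7 * 7, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        return self.fc(x.flatten(1))


def main(total_epochs: int, save_every: int, batch_size: int):
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for every process it spawns.
    init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # In practice the data should be downloaded once beforehand so that the
    # four ranks do not all try to download it at the same time.
    dataset = datasets.MNIST(
        "data", train=True, download=True, transform=transforms.ToTensor()
    )
    # DistributedSampler gives each rank its own shard, so batch_size here is
    # the per-GPU batch; the effective global batch is batch_size * world_size.
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        sampler=DistributedSampler(dataset),
        num_workers=2,
        pin_memory=True,
    )

    model = DDP(SimpleCNN().to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(total_epochs):
        loader.sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for images, labels in loader:
            images, labels = images.to(local_rank), labels.to(local_rank)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), labels)
            loss.backward()
            optimizer.step()
        # Single node, so local rank 0 is also the global rank 0.
        if local_rank == 0 and (epoch + 1) % save_every == 0:
            torch.save(model.module.state_dict(), "snapshot.pt")

    destroy_process_group()


if __name__ == "__main__":
    main(total_epochs=50, save_every=5, batch_size=64)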
Here is the content of the Slurm batch file I am using:
#SBATCH --job-name=4gp
#SBATCH --output=pytorch-DP-%j-%u-4gpu-64-slurm.out
#SBATCH --error=pytorch-DP-%j-%u-4gpu-64-slurm.err
#SBATCH --mem=24G                     # Job memory request
#SBATCH --gres=gpu:4                  # Number of requested GPU(s)
#SBATCH --time=3-23:00:00             # Time limit days-hrs:min:sec
#SBATCH --constraint=rtx_6000         # Specific hardware constraint
nvidia-smi
torchrun --nnodes=1 --nproc_per_node=4 main_ddp.py 50 5 --batch_size 64
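To confirm that this launch line really splits the work across the four GPUs, I can run a small probe under the same torchrun command. This is a diagnostic sketch I put together, not part of the repository example; it assumes the MNIST data is already in ./data from the training run. Each rank should report world_size=4 and roughly one quarter of the dataset.

import os
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets

# gloo is enough for this check; torchrun supplies the rendezvous env vars.
dist.init_process_group(backend="gloo")
rank, world_size = dist.get_rank(), dist.get_world_size()

dataset = datasets.MNIST("data", train=True, download=False)
sampler = DistributedSampler(dataset)

print(f"rank {rank}/{world_size}: {len(sampler)} samples per epoch, "
      f"so a per-GPU batch of 64 means a global batch of {64 * world_size}")

dist.destroy_process_group()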