multigpu_torchrun.py does not show a speedup when training on multiple GPUs
I ran the example from multigpu_torchrun.py, replacing the model with a simple CNN and training it on the MNIST dataset. However, when I increase the number of GPUs on a single node, the training time goes up instead of down.
I have also tried doubling the batch size whenever I double the number of GPUs, but I still see no improvement: the total training time increases instead of decreasing.
Otherwise my code is identical to the one in the repository (a sketch of my modification is included below). I would appreciate any help identifying the issue. Thanks in advance.
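To make the setup concrete, here is a minimal sketch of the kind of change I made, following the structure of multigpu_torchrun.py. It is not my exact code: the SimpleCNN architecture, the hyperparameters, and the "data"/"snapshot.pt" paths are placeholders I chose for illustration.

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
from torchvision import datasets, transforms


class SimpleCNN(nn.Module):
    # Placeholder model; any MNIST-sized CNN fits here.
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.fc = nn.Linear(64 * 7 * 7, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        return self.fc(x.flatten(1))


def main(total_epochs: int, save_every: int, batch_size: int):
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for every process it spawns.
    init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # In practice the data should be downloaded once beforehand so that the
    # four ranks do not all try to download it at the same time.
    dataset = datasets.MNIST(
        "data", train=True, download=True, transform=transforms.ToTensor()
    )
    # DistributedSampler gives each rank its own shard, so batch_size here is
    # the per-GPU batch; the effective global batch is batch_size * world_size.
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        sampler=DistributedSampler(dataset),
        num_workers=2,
        pin_memory=True,
    )

    model = DDP(SimpleCNN().to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(total_epochs):
        loader.sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for images, labels in loader:
            images, labels = images.to(local_rank), labels.to(local_rank)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), labels)
            loss.backward()
            optimizer.step()
        # Single node, so local rank 0 is also the global rank 0.
        if local_rank == 0 and (epoch + 1) % save_every == 0:
            torch.save(model.module.state_dict(), "snapshot.pt")

    destroy_process_group()


if __name__ == "__main__":
    main(total_epochs=50, save_every=5, batch_size=64)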
Here is the content of the Slurm batch file I am using:
#SBATCH --job-name=4gp
#SBATCH --output=pytorch-DP-%j-%u-4gpu-64-slurm.out
#SBATCH --error=pytorch-DP-%j-%u-4gpu-64-slurm.err
#SBATCH --mem=24G                     # Job memory request
#SBATCH --gres=gpu:4                  # Number of requested GPU(s)
#SBATCH --time=3-23:00:00             # Time limit days-hrs:min:sec
#SBATCH --constraint=rtx_6000         # Specific hardware constraint
nvidia-smi
torchrun --nnodes=1 --nproc_per_node=4 main_ddp.py 50 5 --batch_size 64
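To confirm that this launch line really splits the work across the four GPUs, I can run a small probe under the same torchrun command. This is a diagnostic sketch I put together, not part of the repository example; it assumes the MNIST data is already in ./data from the training run. Each rank should report world_size=4 and roughly one quarter of the dataset.

import os
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets

# gloo is enough for this check; torchrun supplies the rendezvous env vars.
dist.init_process_group(backend="gloo")
rank, world_size = dist.get_rank(), dist.get_world_size()

dataset = datasets.MNIST("data", train=True, download=False)
sampler = DistributedSampler(dataset)

print(f"rank {rank}/{world_size}: {len(sampler)} samples per epoch, "
      f"so a per-GPU batch of 64 means a global batch of {64 * world_size}")

dist.destroy_process_group()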