
Trainer using only one GPU instead of two

oussaidene opened this issue on Mar 29 '24 · 5 comments

System Info

transformers version: 4.26.0
Python version: 3.8.8
PyTorch version: 1.9.0+cu102

Who can help?

trainer: @sgugger, @muellerzr and @pacman100

Reproduction

I am trying to train a T5 model using two GPUs, but for some reason the Trainer only uses one.

In my bash file I specified the number of GPUs I want to use like this:

#SBATCH --gres=gpu:2

and in my code I added this:

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"

But when I check the number of GPUs in the training arguments, I always get 1:

print("build trainer with on device:", training_args.device, "with n gpus:", training_args.n_gpu)

The output:

build trainer with on device: cuda:0 with n gpus: 1
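As a quick sanity check, it can help to ask PyTorch directly how many devices the process can see before the Trainer is built. A minimal sketch (note that CUDA_VISIBLE_DEVICES must be set before the first CUDA call):

import torch

# Number of CUDA devices visible to this process; under SLURM this is the
# set granted by --gres, further filtered by CUDA_VISIBLE_DEVICES.
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))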

Expected behavior

I want to use all the available GPUs.

oussaidene · Mar 29 '24

How are you launching the Python script in your bash script?

muellerzr · Mar 29 '24

This is the content of my bash script:

#!/bin/sh
# SBATCH options:

#SBATCH --job-name=DSI_gpu              # Job name
#SBATCH --mail-type=END                 # Email notification
#SBATCH [email protected]

#SBATCH --ntasks=1                      # Number of parallel tasks
#SBATCH --cpus-per-task=4
#SBATCH --partition=GPUNodes            # Partition
#SBATCH --gres=gpu:2
#SBATCH --gres-flags=enforce-binding

# Processing


module purge
module load singularity/3.0.3

srun singularity exec /logiciels/containerCollections/CUDA11/pytorch-NGC-21-03-py3.sif $HOME/dr_env/bin/python3.8 "path/to/python/script.py"


oussaidene · Mar 29 '24

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Apr 29 '24

You're launching with python. You should use either accelerate launch or torch.distributed.run; otherwise you'll get model parallelism (which isn't what you're aiming for).

muellerzr · Apr 29 '24
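For reference, a minimal sketch of what the srun line from the script above could look like when routed through torch.distributed.run, assuming both GPUs sit on a single node (container and interpreter paths are taken from the original script; the script path placeholder is unchanged):

srun singularity exec /logiciels/containerCollections/CUDA11/pytorch-NGC-21-03-py3.sif \
    $HOME/dr_env/bin/python3.8 -m torch.distributed.run --nproc_per_node=2 "path/to/python/script.py"

With Accelerate installed in the same environment, accelerate launch --num_processes 2 would play the equivalent role; either launcher spawns one process per GPU so the Trainer runs distributed data parallel instead of splitting the model across devices.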
