
Trainer using only one GPU instead of two

oussaidene opened this issue on Mar 29 '24 · 5 comments

System Info

transformers version: 4.26.0
Python version: 3.8.8
PyTorch version: 1.9.0+cu102

Who can help?

trainer: @sgugger, @muellerzr and @pacman100

Reproduction

I am trying to train a T5 model using two GPUs, but for some reason the Trainer only uses one.

In my bash file I specified the number of GPUs I want to use like this:

#SBATCH --gres=gpu:2

and in my code I added this:

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"

But when I check the number of GPUs in the training arguments, I always get 1:

print("build trainer with on device:", training_args.device, "with n gpus:", training_args.n_gpu)

The output:

build trainer with on device: cuda:0 with n gpus: 1
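As a quick sanity check, it can help to ask PyTorch directly how many devices the process can see before the Trainer is built. A minimal sketch (note that CUDA_VISIBLE_DEVICES must be set before the first CUDA call):

import torch

# Number of CUDA devices visible to this process; under SLURM this is the
# set granted by --gres, further filtered by CUDA_VISIBLE_DEVICES.
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))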

Expected behavior

I want to use all the available GPUs.

oussaidene · Mar 29 '24

How are you launching the Python script in your bash script?

muellerzr · Mar 29 '24

This is the content of my bash script:

#!/bin/sh
# SBATCH options:

#SBATCH --job-name=DSI_gpu              # Job name
#SBATCH --mail-type=END                 # Email notification
#SBATCH [email protected]

#SBATCH --ntasks=1                      # Number of parallel tasks
#SBATCH --cpus-per-task=4
#SBATCH --partition=GPUNodes            # Partition
#SBATCH --gres=gpu:2
#SBATCH --gres-flags=enforce-binding

# Processing


module purge
module load singularity/3.0.3

srun singularity exec /logiciels/containerCollections/CUDA11/pytorch-NGC-21-03-py3.sif $HOME/dr_env/bin/python3.8 "path/to/python/script.py"


oussaidene · Mar 29 '24

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Apr 29 '24

You're launching with python. You should use either accelerate launch or torch.distributed.run; otherwise you'll get model parallelism (which isn't what you're aiming for).

muellerzr · Apr 29 '24
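For reference, a minimal sketch of what the srun line from the script above could look like when routed through torch.distributed.run, assuming both GPUs sit on a single node (container and interpreter paths are taken from the original script; the script path placeholder is unchanged):

srun singularity exec /logiciels/containerCollections/CUDA11/pytorch-NGC-21-03-py3.sif \
    $HOME/dr_env/bin/python3.8 -m torch.distributed.run --nproc_per_node=2 "path/to/python/script.py"

With Accelerate installed in the same environment, accelerate launch --num_processes 2 would play the equivalent role; either launcher spawns one process per GPU so the Trainer runs distributed data parallel instead of splitting the model across devices.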
