OpenNMT-py

Training Stuck at start: Training on 2 GPUs

Open i55code opened this issue 4 years ago • 8 comments

Hi all, I have a machine with 2 GPUs, and I train an experiment using:

CUDA_VISIBLE_DEVICES=0,1 onmt_train \
    ... \
    -world_size 2 \
    -gpu_ranks 0 1 \

Previously, training an experiment on two GPUs worked fine. However, starting today, training gets stuck right at the start:

[2021-01-04 18:01:31,220 INFO] Starting training on GPU: [0, 1]
[2021-01-04 18:01:31,220 INFO] Start training loop and validate every 10000 steps...

(after this there are no further updates)

Could you share some insight into how to fix this?

i55code avatar Jan 04 '21 23:01 i55code

Hi, I tested it again today and want to update this thread with more information. Today I launched the same script to train on 2 GPUs and it worked. With the same script, training on 2 GPUs sometimes works and sometimes does not. I would really appreciate some feedback on what could have gone wrong. Thanks!

i55code avatar Jan 05 '21 15:01 i55code

I'm not sure what could be happening here. Maybe some strange NCCL behavior. We had some cases in the past where training would get stuck (though not right at the beginning).

francoishernandez avatar Jan 05 '21 16:01 francoishernandez
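
If NCCL is the suspect, one way to gather more evidence is to rerun the same command with verbose NCCL logging. The following is only a sketch: `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are standard NCCL environment variables, but the helper names `nccl_debug_env` and `run_with_nccl_debug` are hypothetical, and disabling peer-to-peer transfers is something to try, not a confirmed fix for this issue.

```python
import os
import subprocess
import sys


def nccl_debug_env(base_env=None, visible_devices="0,1", disable_p2p=False):
    """Return a copy of the environment with verbose NCCL logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = visible_devices
    # NCCL_DEBUG=INFO makes NCCL log its initialization and transport choices.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Optional experiment (an assumption, not a known fix for this thread):
        # turn off direct GPU-to-GPU transfers to see whether the hang goes away.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


def run_with_nccl_debug(command, **env_options):
    """Run `command` (a list of arguments) with the NCCL debug environment.

    Returns the exit code of the child process.
    """
    completed = subprocess.run(command, env=nccl_debug_env(**env_options))
    return completed.returncode


# Example call, reusing the flags from the original post (other arguments elided there):
# run_with_nccl_debug(["onmt_train", ..., "-world_size", "2", "-gpu_ranks", "0", "1"])
```

If the run hangs again, the last NCCL lines printed before the stall are worth attaching to the issue.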

Also, which version of pytorch are you using?

francoishernandez avatar Jan 05 '21 16:01 francoishernandez
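
For anyone answering the version question, a small script along these lines can collect the relevant details in one go. It is a sketch under assumptions: the helper names are hypothetical, and the distribution names `torch` and `OpenNMT-py` are assumed to match how the packages were installed. It requires Python 3.8+ for `importlib.metadata` and skips the GPU details if PyTorch cannot be imported.

```python
import platform
from importlib import metadata


def collect_versions(dist_names=("torch", "OpenNMT-py")):
    """Return a dict of installed distribution versions plus the Python version."""
    report = {"python": platform.python_version()}
    for name in dist_names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


def collect_gpu_info():
    """Return CUDA/NCCL details if PyTorch is importable, else an empty dict."""
    try:
        import torch
    except ImportError:
        return {}
    info = {
        "cuda_available": torch.cuda.is_available(),
        "visible_gpus": torch.cuda.device_count(),
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
    }
    if info["cuda_available"]:
        try:
            info["nccl"] = torch.cuda.nccl.version()
        except (AttributeError, RuntimeError):
            info["nccl"] = "unavailable"
    return info


def format_report():
    """Render both reports as 'key: value' lines for pasting into an issue."""
    merged = {**collect_versions(), **collect_gpu_info()}
    return "\n".join(f"{key}: {value}" for key, value in merged.items())
```

Calling `print(format_report())` in the same environment used for training gives a block of text that can be pasted straight into a reply.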

I have the same problem. I find that PyTorch 1.7 shows some strange NCCL behavior, which is fixed in 1.8.

chijianlei avatar Mar 15 '21 13:03 chijianlei

Hi, I also have the same problem. Do you know why? I am using pytorch=1.5, OpenNMT=0.4, Python=3.6. Thanks!

Zhw098 avatar Nov 30 '21 03:11 Zhw098

It seems that the version you chose is too old.

chijianlei avatar Nov 30 '21 12:11 chijianlei

> It seems that the version you chose is too old.

I am very happy to receive your reply. Which version do you use? I am a new opennmt-er. Thanks!

Zhw098 avatar Dec 01 '21 07:12 Zhw098

Hi, can anyone help resolve this issue?

SamraMehboob avatar Apr 08 '22 11:04 SamraMehboob

Reopen if you face the issue again.

vince62s avatar Oct 27 '22 13:10 vince62s