OpenNMT-py
Training Stuck at start: Training on 2 GPUs
Hi all, I have a machine with 2 GPUs, and I train an experiment using:

CUDA_VISIBLE_DEVICES=0,1 onmt_train \
    ... \
    -world_size 2 \
    -gpu_ranks 0 1
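For reference, a fuller launch sketch might look like the following. The `-data` and `-save_model` values are placeholders I made up, not paths from the original post, and flag names can differ between OpenNMT-py versions:

```shell
# Hedged sketch of a 2-GPU OpenNMT-py launch; paths below are placeholders.
CUDA_VISIBLE_DEVICES=0,1 onmt_train \
    -data path/to/preprocessed_data \
    -save_model path/to/model \
    -world_size 2 \
    -gpu_ranks 0 1
```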
Previously, training an experiment on two GPUs worked fine. However, starting today, training gets stuck right at the start:

[2021-01-04 18:01:31,220 INFO] Starting training on GPU: [0, 1]
[2021-01-04 18:01:31,220 INFO] Start training loop and validate every 10000 steps...

(then there are no further updates)

Could you share some insights on how to fix this?
Hi, I tested it again today and want to update this thread with more information. Today I launched the same script to train on 2 GPUs and it worked. Sometimes training on 2 GPUs works and sometimes it does not, with the exact same script. I would really appreciate any feedback on what could be going wrong here. Thanks!
I'm not sure what could be happening here. Maybe some strange NCCL behavior. We had some cases in the past where training would get stuck (though not right at the beginning).
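If NCCL is the suspect, one way to get more signal is to enable its debug logging before launching training. `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are standard NCCL environment variables; whether peer-to-peer transport is actually the culprit here is only a guess to rule out:

```shell
# Turn on verbose NCCL logging so a hang shows which rank/transport stalls.
export NCCL_DEBUG=INFO
# Optional experiment (a guess): disable peer-to-peer transport to rule it out.
export NCCL_P2P_DISABLE=1
```

With these set, re-running the same `onmt_train` command should print NCCL initialization details that indicate where the ranks stop making progress.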
Also, which version of PyTorch are you using?
I have the same question. I found that PyTorch 1.7 has some strange NCCL behavior, which is fixed in 1.8.
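To check whether a given PyTorch build predates the 1.8 fix mentioned above, you can compare version strings with `sort -V`. `ver_at_least` is a hypothetical helper name, not part of PyTorch or OpenNMT-py:

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
ver_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# The version reported by `python -c "import torch; print(torch.__version__)"`
# can then be checked against 1.8:
if ver_at_least "1.7.1" "1.8"; then echo "1.7.1: has the fix"; else echo "1.7.1: upgrade"; fi
```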
Hi, I also have the same question. Do you know why? pytorch=1.5, OpenNMT=0.4, Python=3.6. Thanks!
It seems that the version you chose is too old.
I am very happy to receive your reply. Which version do you use? I am new to OpenNMT. Thanks!
Hi, can anyone help resolve this issue?
Reopen if you face the issue again.