OpenNMT-py
Training Stuck at start: Training on 2 GPUs
Hi all, I have a machine with 2 GPUs, and I train an experiment using:

CUDA_VISIBLE_DEVICES=0,1 onmt_train \
    ... \
    -world_size 2 \
    -gpu_ranks 0 1
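For reference, a fuller launch sketch might look like the following. The `-data` and `-save_model` values are placeholders I made up, not paths from the original post, and flag names can differ between OpenNMT-py versions:

```shell
# Hedged sketch of a 2-GPU OpenNMT-py launch; paths below are placeholders.
CUDA_VISIBLE_DEVICES=0,1 onmt_train \
    -data path/to/preprocessed_data \
    -save_model path/to/model \
    -world_size 2 \
    -gpu_ranks 0 1
```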
Previously, training an experiment on two GPUs worked fine. However, starting today, training gets stuck right at the start:

[2021-01-04 18:01:31,220 INFO] Starting training on GPU: [0, 1]
[2021-01-04 18:01:31,220 INFO] Start training loop and validate every 10000 steps...

(then there are no further updates)

Could you share some insights on how to fix this?
Hi, I tested it again today and want to update this thread with more information. Today I launched the same script to train on 2 GPUs and it worked. Sometimes training on 2 GPUs works and sometimes it does not, with the exact same script. I would really appreciate any feedback on what could be going wrong here. Thanks!
I'm not sure what could be happening here. Maybe some strange NCCL behavior. We had some cases in the past where training would get stuck (though not right at the beginning).
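If NCCL is the suspect, one way to get more signal is to enable its debug logging before launching training. `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are standard NCCL environment variables; whether peer-to-peer transport is actually the culprit here is only a guess to rule out:

```shell
# Turn on verbose NCCL logging so a hang shows which rank/transport stalls.
export NCCL_DEBUG=INFO
# Optional experiment (a guess): disable peer-to-peer transport to rule it out.
export NCCL_P2P_DISABLE=1
```

With these set, re-running the same `onmt_train` command should print NCCL initialization details that indicate where the ranks stop making progress.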
Also, which version of PyTorch are you using?
I have the same question. I found that PyTorch 1.7 has some strange NCCL behavior, which is fixed in 1.8.
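To check whether a given PyTorch build predates the 1.8 fix mentioned above, you can compare version strings with `sort -V`. `ver_at_least` is a hypothetical helper name, not part of PyTorch or OpenNMT-py:

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
ver_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# The version reported by `python -c "import torch; print(torch.__version__)"`
# can then be checked against 1.8:
if ver_at_least "1.7.1" "1.8"; then echo "1.7.1: has the fix"; else echo "1.7.1: upgrade"; fi
```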
Hi, I also have the same question. Do you know why? pytorch=1.5, OpenNMT=0.4, Python=3.6. Thanks!
It seems that the version you chose is too old.
I am very happy to receive your reply. Which version do you use? I am new to OpenNMT. Thanks!
Hi, can anyone help resolve this issue?
Reopen if you face the issue again.