YOLOX Multi node training question

Hi, I am trying to train YOLOX on 2 nodes, each with 8 GPUs. Both servers can connect to each other over SSH.

After I start the multi-node script, it initializes the GPUs and then hangs without making any progress.

Environment

python 3.8.10
torch 1.13.1+cu116
torchvision 0.14.1+cu116
cuda 11.6
libcudnn8 8.4.1.50-1

Thanks

May 20 '23 14:05 PurvangL

I am trying the same thing and it is not working.

May 20 '23 16:05 Lifeguard-alex

It looks like this project is dead; no one is answering on GitHub.

May 25 '23 10:05 Lifeguard-alex

@Lifeguard-alex were you able to run using more than one node?

Jul 05 '23 17:07 PurvangL

@FateScript @Joker316701882

Jul 05 '23 20:07 PurvangL

@PurvangL What command did you run on each of the two machines?

I was getting a similar error because I passed wrong values for num_machines and machine_rank.
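
To illustrate the kind of mismatch meant here, below is a minimal sketch of a sanity check on these two values. It assumes that num_machines is the total number of nodes, that machine_rank is a zero-based index that must be different on every node, and that each node has the same number of GPUs. The helper names (global_rank, check_launch_args, check_cluster) are hypothetical and are not part of YOLOX. If two nodes pass the same machine_rank, or the nodes disagree on num_machines, the process group may wait for ranks that never join, which would look like a hang after GPU initialization.

```python
# Sketch only: these helpers are hypothetical and not part of YOLOX.
from typing import List


def global_rank(machine_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Rank of one process across all nodes, assuming equal GPUs per node."""
    return machine_rank * gpus_per_node + local_rank


def check_launch_args(num_machines: int, machine_rank: int) -> None:
    """Raise ValueError if one node's arguments cannot be valid."""
    if num_machines < 1:
        raise ValueError("num_machines must be at least 1")
    if not 0 <= machine_rank < num_machines:
        raise ValueError(
            f"machine_rank {machine_rank} is outside 0..{num_machines - 1}"
        )


def check_cluster(num_machines: int, machine_ranks: List[int]) -> None:
    """Check the ranks passed on all nodes: each of 0..num_machines-1 exactly once."""
    for rank in machine_ranks:
        check_launch_args(num_machines, rank)
    if sorted(machine_ranks) != list(range(num_machines)):
        raise ValueError(f"ranks {machine_ranks} do not cover every node exactly once")


if __name__ == "__main__":
    # The setup from this thread: 2 nodes with 8 GPUs each.
    check_cluster(num_machines=2, machine_ranks=[0, 1])
    world_size = 2 * 8
    # First process on the second node, and the total number of processes.
    print(global_rank(machine_rank=1, local_rank=0, gpus_per_node=8), world_size)
```

With the values from this thread, the first node would use machine_rank 0 and the second machine_rank 1, with num_machines set to 2 on both.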

Aug 23 '23 07:08 deepakcrk

Same issue. Multi-GPU doesn't seem to work at all.

Sep 13 '23 00:09 ksaluja15

Related issue: https://github.com/Megvii-BaseDetection/YOLOX/issues/1316

Sep 13 '23 00:09 ksaluja15