
Multi node training question

Open PurvangL opened this issue 2 years ago • 7 comments

Hi, I am trying to train YOLOX on 2 nodes, each with 8 GPUs. Both servers can reach each other over SSH.

After starting the multi-node script, it initializes the GPUs and then hangs without making any progress.

environment

python 3.8.10
torch 1.13.1+cu116
torchvision 0.14.1+cu116
cuda 11.6
libcudnn8 8.4.1.50-1

Thanks

PurvangL avatar May 20 '23 14:05 PurvangL

I am trying the same thing and it's not working.

Lifeguard-alex avatar May 20 '23 16:05 Lifeguard-alex

It looks like this project is dead; no one is answering in the repository.

Lifeguard-alex avatar May 25 '23 10:05 Lifeguard-alex

@Lifeguard-alex were you able to run using more than one node?

PurvangL avatar Jul 05 '23 17:07 PurvangL

@FateScript @Joker316701882

PurvangL avatar Jul 05 '23 20:07 PurvangL

@PurvangL What command did you run on each of the two machines?

I was getting a similar error because of wrong values that I passed for num_machines and machine_rank.
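
To illustrate the kind of mistake meant here, below is a minimal sketch of a sanity check on those two values. It assumes the usual convention that every machine passes the same num_machines and a distinct machine_rank from 0 to num_machines - 1, with one process per GPU. The helper name `check_launch_args` is hypothetical and not part of YOLOX. If two machines share a rank, or num_machines is wrong, the distributed setup typically waits forever for the missing ranks, which would look like the hang described above.

```python
from typing import List


def check_launch_args(num_machines: int, machine_ranks: List[int], gpus_per_machine: int) -> int:
    """Validate the per-machine launch values and return the total process count.

    machine_ranks holds the machine_rank value passed on each machine.
    This is a hypothetical helper for illustration, not a YOLOX function.
    """
    if num_machines < 1 or gpus_per_machine < 1:
        raise ValueError("num_machines and gpus_per_machine must be positive")
    if len(machine_ranks) != num_machines:
        raise ValueError(
            f"expected {num_machines} ranks, got {len(machine_ranks)}"
        )
    # Ranks must be exactly 0 .. num_machines - 1, each used once.
    if sorted(machine_ranks) != list(range(num_machines)):
        raise ValueError(
            f"machine_rank values must be 0..{num_machines - 1} with no duplicates, "
            f"got {machine_ranks}"
        )
    # Assumption: one training process per GPU.
    return num_machines * gpus_per_machine


if __name__ == "__main__":
    # Two machines with 8 GPUs each, as in the original question.
    print(check_launch_args(2, [0, 1], 8))
```

With the setup from the question (2 machines, 8 GPUs each), the ranks should be 0 on one machine and 1 on the other; passing 0 on both would be rejected by this check.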

deepakcrk avatar Aug 23 '23 07:08 deepakcrk

Same issue. Multi-GPU training doesn't seem to work at all.

ksaluja15 avatar Sep 13 '23 00:09 ksaluja15

Related issue: https://github.com/Megvii-BaseDetection/YOLOX/issues/1316

ksaluja15 avatar Sep 13 '23 00:09 ksaluja15