Multi-node training question
Hi, I am trying to train YOLOX on 2 nodes, each with 8 GPUs. Both servers can connect to each other over SSH.
After I start the multi-node script, it initializes the GPUs and then hangs without making any progress.
Environment:
python 3.8.10, torch 1.13.1+cu116, torchvision 0.14.1+cu116, cuda 11.6, libcudnn8 8.4.1.50-1
Thanks
I am trying the same thing and it is not working.
It looks like this project is dead; no one is answering in the repo.
@Lifeguard-alex were you able to run using more than one node?
@FateScript @Joker316701882
@PurvangL What command did you run on each of the two machines?
I was getting a similar error because of wrong values I passed for num_machines and machine_rank.
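In case it helps, here is a small sanity check for those two values. It is only a sketch, not YOLOX code: the helper names (global_ranks, check_cluster) are made up for illustration, and it assumes the usual layout where global rank = machine_rank * gpus_per_machine + local_rank. If two machines pass the same machine_rank, or num_machines does not match the number of machines actually launched, the process group will likely wait for ranks that never join, which can look like a hang after GPU initialization.

```python
from typing import List


def global_ranks(num_machines: int, machine_rank: int, gpus_per_machine: int) -> List[int]:
    """Return the global ranks one machine would own (hypothetical helper).

    Assumes global_rank = machine_rank * gpus_per_machine + local_rank.
    """
    if num_machines < 1 or gpus_per_machine < 1:
        raise ValueError("num_machines and gpus_per_machine must be >= 1")
    if not 0 <= machine_rank < num_machines:
        raise ValueError(
            f"machine_rank must be in [0, {num_machines - 1}], got {machine_rank}"
        )
    first = machine_rank * gpus_per_machine
    return list(range(first, first + gpus_per_machine))


def check_cluster(num_machines: int, machine_ranks: List[int], gpus_per_machine: int) -> int:
    """Check the machine_rank values used across all machines; return the world size.

    machine_ranks is the list of values passed on each machine, e.g. [0, 1].
    Every rank from 0 to num_machines - 1 must appear exactly once.
    """
    if sorted(machine_ranks) != list(range(num_machines)):
        raise ValueError(
            f"expected machine ranks {list(range(num_machines))}, got {sorted(machine_ranks)}"
        )
    return num_machines * gpus_per_machine


if __name__ == "__main__":
    # Two machines with 8 GPUs each, as in the original post.
    print("world size:", check_cluster(2, [0, 1], 8))
    print("machine 0 ranks:", global_ranks(2, 0, 8))
    print("machine 1 ranks:", global_ranks(2, 1, 8))
```

For the setup in the original post, one machine should use machine_rank 0 and the other machine_rank 1, with num_machines set to 2 on both.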
Same issue. Multi-GPU doesn't seem to work at all.
Related issue: https://github.com/Megvii-BaseDetection/YOLOX/issues/1316