
Multi Node training

Open · AlexIlis opened this issue 1 year ago · 1 comment

Can you suggest how to implement multi-GPU, multi-node training with torchpack?

I set -H ip1:gpus,ip2:gpus and launched training from both nodes, but they don't seem to find each other. What am I missing here?

AlexIlis, Oct 24 '23 16:10

Could you try to SSH into ip1 and ip2? You need to make sure that both machines can be reached over SSH without a password.

zhijian-liu, Dec 11 '23 03:12
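
For anyone wanting to script the passwordless-SSH check suggested in the reply, here is a minimal sketch in Python. It is not part of torchpack; the helper names (parse_hosts, ssh_check_command, can_ssh_without_password) are made up for illustration, and it assumes the -H value has the host:slots,host:slots form used in the question. It relies only on the standard OpenSSH options BatchMode and ConnectTimeout, so ssh fails right away instead of prompting for a password.

```python
import subprocess
import sys


def parse_hosts(spec):
    """Split a 'host1:slots,host2:slots' string into a list of host names.

    Assumption: entries are comma-separated and the slot count, if any,
    follows the first colon (as in the -H value from the question).
    """
    hosts = []
    for entry in spec.split(","):
        entry = entry.strip()
        if entry:
            # Keep only the part before the slot count.
            hosts.append(entry.split(":", 1)[0])
    return hosts


def ssh_check_command(host, timeout=5):
    """Build an ssh command that runs a no-op on the remote host."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt; fail if a password is needed
        "-o", f"ConnectTimeout={timeout}",
        host,
        "true",
    ]


def can_ssh_without_password(host, timeout=5):
    """Return True if the no-op command succeeds without any prompt."""
    try:
        result = subprocess.run(
            ssh_check_command(host, timeout),
            capture_output=True,
            timeout=timeout + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python check_ssh.py ip1:8,ip2:8
    for host in parse_hosts(sys.argv[1]):
        status = "ok" if can_ssh_without_password(host) else "FAILED"
        print(f"{host}: passwordless SSH {status}")
```

Running this on each node, with the same host list passed to -H, shows whether every machine can reach the others without a password prompt. A FAILED line usually points to a missing key in authorized_keys or an unreachable address, though the script itself cannot tell those cases apart.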