
My servers used for multi-node training do not have ssh. How can I launch multi-node training using the torchrun command?

Open dingning97 opened this issue 4 months ago • 2 comments

The machines I use for multi-node training do not allow the ssh service. How can I launch multi-node training using only the basic torchrun command (or torch.distributed.launch)?

The servers I use do not have Slurm, and I found that both OpenMPI and pdsh rely on ssh. So I have run out of the options listed in this repo's README for starting a multi-node training job.
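
To make the question concrete, here is a minimal sketch of the kind of manual launch meant here: the same torchrun command is started by hand on every node, with only the node rank changing, so nothing has to reach the other machines over ssh. The helper name `build_torchrun_cmd`, the script name `train.py`, the address `10.0.0.1`, and the port are assumptions for illustration. Whether gpt-neox's training entry point can be started this way without its usual launcher is exactly the open question and is not confirmed here.

```python
import shlex


def build_torchrun_cmd(node_rank, nnodes, nproc_per_node, master_addr,
                       master_port=29500, script="train.py", script_args=()):
    """Build the torchrun argv for one node (hypothetical helper).

    Every node runs the same command; only node_rank differs.
    The script name and its arguments are placeholders, not gpt-neox's
    confirmed entry point.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # total number of machines
        f"--nproc_per_node={nproc_per_node}",  # processes (GPUs) per machine
        f"--node_rank={node_rank}",            # 0 on the master, 1..N-1 elsewhere
        f"--master_addr={master_addr}",        # address of the rank-0 machine
        f"--master_port={master_port}",        # free TCP port on the rank-0 machine
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Print the command each of two nodes would run locally.
    for rank in range(2):
        cmd = build_torchrun_cmd(node_rank=rank, nnodes=2, nproc_per_node=8,
                                 master_addr="10.0.0.1")
        print(shlex.join(cmd))
```

Each printed line would be pasted into a shell on the matching machine (or run by whatever job scheduler is available), so the only network requirement is that all nodes can reach the master address and port.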

dingning97 avatar Apr 23 '24 09:04 dingning97

I also encountered the same problem. Have you found a solution?

WKX933 avatar Apr 30 '24 07:04 WKX933