lightning-hydra-template

Training stuck when submitting a job to Slurm with multi-GPU and DDP

Open: sri9s opened this issue 1 year ago • 1 comment

The training gets stuck and I get the following error:

The client socket has failed to connect to [ip6-localhost]:24355 (errno: 99 - Cannot assign requested address)

Need help with this.

sri9s, Aug 20 '23 11:08

Same question.

patrick-tssn, Dec 12 '23 10:12