ColossalAI
[BUG]: torch.distributed.elastic.rendezvous.dynamic_rendezvous
🐛 Describe the bug
When I use 2 nodes (16 GPUs) to run roberta/pretraining, I hit this error. Can you help me?
Environment
CUDA 11.6, Python 3.10, PyTorch 1.12.1
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'xxxxxx' has failed to send a keep-alive heartbeat to the rendezvous 'colossalai-default-job' due to an error of type RendezvousConnectionError.
I got the same problem. Have you solved it yet?
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'xxxxxx' has failed to send a keep-alive heartbeat to the rendezvous 'colossalai-default-job' due to an error of type RendezvousConnectionError.
Hi @xyease, what was your command?
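In the meantime, a quick check that might help narrow this down: `RendezvousConnectionError` is raised when the connection to the rendezvous backend fails, so it may be worth confirming that each node can open a TCP connection to the rendezvous endpoint. The snippet below is only a minimal sketch. The `can_reach` helper is hypothetical (not part of ColossalAI or PyTorch), and the host and port are placeholders, not values from this thread; substitute whatever master address and port your launch command uses.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


if __name__ == "__main__":
    # Placeholder values (assumptions): replace with your own master host and port
    rdzv_host = "127.0.0.1"
    rdzv_port = 29500
    status = "reachable" if can_reach(rdzv_host, rdzv_port) else "NOT reachable"
    print(f"{rdzv_host}:{rdzv_port} is {status}")
```

Run it on every worker node while the job is starting up on the master node. If it reports the endpoint as not reachable from some node, the problem is likely a firewall, a wrong address, or a port mismatch rather than the training script itself.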