
[BUG]: torch.distributed.elastic.rendezvous.dynamic_rendezvous

Open hujunchao opened this issue 2 years ago • 3 comments

🐛 Describe the bug

When I use 2 nodes (16 GPUs) to run roberta/pretraining, I get this error. Can you help me?

Environment

CUDA 11.6, Python 3.10, PyTorch 1.12.1

hujunchao avatar Feb 16 '23 07:02 hujunchao

WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'xxxxxx' has failed to send a keep-alive heartbeat to the rendezvous 'colossalai-default-job' due to an error of type RendezvousConnectionError.

hujunchao avatar Feb 16 '23 08:02 hujunchao

I got the same problem. Have you solved it yet?

WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'xxxxxx' has failed to send a keep-alive heartbeat to the rendezvous 'colossalai-default-job' due to an error of type RendezvousConnectionError.

xyease avatar Apr 17 '23 05:04 xyease

Hi @xyease, what was your command?

JThh avatar Apr 18 '23 11:04 JThh
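
A `RendezvousConnectionError` means the node could not reach the rendezvous backend. One possible cause, not confirmed anywhere in this thread, is that the worker node cannot open a TCP connection to the master node's rendezvous port. The sketch below is a minimal connectivity check that could be run on each node before launching. The address and port shown are placeholders (29500 is the usual `torchrun` default, but the port used in this job is not stated), and `can_reach` is a hypothetical helper, not part of ColossalAI or PyTorch.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder values (assumptions): replace with the master node's
    # address and the rendezvous port actually used by the launch command.
    MASTER_ADDR = "127.0.0.1"
    MASTER_PORT = 29500
    reachable = can_reach(MASTER_ADDR, MASTER_PORT)
    print(f"{MASTER_ADDR}:{MASTER_PORT} reachable: {reachable}")
```

The check only succeeds while something is listening on the port, so it is most useful when run from a worker node after the launcher has started on the master node. If it returns `False`, firewall rules, the wrong network interface, or a mismatched port are worth checking before looking at the training code.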