
[Bug] RuntimeError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout

Open kkscilife opened this issue 1 year ago • 0 comments

Describe the bug

The failure is intermittent: a Socket Timeout is occasionally raised at group.allreduce([tensor], opts). Traceback excerpt:

    if group in _world.pg_coalesce_state.keys():
        # We are in coalescing context, do not issue single operation, just append a collective representation
        coll = _CollOp(all_reduce, tensor, None, op, None)
        _world.pg_coalesce_state[group].append(coll)
        if async_op:
            return _IllegalWork()
        else:
            return None

    work = group.allreduce([tensor], opts)

E RuntimeError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
E Exception raised from doWait at ../torch/csrc/distributed/c10d/TCPStore.cpp:445 (most recent call first):

More information: https://github.com/InternLM/InternEvo/actions/runs/10012995770/job/27889229378
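For readers unfamiliar with the handshake named in the error message, here is a toy sketch of the pattern it describes: one rank publishes an id under key '0' in a shared key-value store, and another rank blocks on get('0') until the key appears or a timeout elapses. This is only an illustration under stated assumptions, not c10d's TCPStore implementation. `ToyStore`, `fetch_unique_id`, the in-memory dictionary, and the fake id bytes are all hypothetical names and values made up for the example. It uses only the Python standard library.

```python
import threading


class ToyStore:
    """Hypothetical in-memory stand-in for a c10d-style key-value store."""

    def __init__(self, timeout_s):
        self._data = {}
        self._cond = threading.Condition()
        self._timeout_s = timeout_s

    def set(self, key, value):
        # Publish a value and wake up any waiters.
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def get(self, key):
        # Block until the key exists or the timeout elapses.
        with self._cond:
            found = self._cond.wait_for(
                lambda: key in self._data, timeout=self._timeout_s
            )
            if not found:
                raise TimeoutError(f"store.get({key!r}) timed out")
            return self._data[key]


def fetch_unique_id(store, publish_delay_s):
    """Simulate rank 0 publishing an id after a delay while another rank waits."""
    publisher = threading.Timer(
        publish_delay_s, store.set, args=("0", b"fake-nccl-unique-id")
    )
    publisher.start()
    try:
        return store.get("0")
    except TimeoutError as exc:
        return f"error: {exc}"
    finally:
        publisher.join()


if __name__ == "__main__":
    # Rank 0 publishes well within the waiter's timeout.
    print(fetch_unique_id(ToyStore(timeout_s=1.0), publish_delay_s=0.05))
    # Rank 0 publishes too late, so the waiter gives up first.
    print(fetch_unique_id(ToyStore(timeout_s=0.05), publish_delay_s=0.3))
```

In this toy model the second call fails only because the publisher is slower than the waiter's timeout, which is one plausible way such a failure can be intermittent. Whether that is what happens in the CI run linked above is not established by this report.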

Environment

Python 3.10, PyTorch 2.1

Other information

No response

kkscilife, Jul 25 '24 05:07