
NCCL RuntimeError

Open: edwardelric1202 opened this issue 1 year ago • 1 comment

After running successfully for several minutes, the following error occurred:

RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805457 milliseconds before timing out.

Is this caused by a timeout setting? If not, what is the problem?

edwardelric1202, Apr 23 '23 02:04

Same problem here when running the int4 version :(

wuyn639, May 25 '23 02:05
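
On the timeout question: the `Timeout(ms)=1800000` in the error is 30 minutes, and PyTorch lets the caller change it through the `timeout` argument of `torch.distributed.init_process_group`. The sketch below shows one way that could look. It is not taken from the GLM-130B code: the helper names and the `DIST_TIMEOUT_MINUTES` variable are hypothetical, and the thread does not show where GLM-130B (or the library it builds on) initializes the process group, so the call may need to be adjusted there instead. Raising the timeout may only help when the broadcast is genuinely slow; if one rank has hung or crashed, the job would still fail, just later.

```python
import os
from datetime import timedelta

# The watchdog message reports Timeout(ms)=1800000, which is 30 minutes.
REPORTED_TIMEOUT = timedelta(milliseconds=1800000)


def pick_timeout(env_var="DIST_TIMEOUT_MINUTES", default_minutes=30):
    """Read a timeout in minutes from an environment variable.

    The variable name is hypothetical; it is not a PyTorch or NCCL setting.
    """
    minutes = float(os.environ.get(env_var, default_minutes))
    if minutes <= 0:
        raise ValueError("timeout must be positive")
    return timedelta(minutes=minutes)


def init_distributed(timeout):
    """Initialize the NCCL process group with an explicit collective timeout."""
    # Imported lazily so pick_timeout() works without PyTorch installed.
    import torch.distributed as dist

    # Assumes the launcher (e.g. torchrun) has set RANK, WORLD_SIZE,
    # MASTER_ADDR and MASTER_PORT for the default env:// rendezvous.
    dist.init_process_group(backend="nccl", timeout=timeout)
```

Under these assumptions, a launch script would call `init_distributed(pick_timeout())` once per process before any collective runs. Setting the NCCL environment variable `NCCL_DEBUG=INFO` before launching may also help show which rank stops responding.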