GLM-130B
NCCL RuntimeError
The job ran successfully for several minutes, then failed with this error:
RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805457 milliseconds before timing out.
Is this caused by a timeout setting? If not, what is the problem?
Same problem here when running the int4 version :(
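On the timeout question: the two numbers in the error message can be read directly. Below is a minimal sketch (standard library only) that parses them out of the message quoted above; the helper name `parse_watchdog_timeout` is made up for illustration and is not part of GLM-130B or PyTorch.

```python
import re
from datetime import timedelta

# Error text copied from the report above.
ERROR = (
    "[Rank 2] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=10, OpType=BROADCAST, Timeout(ms)=1800000) "
    "ran for 1805457 milliseconds before timing out."
)


def parse_watchdog_timeout(message):
    """Hypothetical helper: return (configured_timeout, elapsed) as timedeltas."""
    limit = re.search(r"Timeout\(ms\)=(\d+)", message)
    ran = re.search(r"ran for (\d+) milliseconds", message)
    if limit is None or ran is None:
        raise ValueError("not a recognised NCCL watchdog timeout message")
    return (
        timedelta(milliseconds=int(limit.group(1))),
        timedelta(milliseconds=int(ran.group(1))),
    )


configured, elapsed = parse_watchdog_timeout(ERROR)
print(configured)  # 0:30:00
print(elapsed)     # 0:30:05.457000

# Assumption: if the launch code calls torch.distributed.init_process_group
# directly, a longer limit could be passed through its `timeout` argument,
# e.g. timeout=timedelta(hours=2). Whether the GLM-130B scripts expose that
# call is not confirmed here.
```

So the configured limit was 30 minutes, and rank 2 had been waiting in a BROADCAST for slightly longer than that when the watchdog aborted it. A collective that waits this long usually means at least one other rank never reached the same call, so raising the timeout would probably only delay the error rather than fix it. That reading is a guess from the message alone, not something verified against this setup.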