stanford_alpaca

NET/IB : Got completion from peer 11.214.147.122<39138> with error 12, opcode 0, len 0, vendor err 129

Open: lmx760581375 opened this issue 1 year ago • 1 comment

[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error: remote process exited or there was a network error, NCCL version 2.14.3
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
NET/IB : Got completion from peer 11.214.147.122<39138> with error 12, opcode 0, len 0, vendor err 129
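To make the last line of the log easier to read, here is a minimal sketch that pulls the fields out of the `NET/IB` message and names the completion status. It assumes the `error` number is the InfiniBand work-completion status (`enum ibv_wc_status` from libibverbs), in which 12 is `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded). The function name `parse_net_ib_error` and the partial lookup table are illustrative only and are not part of NCCL or PyTorch.

```python
import re

# Partial copy of libibverbs' enum ibv_wc_status (assumption: NCCL prints
# this value as "error N" in its NET/IB completion message).
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",      # transport retry counter exceeded
    13: "IBV_WC_RNR_RETRY_EXC_ERR",  # receiver-not-ready retry counter exceeded
}

# Matches the message format seen in the log above.
NET_IB_PATTERN = re.compile(
    r"NET/IB : Got completion from peer (?P<peer>[^<\s]+)<(?P<port>\d+)> "
    r"with error (?P<status>\d+), opcode (?P<opcode>\d+), "
    r"len (?P<length>\d+), vendor err (?P<vendor_err>\d+)"
)


def parse_net_ib_error(line):
    """Hypothetical helper: return the parsed fields as a dict, or None."""
    match = NET_IB_PATTERN.search(line)
    if match is None:
        return None
    status = int(match.group("status"))
    return {
        "peer": match.group("peer"),
        "port": int(match.group("port")),
        "status": status,
        # Fall back to a placeholder for codes missing from the partial table.
        "status_name": IBV_WC_STATUS.get(status, "UNKNOWN"),
        "opcode": int(match.group("opcode")),
        "length": int(match.group("length")),
        "vendor_err": int(match.group("vendor_err")),
    }


LOG_LINE = (
    "NET/IB : Got completion from peer 11.214.147.122<39138> "
    "with error 12, opcode 0, len 0, vendor err 129"
)
parsed = parse_net_ib_error(LOG_LINE)
```

Under that assumption, the status means the sender gave up after retransmitting without an acknowledgement from the peer at 11.214.147.122. That matches NCCL's own description of a network error or a remote process exiting prematurely, so the logs of the rank on that host are worth checking first.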

lmx760581375 · Apr 13 '23 13:04