
RDMA_FORK_SAFE

Open visualxu opened this issue 2 years ago • 0 comments

Hi, I wrote a communication framework for our company's self-developed GPGPU, using the IB (ibverbs) interface of Gloo. When I use torch.utils.data.DataLoader, which forks many worker processes, I get the following error:

gloo/transport/ibverbs/pair.cc:438] wc->status == IBV_WC_SUCCESS. 5 vs 0. Send for slot 0: Work Request Flushed Error

After debugging, I found that the problem is caused by libibverbs's incomplete support for fork(): https://www.rdmamojo.com/2012/05/24/ibv_fork_init/

I think we need to either prompt users of the InfiniBand interface to set the environment variable RDMA_FORK_SAFE or IBV_FORK_SAFE, or call ibv_fork_init() when initializing IB, as NCCL does (gloo/ibverbs/device.cc).
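For anyone hitting the same error, here is a minimal sketch of the environment-variable workaround from the user side. The helper name `enable_ibverbs_fork_safety` is hypothetical, and the value "1" is an assumption: as far as I can tell libibverbs only checks whether the variable is set. It also assumes the variables are set before libibverbs is first initialized in the process, so it should run before the Gloo IB transport is created and before any DataLoader workers are forked.

```python
import os

# Variable names taken from the issue text above.
FORK_SAFE_VARS = ("RDMA_FORK_SAFE", "IBV_FORK_SAFE")


def enable_ibverbs_fork_safety(env=os.environ):
    """Set the fork-safety variables in `env` unless they are already set.

    Hypothetical helper, not part of Gloo or PyTorch. The value "1" is an
    assumption; an existing user-provided value is left untouched.
    """
    for name in FORK_SAFE_VARS:
        env.setdefault(name, "1")
    return {name: env[name] for name in FORK_SAFE_VARS}


# Call this at the very top of the training script, before the IB transport
# is initialized and before any worker processes are forked.
enable_ibverbs_fork_safety()
```

Exporting the same variables in the shell that launches the job should have the same effect and avoids depending on import order inside the script.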

visualxu, Sep 09 '22 11:09