flux
[QUESTION] Does Flux support RoCE NICs?
Description: While running a cross-node test with NVSHMEM, I hit a failure. The script I used is test_ag_kernel_crossnode.py, and the error is shown below.
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:422: non-zero status: 110 ibv_modify_qp failed
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1437: non-zero status: 7 ep_connect failed
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1504: non-zero status: 7 transport create connect failed
/flux/3rdparty/nvshmem/src/host/transport/transport.cpp:394: non-zero status: 7 connect EPS failed
/flux/3rdparty/nvshmem/src/host/init/init.cu:981: non-zero status: 7 nvshmem setup connections failed
I tested on a 2-node A100 cluster where each node has 8 GPUs and a RoCE NIC. What could be the cause?
Flux has not been tested on RoCE NICs. This may be a problem with NVSHMEM itself. Can you run the NVSHMEM examples with nvshmrun on the RoCE NICs?
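To make the error chain above easier to read, here is a minimal sketch that parses the "non-zero status" lines and picks out the innermost failure. The helper name parse_nvshmem_errors and the regular expression are hypothetical and not part of Flux or NVSHMEM. It also assumes, without confirmation from the NVSHMEM source, that the status on the ibv_modify_qp line is a raw errno value. On Linux, errno 110 is ETIMEDOUT, which would suggest the queue pair could not reach its peer rather than a bug in the Flux kernel.

```python
import errno
import os
import re

# Hypothetical pattern for lines like:
#   <path>:<line>: non-zero status: <code> <message>
LOG_LINE = re.compile(
    r"^(?P<path>\S+):(?P<line>\d+): non-zero status: (?P<status>\d+) (?P<message>.+)$"
)

def parse_nvshmem_errors(text):
    """Return one dict per 'non-zero status' line, in log order."""
    entries = []
    for raw in text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match:
            entries.append({
                "file": os.path.basename(match.group("path")),
                "line": int(match.group("line")),
                "status": int(match.group("status")),
                "message": match.group("message"),
            })
    return entries

# The log from the report above, copied verbatim.
LOG = """\
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:422: non-zero status: 110 ibv_modify_qp failed
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1437: non-zero status: 7 ep_connect failed
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1504: non-zero status: 7 transport create connect failed
/flux/3rdparty/nvshmem/src/host/transport/transport.cpp:394: non-zero status: 7 connect EPS failed
/flux/3rdparty/nvshmem/src/host/init/init.cu:981: non-zero status: 7 nvshmem setup connections failed
"""

entries = parse_nvshmem_errors(LOG)

# The first line is the innermost failure; later lines are callers
# propagating the error upward.
root = entries[0]

# Assumption: the status on the ibv_modify_qp line is an errno value.
# errno numbering is platform-specific, so the name is looked up at runtime.
errno_name = errno.errorcode.get(root["status"], "unknown")

print(f"root cause: {root['message']} at {root['file']}:{root['line']}")
print(f"status {root['status']} as errno on this platform: {errno_name}")
```

If the errno reading is right, the failure sits in the NVSHMEM IBRC transport setup, which is consistent with the suggestion to first reproduce it with the plain NVSHMEM examples outside Flux.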