
[QUESTION] Does FLUX support RoCE NICs?

Open · xuzhenguoloveyjh opened this issue 10 months ago • 1 comment

While running a cross-node test with NVSHMEM, I hit an error. The script I used is test_ag_kernel_crossnode.py, and the error output is below.

/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:422: non-zero status: 110 ibv_modify_qp failed
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1437: non-zero status: 7 ep_connect failed
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1504: non-zero status: 7 transport create connect failed
/flux/3rdparty/nvshmem/src/host/transport/transport.cpp:394: non-zero status: 7 connect EPS failed
/flux/3rdparty/nvshmem/src/host/init/init.cu:981: non-zero status: 7 nvshmem setup connections failed
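For reading a trace like this one, here is a minimal sketch that splits each line into file, line number, status, and message. It assumes the first line is the innermost failure and that the later lines are callers reporting it on the way up. It also assumes, following the libibverbs convention that ibv_modify_qp returns an errno value on failure, that status 110 can be looked up as a Linux errno. The names LOG, LINE_RE, and parse_trace are hypothetical helpers and are not part of Flux or NVSHMEM.

```python
import errno
import os
import re

# The trace from this issue, copied verbatim.
LOG = """\
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:422: non-zero status: 110 ibv_modify_qp failed
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1437: non-zero status: 7 ep_connect failed
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1504: non-zero status: 7 transport create connect failed
/flux/3rdparty/nvshmem/src/host/transport/transport.cpp:394: non-zero status: 7 connect EPS failed
/flux/3rdparty/nvshmem/src/host/init/init.cu:981: non-zero status: 7 nvshmem setup connections failed
"""

# <path>:<line>: non-zero status: <code> <message>
LINE_RE = re.compile(
    r"^(?P<path>\S+):(?P<line>\d+): non-zero status: (?P<status>\d+) (?P<msg>.+)$"
)


def parse_trace(text):
    """Return one dict per trace line, in the order the lines appear."""
    entries = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip blank or unrelated lines
        entries.append(
            {
                "file": os.path.basename(match.group("path")),
                "line": int(match.group("line")),
                "status": int(match.group("status")),
                "msg": match.group("msg"),
            }
        )
    return entries


entries = parse_trace(LOG)
root = entries[0]  # assumption: first line is the innermost failure

# Assumption: the status of the ibv_modify_qp line is an errno value.
# The name printed depends on the platform this script runs on.
errno_name = errno.errorcode.get(root["status"], "unknown")
print(f"root failure: {root['msg']} at {root['file']}:{root['line']}")
print(f"status {root['status']} as errno: {errno_name} ({os.strerror(root['status'])})")
for entry in entries[1:]:
    print(f"  propagated: {entry['msg']} (status {entry['status']})")
```

If that reading holds, the line to investigate is the ibv_modify_qp call; the status 7 lines after it only repeat the failure at each higher layer.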

I tested on a 2-node A100 cluster where each node has 8 GPUs and a RoCE NIC. What could be the possible reasons?

xuzhenguoloveyjh, Feb 15 '25 14:02

Flux has not been tested on RoCE NICs. This may be a problem with NVSHMEM itself. Can you run the NVSHMEM examples with nvshmrun on the RoCE NICs?

houqi, Mar 06 '25 23:03