After I installed nvshmem_src, NCCL fails with the error below. How can I fix it?
W0429 15:49:19.326000 140046767908672 torch/distributed/run.py:779]
W0429 15:49:19.326000 140046767908672 torch/distributed/run.py:779] *****************************************
W0429 15:49:19.326000 140046767908672 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0429 15:49:19.326000 140046767908672 torch/distributed/run.py:779] *****************************************
[rank9]: Traceback (most recent call last):
[rank9]: File "/sharedata/msm/workspace/DeepEP/test_nccl_rank.py", line 30, in
Set the environment variable NCCL_DEBUG=INFO so that NCCL prints a detailed log. You may also want to verify that NCCL is installed correctly by running nccl-tests.
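A minimal sketch of the NCCL_DEBUG suggestion, assuming the job is started from a Python launcher. The helper name `enable_nccl_debug` and the commented-out `torchrun` arguments are hypothetical and only for illustration; `NCCL_DEBUG` and the script name `test_nccl_rank.py` are the only parts taken from this thread.

```python
import os


def enable_nccl_debug(env=None, level="INFO"):
    """Return a copy of `env` (default: the current environment) with NCCL_DEBUG set."""
    new_env = dict(os.environ if env is None else env)
    new_env["NCCL_DEBUG"] = level
    return new_env


# Hypothetical usage: pass the environment to the launcher process.
# The torchrun arguments below are an assumption, not the reporter's command.
#
# import subprocess
# subprocess.run(
#     ["torchrun", "--nproc_per_node=8", "test_nccl_rank.py"],
#     env=enable_nccl_debug(),
#     check=True,
# )
```

The variable has to be set before the NCCL communicator is created, which is why the sketch sets it in the launcher's environment rather than inside the training code.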
msm-h200-2:4009:4009 [0] NCCL INFO cudaDriverVersion 12040 msm-h200-2:4009:4009 [0] NCCL INFO Bootstrap : Using eth0:10.0.40.96<0> msm-h200-2:4009:4009 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation msm-h200-2:4009:4557 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [8]mlx5_8:1/RoCE [RO]; OOB eth0:10.0.40.96<0> msm-h200-2:4009:4557 [0] NCCL INFO Using non-device net plugin version 0 msm-h200-2:4009:4557 [0] NCCL INFO Using network IB msm-h200-2:4013:4013 [4] NCCL INFO cudaDriverVersion 12040 msm-h200-2:4013:4013 [4] NCCL INFO Bootstrap : Using eth0:10.0.40.96<0> msm-h200-2:4013:4013 [4] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation msm-h200-2:4016:4016 [7] NCCL INFO cudaDriverVersion 12040 msm-h200-2:4016:4016 [7] NCCL INFO Bootstrap : Using eth0:10.0.40.96<0> msm-h200-2:4016:4016 [7] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation msm-h200-2:4011:4011 [2] NCCL INFO cudaDriverVersion 12040 msm-h200-2:4011:4011 [2] NCCL INFO Bootstrap : Using eth0:10.0.40.96<0> msm-h200-2:4011:4011 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation msm-h200-2:4015:4015 [6] NCCL INFO cudaDriverVersion 12040 msm-h200-2:4015:4015 [6] NCCL INFO Bootstrap : Using eth0:10.0.40.96<0> msm-h200-2:4015:4015 [6] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation 
msm-h200-2:4014:4014 [5] NCCL INFO cudaDriverVersion 12040 msm-h200-2:4014:4014 [5] NCCL INFO Bootstrap : Using eth0:10.0.40.96<0> msm-h200-2:4014:4014 [5] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation msm-h200-2:4010:4010 [1] NCCL INFO cudaDriverVersion 12040 msm-h200-2:4010:4010 [1] NCCL INFO Bootstrap : Using eth0:10.0.40.96<0> msm-h200-2:4010:4010 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation msm-h200-2:4013:4577 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [8]mlx5_8:1/RoCE [RO]; OOB eth0:10.0.40.96<0> msm-h200-2:4013:4577 [4] NCCL INFO Using non-device net plugin version 0 msm-h200-2:4013:4577 [4] NCCL INFO Using network IB msm-h200-2:4012:4012 [3] NCCL INFO cudaDriverVersion 12040 msm-h200-2:4012:4012 [3] NCCL INFO Bootstrap : Using eth0:10.0.40.96<0> msm-h200-2:4012:4012 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation msm-h200-2:4016:4580 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [8]mlx5_8:1/RoCE [RO]; OOB eth0:10.0.40.96<0> msm-h200-2:4016:4580 [7] NCCL INFO Using non-device net plugin version 0 msm-h200-2:4016:4580 [7] NCCL INFO Using network IB msm-h200-2:4015:4582 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [8]mlx5_8:1/RoCE [RO]; OOB eth0:10.0.40.96<0> msm-h200-2:4015:4582 [6] NCCL INFO Using non-device net plugin version 0 
msm-h200-2:4015:4582 [6] NCCL INFO Using network IB msm-h200-2:4011:4581 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [8]mlx5_8:1/RoCE [RO]; OOB eth0:10.0.40.96<0> msm-h200-2:4011:4581 [2] NCCL INFO Using non-device net plugin version 0 msm-h200-2:4011:4581 [2] NCCL INFO Using network IB msm-h200-2:4010:4584 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [8]mlx5_8:1/RoCE [RO]; OOB eth0:10.0.40.96<0> msm-h200-2:4010:4584 [1] NCCL INFO Using non-device net plugin version 0 msm-h200-2:4010:4584 [1] NCCL INFO Using network IB msm-h200-2:4014:4583 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [8]mlx5_8:1/RoCE [RO]; OOB eth0:10.0.40.96<0> msm-h200-2:4014:4583 [5] NCCL INFO Using non-device net plugin version 0 msm-h200-2:4014:4583 [5] NCCL INFO Using network IB msm-h200-2:4012:4594 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [8]mlx5_8:1/RoCE [RO]; OOB eth0:10.0.40.96<0> msm-h200-2:4012:4594 [3] NCCL INFO Using non-device net plugin version 0 msm-h200-2:4012:4594 [3] NCCL INFO Using network IB msm-h200-2:4016:4580 [7] NCCL INFO comm 0x562a40b3c180 rank 15 nranks 16 cudaDev 7 nvmlDev 7 busId a7000 commId 0x982b74f14551db09 - Init START msm-h200-2:4011:4581 [2] NCCL INFO comm 0x561c9922db90 rank 10 nranks 16 cudaDev 2 nvmlDev 2 busId 65000 commId 0x982b74f14551db09 - Init START msm-h200-2:4012:4594 [3] NCCL INFO comm 0x564bda190830 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 67000 commId 0x982b74f14551db09 - Init START msm-h200-2:4015:4582 [6] NCCL INFO comm 0x55e0d02bafe0 rank 14 nranks 16 cudaDev 6 
nvmlDev 6 busId a5000 commId 0x982b74f14551db09 - Init START msm-h200-2:4010:4584 [1] NCCL INFO comm 0x56248e9155e0 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId 63000 commId 0x982b74f14551db09 - Init START msm-h200-2:4013:4577 [4] NCCL INFO comm 0x5612a180cdd0 rank 12 nranks 16 cudaDev 4 nvmlDev 4 busId a1000 commId 0x982b74f14551db09 - Init START msm-h200-2:4014:4583 [5] NCCL INFO comm 0x5585c915a640 rank 13 nranks 16 cudaDev 5 nvmlDev 5 busId a3000 commId 0x982b74f14551db09 - Init START msm-h200-2:4009:4557 [0] NCCL INFO comm 0x55f70b0f5230 rank 8 nranks 16 cudaDev 0 nvmlDev 0 busId 61000 commId 0x982b74f14551db09 - Init START msm-h200-2:4012:4594 [3] NCCL INFO MNNVL busId 0x67000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 msm-h200-2:4015:4582 [6] NCCL INFO MNNVL busId 0xa5000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 msm-h200-2:4009:4557 [0] NCCL INFO MNNVL busId 0x61000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 msm-h200-2:4014:4583 [5] NCCL INFO MNNVL busId 0xa3000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 msm-h200-2:4010:4584 [1] NCCL INFO MNNVL busId 0x63000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 msm-h200-2:4011:4581 [2] NCCL INFO MNNVL busId 0x65000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 msm-h200-2:4016:4580 [7] NCCL INFO MNNVL busId 0xa7000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 msm-h200-2:4013:4577 [4] NCCL INFO MNNVL busId 0xa1000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0 msm-h200-2:4011:4581 [2] NCCL INFO Setting affinity for GPU 2 to 03ffffff,ffffffff,ffffffff msm-h200-2:4011:4581 [2] NCCL INFO NVLS multicast support is available on dev 2 msm-h200-2:4012:4594 [3] NCCL INFO Setting affinity for GPU 3 to 03ffffff,ffffffff,ffffffff msm-h200-2:4012:4594 [3] NCCL INFO NVLS multicast support is available on dev 3 msm-h200-2:4013:4577 [4] NCCL INFO Setting affinity for GPU 4 to 0fffff,ffffffff,ffffffff,fc000000,00000000,00000000 msm-h200-2:4013:4577 [4] NCCL INFO NVLS 
multicast support is available on dev 4 msm-h200-2:4014:4583 [5] NCCL INFO Setting affinity for GPU 5 to 0fffff,ffffffff,ffffffff,fc000000,00000000,00000000 msm-h200-2:4014:4583 [5] NCCL INFO NVLS multicast support is available on dev 5 msm-h200-2:4010:4584 [1] NCCL INFO Setting affinity for GPU 1 to 03ffffff,ffffffff,ffffffff msm-h200-2:4010:4584 [1] NCCL INFO NVLS multicast support is available on dev 1 msm-h200-2:4016:4580 [7] NCCL INFO Setting affinity for GPU 7 to 0fffff,ffffffff,ffffffff,fc000000,00000000,00000000 msm-h200-2:4016:4580 [7] NCCL INFO NVLS multicast support is available on dev 7 msm-h200-2:4009:4557 [0] NCCL INFO Setting affinity for GPU 0 to 03ffffff,ffffffff,ffffffff msm-h200-2:4009:4557 [0] NCCL INFO NVLS multicast support is available on dev 0 msm-h200-2:4015:4582 [6] NCCL INFO Setting affinity for GPU 6 to 0fffff,ffffffff,ffffffff,fc000000,00000000,00000000 msm-h200-2:4015:4582 [6] NCCL INFO NVLS multicast support is available on dev 6 msm-h200-2:4009:4557 [0] NCCL INFO comm 0x55f70b0f5230 rank 8 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0 msm-h200-2:4010:4584 [1] NCCL INFO comm 0x56248e9155e0 rank 9 nRanks 16 nNodes 2 localRanks 8 localRank 1 MNNVL 0 msm-h200-2:4011:4581 [2] NCCL INFO comm 0x561c9922db90 rank 10 nRanks 16 nNodes 2 localRanks 8 localRank 2 MNNVL 0 msm-h200-2:4010:4584 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] 10/-1/-1->9->1 [2] -1/-1/-1->9->8 [3] 10/-1/-1->9->8 [4] 10/-1/-1->9->8 [5] 10/-1/-1->9->8 [6] 10/-1/-1->9->8 [7] 10/-1/-1->9->8 [8] 10/-1/-1->9->8 [9] 10/1/-1->9->-1 [10] -1/-1/-1->9->8 [11] 10/-1/-1->9->8 [12] 10/-1/-1->9->8 [13] 10/-1/-1->9->8 [14] 10/-1/-1->9->8 [15] 10/-1/-1->9->8 msm-h200-2:4009:4557 [0] NCCL INFO Trees [0] 9/-1/-1->8->0 [1] -1/-1/-1->8->15 [2] 9/-1/-1->8->15 [3] 9/-1/-1->8->11 [4] 9/-1/-1->8->15 [5] 9/-1/-1->8->15 [6] 9/-1/-1->8->15 [7] 9/-1/-1->8->14 [8] 9/0/-1->8->-1 [9] -1/-1/-1->8->15 [10] 9/-1/-1->8->15 [11] 9/-1/-1->8->11 [12] 9/-1/-1->8->15 [13] 9/-1/-1->8->15 [14] 9/-1/-1->8->15 
[15] 9/-1/-1->8->14 msm-h200-2:4010:4584 [1] NCCL INFO P2P Chunksize set to 131072 msm-h200-2:4009:4557 [0] NCCL INFO P2P Chunksize set to 131072 msm-h200-2:4012:4594 [3] NCCL INFO comm 0x564bda190830 rank 11 nRanks 16 nNodes 2 localRanks 8 localRank 3 MNNVL 0 msm-h200-2:4016:4580 [7] NCCL INFO comm 0x562a40b3c180 rank 15 nRanks 16 nNodes 2 localRanks 8 localRank 7 MNNVL 0 msm-h200-2:4015:4582 [6] NCCL INFO comm 0x55e0d02bafe0 rank 14 nRanks 16 nNodes 2 localRanks 8 localRank 6 MNNVL 0 msm-h200-2:4013:4577 [4] NCCL INFO comm 0x5612a180cdd0 rank 12 nRanks 16 nNodes 2 localRanks 8 localRank 4 MNNVL 0 msm-h200-2:4014:4583 [5] NCCL INFO comm 0x5585c915a640 rank 13 nRanks 16 nNodes 2 localRanks 8 localRank 5 MNNVL 0 msm-h200-2:4011:4581 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->9 [2] 11/-1/-1->10->2 [3] 12/-1/-1->10->9 [4] 11/-1/-1->10->9 [5] 11/-1/-1->10->9 [6] 11/-1/-1->10->9 [7] 11/-1/-1->10->9 [8] 11/-1/-1->10->9 [9] 11/-1/-1->10->9 [10] 11/2/-1->10->-1 [11] 12/-1/-1->10->9 [12] 11/-1/-1->10->9 [13] 11/-1/-1->10->9 [14] 11/-1/-1->10->9 [15] 11/-1/-1->10->9 msm-h200-2:4012:4594 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] 12/-1/-1->11->10 [3] 8/-1/-1->11->3 [4] -1/-1/-1->11->10 [5] 12/-1/-1->11->10 [6] 12/-1/-1->11->10 [7] -1/-1/-1->11->10 [8] 12/-1/-1->11->10 [9] 12/-1/-1->11->10 [10] 12/-1/-1->11->10 [11] 8/3/-1->11->-1 [12] -1/-1/-1->11->10 [13] 12/-1/-1->11->10 [14] 12/-1/-1->11->10 [15] -1/-1/-1->11->10 msm-h200-2:4011:4581 [2] NCCL INFO P2P Chunksize set to 131072 msm-h200-2:4012:4594 [3] NCCL INFO P2P Chunksize set to 131072 msm-h200-2:4016:4580 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] 8/-1/-1->15->14 [2] 8/-1/-1->15->14 [3] -1/-1/-1->15->14 [4] 8/-1/-1->15->14 [5] 8/-1/-1->15->14 [6] 8/-1/-1->15->14 [7] 12/-1/-1->15->7 [8] -1/-1/-1->15->14 [9] 8/-1/-1->15->14 [10] 8/-1/-1->15->14 [11] -1/-1/-1->15->14 [12] 8/-1/-1->15->14 [13] 8/-1/-1->15->14 [14] 8/-1/-1->15->14 [15] 12/7/-1->15->-1 msm-h200-2:4015:4582 [6] NCCL INFO 
Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->13 [4] 15/-1/-1->14->13 [5] 15/-1/-1->14->13 [6] 15/-1/-1->14->6 [7] 8/-1/-1->14->13 [8] 15/-1/-1->14->13 [9] 15/-1/-1->14->13 [10] 15/-1/-1->14->13 [11] 15/-1/-1->14->13 [12] 15/-1/-1->14->13 [13] 15/-1/-1->14->13 [14] 15/6/-1->14->-1 [15] 8/-1/-1->14->13 msm-h200-2:4016:4580 [7] NCCL INFO P2P Chunksize set to 131072 msm-h200-2:4013:4577 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->11 [3] 13/-1/-1->12->10 [4] 13/-1/-1->12->4 [5] -1/-1/-1->12->11 [6] 13/-1/-1->12->11 [7] 13/-1/-1->12->15 [8] 13/-1/-1->12->11 [9] 13/-1/-1->12->11 [10] 13/-1/-1->12->11 [11] 13/-1/-1->12->10 [12] 13/4/-1->12->-1 [13] -1/-1/-1->12->11 [14] 13/-1/-1->12->11 [15] 13/-1/-1->12->15 msm-h200-2:4015:4582 [6] NCCL INFO P2P Chunksize set to 131072 msm-h200-2:4014:4583 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] 14/-1/-1->13->12 [4] 14/-1/-1->13->12 [5] 14/-1/-1->13->5 [6] -1/-1/-1->13->12 [7] 14/-1/-1->13->12 [8] 14/-1/-1->13->12 [9] 14/-1/-1->13->12 [10] 14/-1/-1->13->12 [11] 14/-1/-1->13->12 [12] 14/-1/-1->13->12 [13] 14/5/-1->13->-1 [14] -1/-1/-1->13->12 [15] 14/-1/-1->13->12 msm-h200-2:4013:4577 [4] NCCL INFO P2P Chunksize set to 131072 msm-h200-2:4014:4583 [5] NCCL INFO P2P Chunksize set to 131072 msm-h200-2:4010:4584 [1] NCCL INFO Channel 03/0 : 9[1] -> 15[7] via P2P/CUMEM msm-h200-2:4009:4557 [0] NCCL INFO Channel 07/0 : 8[0] -> 12[4] via P2P/CUMEM msm-h200-2:4010:4584 [1] NCCL INFO Channel 11/0 : 9[1] -> 15[7] via P2P/CUMEM msm-h200-2:4009:4557 [0] NCCL INFO Channel 15/0 : 8[0] -> 12[4] via P2P/CUMEM msm-h200-2:4012:4594 [3] NCCL INFO Channel 02/0 : 11[3] -> 2[2] [send] via NET/IB/2 msm-h200-2:4012:4594 [3] NCCL INFO Channel 10/0 : 11[3] -> 2[2] [send] via NET/IB/2 msm-h200-2:4011:4581 [2] NCCL INFO Channel 02/0 : 3[3] -> 10[2] [receive] via NET/IB/2 msm-h200-2:4011:4581 [2] NCCL INFO Channel 10/0 : 3[3] -> 10[2] [receive] 
via NET/IB/2 msm-h200-2:4011:4581 [2] NCCL INFO Channel 01/0 : 10[2] -> 1[1] [send] via NET/IB/3 msm-h200-2:4011:4581 [2] NCCL INFO Channel 09/0 : 10[2] -> 1[1] [send] via NET/IB/3 msm-h200-2:4014:4583 [5] NCCL INFO Channel 05/0 : 6[6] -> 13[5] [receive] via NET/IB/7 msm-h200-2:4014:4583 [5] NCCL INFO Channel 13/0 : 6[6] -> 13[5] [receive] via NET/IB/7 msm-h200-2:4015:4582 [6] NCCL INFO Channel 06/0 : 7[7] -> 14[6] [receive] via NET/IB/6 msm-h200-2:4012:4594 [3] NCCL INFO Channel 03/0 : 0[0] -> 11[3] [receive] via NET/IB/4 msm-h200-2:4014:4583 [5] NCCL INFO Channel 04/0 : 13[5] -> 4[4] [send] via NET/IB/5 msm-h200-2:4012:4594 [3] NCCL INFO Channel 11/0 : 0[0] -> 11[3] [receive] via NET/IB/4 msm-h200-2:4015:4582 [6] NCCL INFO Channel 14/0 : 7[7] -> 14[6] [receive] via NET/IB/6 msm-h200-2:4014:4583 [5] NCCL INFO Channel 12/0 : 13[5] -> 4[4] [send] via NET/IB/5 msm-h200-2:4015:4582 [6] NCCL INFO Channel 05/0 : 14[6] -> 5[5] [send] via NET/IB/7 msm-h200-2:4015:4582 [6] NCCL INFO Channel 13/0 : 14[6] -> 5[5] [send] via NET/IB/7 msm-h200-2:4010:4584 [1] NCCL INFO Channel 01/0 : 2[2] -> 9[1] [receive] via NET/IB/3 msm-h200-2:4010:4584 [1] NCCL INFO Channel 09/0 : 2[2] -> 9[1] [receive] via NET/IB/3 msm-h200-2:4010:4584 [1] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [send] via NET/IB/1 msm-h200-2:4010:4584 [1] NCCL INFO Channel 08/0 : 9[1] -> 0[0] [send] via NET/IB/1 msm-h200-2:4013:4577 [4] NCCL INFO Channel 04/0 : 5[5] -> 12[4] [receive] via NET/IB/5 msm-h200-2:4009:4557 [0] NCCL INFO Channel 00/0 : 1[1] -> 8[0] [receive] via NET/IB/1 msm-h200-2:4013:4577 [4] NCCL INFO Channel 12/0 : 5[5] -> 12[4] [receive] via NET/IB/5 msm-h200-2:4009:4557 [0] NCCL INFO Channel 08/0 : 1[1] -> 8[0] [receive] via NET/IB/1 msm-h200-2:4009:4557 [0] NCCL INFO Channel 00/0 : 8[0] -> 15[7] via P2P/CUMEM msm-h200-2:4009:4557 [0] NCCL INFO Channel 01/0 : 8[0] -> 15[7] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 07/0 : 12[4] -> 7[7] [send] via NET/IB/8 msm-h200-2:4013:4577 [4] NCCL INFO 
Channel 15/0 : 12[4] -> 7[7] [send] via NET/IB/8 msm-h200-2:4009:4557 [0] NCCL INFO Channel 02/0 : 8[0] -> 15[7] via P2P/CUMEM msm-h200-2:4009:4557 [0] NCCL INFO Channel 04/0 : 8[0] -> 15[7] via P2P/CUMEM msm-h200-2:4009:4557 [0] NCCL INFO Channel 05/0 : 8[0] -> 15[7] via P2P/CUMEM msm-h200-2:4009:4557 [0] NCCL INFO Channel 06/0 : 8[0] -> 15[7] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 07/0 : 13[5] -> 11[3] via P2P/CUMEM msm-h200-2:4009:4557 [0] NCCL INFO Channel 08/0 : 8[0] -> 15[7] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 15/0 : 13[5] -> 11[3] via P2P/CUMEM msm-h200-2:4009:4557 [0] NCCL INFO Channel 09/0 : 8[0] -> 15[7] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 06/0 : 15[7] -> 6[6] [send] via NET/IB/6 msm-h200-2:4016:4580 [7] NCCL INFO Channel 14/0 : 15[7] -> 6[6] [send] via NET/IB/6 msm-h200-2:4009:4557 [0] NCCL INFO Channel 10/0 : 8[0] -> 15[7] via P2P/CUMEM msm-h200-2:4009:4557 [0] NCCL INFO Channel 12/0 : 8[0] -> 15[7] via P2P/CUMEM msm-h200-2:4011:4581 [2] NCCL INFO Channel 00/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4009:4557 [0] NCCL INFO Channel 13/0 : 8[0] -> 15[7] via P2P/CUMEM msm-h200-2:4011:4581 [2] NCCL INFO Channel 02/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4009:4557 [0] NCCL INFO Channel 14/0 : 8[0] -> 15[7] via P2P/CUMEM msm-h200-2:4011:4581 [2] NCCL INFO Channel 03/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4011:4581 [2] NCCL INFO Channel 04/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4011:4581 [2] NCCL INFO Channel 05/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 07/0 : 4[4] -> 15[7] [receive] via NET/IB/8 msm-h200-2:4011:4581 [2] NCCL INFO Channel 06/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4009:4557 [0] NCCL INFO Channel 03/0 : 8[0] -> 3[3] [send] via NET/IB/4 msm-h200-2:4016:4580 [7] NCCL INFO Channel 15/0 : 4[4] -> 15[7] [receive] via NET/IB/8 msm-h200-2:4009:4557 [0] NCCL INFO Channel 11/0 : 8[0] -> 3[3] [send] via NET/IB/4 msm-h200-2:4011:4581 [2] NCCL INFO 
Channel 07/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 00/0 : 15[7] -> 14[6] via P2P/CUMEM msm-h200-2:4011:4581 [2] NCCL INFO Channel 08/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 01/0 : 15[7] -> 14[6] via P2P/CUMEM msm-h200-2:4011:4581 [2] NCCL INFO Channel 10/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 03/0 : 12[4] -> 8[0] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 02/0 : 15[7] -> 14[6] via P2P/CUMEM msm-h200-2:4011:4581 [2] NCCL INFO Channel 11/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 11/0 : 12[4] -> 8[0] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 03/0 : 15[7] -> 14[6] via P2P/CUMEM msm-h200-2:4011:4581 [2] NCCL INFO Channel 12/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 04/0 : 15[7] -> 14[6] via P2P/CUMEM msm-h200-2:4011:4581 [2] NCCL INFO Channel 13/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 05/0 : 15[7] -> 14[6] via P2P/CUMEM msm-h200-2:4011:4581 [2] NCCL INFO Channel 14/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 00/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 07/0 : 15[7] -> 14[6] via P2P/CUMEM msm-h200-2:4011:4581 [2] NCCL INFO Channel 15/0 : 10[2] -> 9[1] via P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 01/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 08/0 : 15[7] -> 14[6] via P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 02/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 09/0 : 15[7] -> 14[6] via P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 03/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 10/0 : 15[7] -> 14[6] via P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 04/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 11/0 : 15[7] -> 14[6] via 
P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 06/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 12/0 : 15[7] -> 14[6] via P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 07/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 13/0 : 15[7] -> 14[6] via P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 08/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4016:4580 [7] NCCL INFO Channel 15/0 : 15[7] -> 14[6] via P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 09/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 10/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4010:4584 [1] NCCL INFO Channel 01/0 : 9[1] -> 8[0] via P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 11/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4010:4584 [1] NCCL INFO Channel 02/0 : 9[1] -> 8[0] via P2P/CUMEM msm-h200-2:4012:4594 [3] NCCL INFO Channel 00/0 : 11[3] -> 10[2] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 00/0 : 12[4] -> 11[3] via P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 12/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4010:4584 [1] NCCL INFO Channel 04/0 : 9[1] -> 8[0] via P2P/CUMEM msm-h200-2:4012:4594 [3] NCCL INFO Channel 01/0 : 11[3] -> 10[2] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 01/0 : 12[4] -> 11[3] via P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 14/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4010:4584 [1] NCCL INFO Channel 05/0 : 9[1] -> 8[0] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 00/0 : 13[5] -> 12[4] via P2P/CUMEM msm-h200-2:4012:4594 [3] NCCL INFO Channel 03/0 : 11[3] -> 10[2] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 02/0 : 12[4] -> 11[3] via P2P/CUMEM msm-h200-2:4015:4582 [6] NCCL INFO Channel 15/0 : 14[6] -> 13[5] via P2P/CUMEM msm-h200-2:4010:4584 [1] NCCL INFO Channel 06/0 : 9[1] -> 8[0] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 01/0 : 13[5] -> 12[4] via P2P/CUMEM msm-h200-2:4012:4594 [3] 
NCCL INFO Channel 04/0 : 11[3] -> 10[2] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 04/0 : 12[4] -> 11[3] via P2P/CUMEM msm-h200-2:4010:4584 [1] NCCL INFO Channel 07/0 : 9[1] -> 8[0] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 02/0 : 13[5] -> 12[4] via P2P/CUMEM msm-h200-2:4012:4594 [3] NCCL INFO Channel 05/0 : 11[3] -> 10[2] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 05/0 : 12[4] -> 11[3] via P2P/CUMEM msm-h200-2:4010:4584 [1] NCCL INFO Channel 09/0 : 9[1] -> 8[0] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 03/0 : 13[5] -> 12[4] via P2P/CUMEM msm-h200-2:4012:4594 [3] NCCL INFO Channel 06/0 : 11[3] -> 10[2] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 06/0 : 12[4] -> 11[3] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 05/0 : 13[5] -> 12[4] via P2P/CUMEM msm-h200-2:4010:4584 [1] NCCL INFO Channel 10/0 : 9[1] -> 8[0] via P2P/CUMEM msm-h200-2:4012:4594 [3] NCCL INFO Channel 07/0 : 11[3] -> 10[2] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 08/0 : 12[4] -> 11[3] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 06/0 : 13[5] -> 12[4] via P2P/CUMEM msm-h200-2:4010:4584 [1] NCCL INFO Channel 12/0 : 9[1] -> 8[0] via P2P/CUMEM msm-h200-2:4012:4594 [3] NCCL INFO Channel 08/0 : 11[3] -> 10[2] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 09/0 : 12[4] -> 11[3] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 08/0 : 13[5] -> 12[4] via P2P/CUMEM msm-h200-2:4010:4584 [1] NCCL INFO Channel 13/0 : 9[1] -> 8[0] via P2P/CUMEM msm-h200-2:4012:4594 [3] NCCL INFO Channel 09/0 : 11[3] -> 10[2] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 10/0 : 12[4] -> 11[3] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 09/0 : 13[5] -> 12[4] via P2P/CUMEM msm-h200-2:4010:4584 [1] NCCL INFO Channel 14/0 : 9[1] -> 8[0] via P2P/CUMEM msm-h200-2:4012:4594 [3] NCCL INFO Channel 11/0 : 11[3] -> 10[2] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 12/0 : 12[4] -> 
11[3] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 10/0 : 13[5] -> 12[4] via P2P/CUMEM msm-h200-2:4010:4584 [1] NCCL INFO Channel 15/0 : 9[1] -> 8[0] via P2P/CUMEM msm-h200-2:4012:4594 [3] NCCL INFO Channel 12/0 : 11[3] -> 10[2] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 11/0 : 13[5] -> 12[4] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 13/0 : 12[4] -> 11[3] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 13/0 : 13[5] -> 12[4] via P2P/CUMEM msm-h200-2:4012:4594 [3] NCCL INFO Channel 13/0 : 11[3] -> 10[2] via P2P/CUMEM msm-h200-2:4013:4577 [4] NCCL INFO Channel 14/0 : 12[4] -> 11[3] via P2P/CUMEM msm-h200-2:4014:4583 [5] NCCL INFO Channel 14/0 : 13[5] -> 12[4] via P2P/CUMEM
msm-h200-2:4015:4654 [6] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4015:4654 [6] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4015:4654 [6] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4015:4654 [6] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4014:4656 [5] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4014:4656 [5] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4014:4656 [5] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4014:4656 [5] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4015:4654 [6] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4015:4654 [6] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4015:4654 [6] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4015:4654 [6] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4014:4656 [5] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4014:4656 [5] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4014:4656 [5] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4014:4656 [5] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4012:4594 [3] NCCL INFO Channel 14/0 : 11[3] -> 10[2] via P2P/CUMEM
msm-h200-2:4012:4594 [3] NCCL INFO Channel 15/0 : 11[3] -> 10[2] via P2P/CUMEM
msm-h200-2:4012:4652 [3] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4012:4652 [3] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4012:4652 [3] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4011:4650 [2] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4012:4652 [3] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4011:4650 [2] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4011:4650 [2] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4011:4650 [2] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4012:4652 [3] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4012:4652 [3] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4012:4652 [3] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4012:4652 [3] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4011:4650 [2] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4011:4650 [2] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4011:4650 [2] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4011:4650 [2] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4013:4655 [4] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4013:4655 [4] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4013:4655 [4] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4013:4655 [4] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4013:4655 [4] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4013:4655 [4] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4013:4655 [4] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4013:4655 [4] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4010:4649 [1] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4010:4649 [1] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4010:4649 [1] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4010:4649 [1] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4010:4649 [1] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4010:4649 [1] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4010:4649 [1] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4010:4649 [1] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4016:4653 [7] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4016:4653 [7] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4016:4653 [7] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4016:4653 [7] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4016:4653 [7] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4016:4653 [7] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4016:4653 [7] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4016:4653 [7] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4009:4658 [0] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4009:4658 [0] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4009:4658 [0] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4009:4658 [0] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4011:4581 [2] NCCL INFO transport/net.cc:304 -> 2
msm-h200-2:4011:4581 [2] NCCL INFO transport.cc:165 -> 2
msm-h200-2:4011:4581 [2] NCCL INFO init.cc:1222 -> 2
msm-h200-2:4011:4581 [2] NCCL INFO init.cc:1501 -> 2
msm-h200-2:4011:4581 [2] NCCL INFO group.cc:64 -> 2 [Async thread]
msm-h200-2:4011:4650 [2] proxy.cc:1533 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection
msm-h200-2:4011:4650 [2] proxy.cc:1567 NCCL WARN [Proxy Service 10] Failed to execute operation Connect from rank 10, retcode 3
msm-h200-2:4011:4011 [2] NCCL INFO group.cc:418 -> 2
msm-h200-2:4011:4011 [2] NCCL INFO init.cc:1876 -> 2
msm-h200-2:4013:4577 [4] NCCL INFO transport/net.cc:304 -> 2
msm-h200-2:4013:4577 [4] NCCL INFO transport.cc:165 -> 2
msm-h200-2:4013:4577 [4] NCCL INFO init.cc:1222 -> 2
msm-h200-2:4013:4577 [4] NCCL INFO init.cc:1501 -> 2
msm-h200-2:4013:4577 [4] NCCL INFO group.cc:64 -> 2 [Async thread]
msm-h200-2:4013:4655 [4] proxy.cc:1533 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection
msm-h200-2:4013:4655 [4] proxy.cc:1567 NCCL WARN [Proxy Service 12] Failed to execute operation Connect from rank 12, retcode 3
msm-h200-2:4013:4013 [4] NCCL INFO group.cc:418 -> 2
msm-h200-2:4013:4013 [4] NCCL INFO init.cc:1876 -> 2
msm-h200-2:4016:4580 [7] NCCL INFO transport/net.cc:304 -> 2
msm-h200-2:4016:4580 [7] NCCL INFO transport.cc:165 -> 2
msm-h200-2:4016:4580 [7] NCCL INFO init.cc:1222 -> 2
msm-h200-2:4016:4580 [7] NCCL INFO init.cc:1501 -> 2
msm-h200-2:4016:4580 [7] NCCL INFO group.cc:64 -> 2 [Async thread]
msm-h200-2:4016:4653 [7] proxy.cc:1533 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection
msm-h200-2:4016:4016 [7] NCCL INFO group.cc:418 -> 2
msm-h200-2:4016:4016 [7] NCCL INFO init.cc:1876 -> 2
msm-h200-2:4016:4653 [7] proxy.cc:1567 NCCL WARN [Proxy Service 15] Failed to execute operation Connect from rank 15, retcode 3
msm-h200-2:4011:4011 [2] NCCL INFO comm 0x561c9922db90 rank 10 nranks 16 cudaDev 2 busId 65000 - Abort COMPLETE
msm-h200-2:4013:4013 [4] NCCL INFO comm 0x5612a180cdd0 rank 12 nranks 16 cudaDev 4 busId a1000 - Abort COMPLETE
msm-h200-2:4016:4016 [7] NCCL INFO comm 0x562a40b3c180 rank 15 nranks 16 cudaDev 7 busId a7000 - Abort COMPLETE
msm-h200-2:4009:4658 [0] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
msm-h200-2:4009:4658 [0] NCCL INFO transport/net_ib.cc:659 -> 2
msm-h200-2:4009:4658 [0] NCCL INFO transport/net_ib.cc:795 -> 2
msm-h200-2:4009:4658 [0] NCCL INFO transport/net.cc:683 -> 2
msm-h200-2:4010:4584 [1] misc/ipcsocket.cc:221 NCCL WARN UDS: Sending data over socket /tmp/nccl-socket-10-fc8ff03bac1d8c64 failed : Connection refused (111)
msm-h200-2:4010:4584 [1] NCCL INFO proxy.cc:1106 -> 2
msm-h200-2:4010:4584 [1] proxy.cc:1117 NCCL WARN ncclProxyCallBlockingUDS call to tpRank 10(fc8ff03bac1d8c64) failed : 2
msm-h200-2:4010:4584 [1] NCCL INFO proxy.cc:1127 -> 2
msm-h200-2:4010:4584 [1] proxy.cc:1135 NCCL WARN ncclProxyClientGetFd call to tpRank 10 handle 0x7fd04c042d40 failed : 2
msm-h200-2:4010:4584 [1] NCCL INFO transport/p2p.cc:246 -> 2
msm-h200-2:4010:4584 [1] NCCL INFO transport/p2p.cc:327 -> 2
msm-h200-2:4010:4584 [1] NCCL INFO transport/p2p.cc:507 -> 2
msm-h200-2:4010:4584 [1] NCCL INFO transport.cc:183 -> 2
msm-h200-2:4010:4584 [1] NCCL INFO init.cc:1222 -> 2
msm-h200-2:4010:4584 [1] NCCL INFO init.cc:1501 -> 2
msm-h200-2:4010:4584 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
msm-h200-2:4009:4557 [0] misc/ipcsocket.cc:221 NCCL WARN UDS: Sending data over socket /tmp/nccl-socket-12-1b066d02856b37b9 failed : Connection refused (111)
msm-h200-2:4009:4557 [0] NCCL INFO proxy.cc:1106 -> 2
msm-h200-2:4009:4557 [0] proxy.cc:1117 NCCL WARN ncclProxyCallBlockingUDS call to tpRank 12(1b066d02856b37b9) failed : 2
msm-h200-2:4009:4557 [0] NCCL INFO proxy.cc:1127 -> 2
msm-h200-2:4009:4557 [0] proxy.cc:1135 NCCL WARN ncclProxyClientGetFd call to tpRank 12 handle 0x7fc3f0022580 failed : 2
msm-h200-2:4010:4649 [1] proxy.cc:1533 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection
msm-h200-2:4009:4557 [0] NCCL INFO transport/p2p.cc:246 -> 2
msm-h200-2:4010:4010 [1] NCCL INFO group.cc:418 -> 2
msm-h200-2:4009:4557 [0] NCCL INFO transport/p2p.cc:327 -> 2
msm-h200-2:4010:4010 [1] NCCL INFO init.cc:1876 -> 2
msm-h200-2:4009:4557 [0] NCCL INFO transport/p2p.cc:507 -> 2
msm-h200-2:4009:4557 [0] NCCL INFO transport.cc:183 -> 2
msm-h200-2:4010:4649 [1] proxy.cc:1567 NCCL WARN [Proxy Service 9] Failed to execute operation Connect from rank 9, retcode 3
msm-h200-2:4009:4557 [0] NCCL INFO init.cc:1222 -> 2
msm-h200-2:4009:4557 [0] NCCL INFO init.cc:1501 -> 2
msm-h200-2:4009:4557 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
msm-h200-2:4009:4658 [0] proxy.cc:1533 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection
msm-h200-2:4009:4658 [0] proxy.cc:1567 NCCL WARN [Proxy Service 8] Failed to execute operation Connect from rank 8, retcode 3
msm-h200-2:4009:4009 [0] NCCL INFO group.cc:418 -> 2
msm-h200-2:4009:4009 [0] NCCL INFO init.cc:1876 -> 2
terminate called after throwing an instance of 'EPException'
what(): Failed: CUDA error /sharedata/msm/workspace/DeepEP/csrc/deep_ep.cpp:89 'an illegal memory access was encountered'
terminate called after throwing an instance of 'EPException'
what(): Failed: CUDA error /sharedata/msm/workspace/DeepEP/csrc/deep_ep.cpp:89 'an illegal memory access was encountered'
terminate called after throwing an instance of 'EPException'
what(): Failed: CUDA error /sharedata/msm/workspace/DeepEP/csrc/deep_ep.cpp:89 'an illegal memory access was encountered'
W0509 14:28:06.553000 139961896613696 torch/multiprocessing/spawn.py:146] Terminating process 4009 via signal SIGTERM
W0509 14:28:06.553000 139961896613696 torch/multiprocessing/spawn.py:146] Terminating process 4010 via signal SIGTERM
W0509 14:28:06.554000 139961896613696 torch/multiprocessing/spawn.py:146] Terminating process 4012 via signal SIGTERM
W0509 14:28:06.556000 139961896613696 torch/multiprocessing/spawn.py:146] Terminating process 4013 via signal SIGTERM
W0509 14:28:06.556000 139961896613696 torch/multiprocessing/spawn.py:146] Terminating process 4014 via signal SIGTERM
W0509 14:28:06.556000 139961896613696 torch/multiprocessing/spawn.py:146] Terminating process 4015 via signal SIGTERM
W0509 14:28:06.557000 139961896613696 torch/multiprocessing/spawn.py:146] Terminating process 4016 via signal SIGTERM
Traceback (most recent call last):
File "/sharedata/msm/workspace/DeepEP/tests/test_internode.py", line 247, in
-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/aigc/miniforge3/envs/mamba/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
    fn(i, *args)
  File "/sharedata/msm/workspace/DeepEP/tests/test_internode.py", line 229, in test_loop
    buffer = deep_ep.Buffer(group, int(1e9), int(1e9), low_latency_mode=test_ll_compatibility,
  File "/home/aigc/miniforge3/envs/mamba/lib/python3.10/site-packages/deep_ep-1.0.0+007fcfc-py3.10-linux-x86_64.egg/deep_ep/buffer.py", line 59, in __init__
    dist.all_gather_object(device_ids, local_device_id, group)
  File "/home/aigc/miniforge3/envs/mamba/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/home/aigc/miniforge3/envs/mamba/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2506, in all_gather_object
    all_gather(object_size_list, local_size, group=group)
  File "/home/aigc/miniforge3/envs/mamba/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/home/aigc/miniforge3/envs/mamba/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3108, in all_gather
    work = group.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
[Proxy Service 10] Failed to execute operation Connect from rank 10, retcode 3
msm-h200-2:4015:4654 [6] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
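The first real failure in this log is ibv_create_cq returning "Cannot allocate memory"; the proxy Connect errors and the "Connection refused" UDS errors that follow are most likely knock-on effects of it. You should find out why ibv_create_cq failed to allocate memory.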
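One common cause of ENOMEM from ibverbs calls is a low locked-memory limit (what `ulimit -l` reports) for the user that launches the job. That is only an assumption about this machine, not something the log above confirms. A minimal sketch to check the limit from the same Python environment on each node; the helper name `describe_memlock` is made up for illustration:

```python
import resource


def describe_memlock(soft, hard):
    """Return (is_unlimited, message) for a pair of RLIMIT_MEMLOCK values in bytes."""
    def fmt(value):
        # RLIM_INFINITY means no limit; otherwise show the value in KiB like `ulimit -l`.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value // 1024} KiB"

    is_unlimited = soft == resource.RLIM_INFINITY
    return is_unlimited, f"RLIMIT_MEMLOCK soft={fmt(soft)} hard={fmt(hard)}"


# Query the limit that applies to this process (and to ranks spawned from it).
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
ok, message = describe_memlock(soft, hard)
print(message)
if not ok:
    # Assumption: RDMA workloads usually want this limit to be unlimited.
    print("Locked-memory limit is finite; consider raising it and rerunning the test.")
```

If the soft limit turns out to be finite, raising it (for example in /etc/security/limits.conf or in the container runtime settings) and rerunning with NCCL_DEBUG=INFO would show whether ibv_create_cq still fails. If the limit is already unlimited, the cause lies elsewhere and this check only rules one possibility out.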