mpirun nccl-test hang
I compiled with MPI=1, but it hangs when I run the following:
ubuntu@ip-172-32-45-72:~/latest-drivers/nccl-tests$ /opt/amazon/openmpi/bin/mpirun -np 2 -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
It hangs at this point, with no output.
I can see from top that both processes are busy the whole time:
12405 ubuntu 20 0 540652 18376 10436 R 100.5 0.0 0:04.27 all_reduce_perf
12404 ubuntu 20 0 540652 18292 10352 R 100.0 0.0 0:04.27 all_reduce_perf
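Since both ranks are spinning at 100% CPU, a stack trace would show where they are stuck. A minimal sketch (assuming gdb is installed and the PID is taken from the top output above):

```shell
# Substitute the PID of one of the hung all_reduce_perf processes from top.
PID=12405
if kill -0 "$PID" 2>/dev/null; then
    # Dump backtraces of every thread, then detach without killing the rank.
    sudo gdb -p "$PID" -batch -ex "thread apply all bt"
else
    echo "no such process: $PID"
fi
```

If the traces show the ranks inside NCCL's bootstrap or inside MPI_Init, the hang is in setup rather than in the collective itself.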
it does not hang if I run
ubuntu@ip-172-32-45-72:~/latest-drivers/nccl-tests$ /opt/amazon/openmpi/bin/mpirun -np 1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
Using devices
Rank 0 Pid 12275 on ip-172-32-45-72 device 0 [0x00] Tesla V100-SXM2-32GB
out-of-place in-place
size count type redop time algbw busbw error time algbw busbw error
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 4.43 0.00 0.00 0e+00 0.31 0.03 0.00 0e+00
16 4 float sum 4.34 0.00 0.00 0e+00 0.31 0.05 0.00 0e+00
32 8 float sum 4.30 0.01 0.00 0e+00 0.30 0.11 0.00 0e+00
64 16 float sum 4.36 0.01 0.00 0e+00 0.30 0.21 0.00 0e+00
128 32 float sum 4.27 0.03 0.00 0e+00 0.30 0.42 0.00 0e+00
256 64 float sum 4.26 0.06 0.00 0e+00 0.30 0.85 0.00 0e+00
512 128 float sum 4.36 0.12 0.00 0e+00 0.30 1.70 0.00 0e+00
1024 256 float sum 4.33 0.24 0.00 0e+00 0.30 3.38 0.00 0e+00
2048 512 float sum 4.32 0.47 0.00 0e+00 0.31 6.66 0.00 0e+00
4096 1024 float sum 4.27 0.96 0.00 0e+00 0.30 13.47 0.00 0e+00
8192 2048 float sum 4.36 1.88 0.00 0e+00 0.30 26.96 0.00 0e+00
16384 4096 float sum 4.30 3.81 0.00 0e+00 0.30 53.84 0.00 0e+00
32768 8192 float sum 4.27 7.67 0.00 0e+00 0.30 108.23 0.00 0e+00
65536 16384 float sum 4.36 15.04 0.00 0e+00 0.30 215.47 0.00 0e+00
131072 32768 float sum 4.35 30.13 0.00 0e+00 0.30 430.24 0.00 0e+00
262144 65536 float sum 4.37 59.96 0.00 0e+00 0.30 864.73 0.00 0e+00
524288 131072 float sum 4.34 120.92 0.00 0e+00 0.30 1728.04 0.00 0e+00
1048576 262144 float sum 6.10 171.98 0.00 0e+00 0.30 3444.73 0.00 0e+00
2097152 524288 float sum 8.78 238.96 0.00 0e+00 0.31 6866.90 0.00 0e+00
4194304 1048576 float sum 14.64 286.48 0.00 0e+00 0.30 13751.82 0.00 0e+00
8388608 2097152 float sum 25.98 322.90 0.00 0e+00 0.30 27786.05 0.00 0e+00
16777216 4194304 float sum 47.48 353.35 0.00 0e+00 0.30 55942.70 0.00 0e+00
33554432 8388608 float sum 90.58 370.45 0.00 0e+00 0.30 111107.39 0.00 0e+00
67108864 16777216 float sum 176.7 379.73 0.00 0e+00 0.30 221517.95 0.00 0e+00
134217728 33554432 float sum 349.2 384.30 0.00 0e+00 0.30 443841.69 0.00 0e+00
Out of bounds values : 0 OK
Avg bus bandwidth : 0
I am able to ssh to localhost. I found that when I remove the localhost IP from the hostfile (replacing it with another machine's IP), the problem does not appear; but if I add the localhost IP to the hostfile, the mpirun nccl-test hangs and CPU usage stays high. I have tried several approaches to narrow down the problem:
- First, I ran a simple test with mpirun:
mpirun -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 hostname
It succeeded and printed:
10.xx.3.239
10.x.1.147
- Then I ran the command:
mpirun -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x NCCL_DEBUG=INFO --mca btl tcp,self --mca oob_tcp_if_include eth0 all_reduce_perf -b 8 -e 8G -g 1 -f 2
This failed as well: it hangs without printing anything.
- Because there was no debug output, I guessed the problem happens before all_reduce_perf starts, so I enabled more debugging for mpirun:
mpirun -d --mca plm_base_verbose 5 --mca oob_base_verbose 10 --mca btl_base_verbose 30 -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x NCCL_DEBUG=INFO all_reduce_perf -b 8 -e 8G -g 1 -f 2
The debug output is:
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:5 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:5 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 45 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 298
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 298 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 298 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 298
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 298 BYTES ON SOCKET 18
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 298 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347858] procdir: /tmp/ompi.10-43-1-147.0/jf.43885/1/4
[10-43-1-147:1347858] jobdir: /tmp/ompi.10-43-1-147.0/jf.43885/1
[10-43-1-147:1347858] top: /tmp/ompi.10-43-1-147.0/jf.43885
[10-43-1-147:1347858] top: /tmp/ompi.10-43-1-147.0
[10-43-1-147:1347858] tmp: /tmp
[10-43-1-147:1347859] procdir: /tmp/ompi.10-43-1-147.0/jf.43885/1/5
[10-43-1-147:1347859] jobdir: /tmp/ompi.10-43-1-147.0/jf.43885/1
[10-43-1-147:1347859] top: /tmp/ompi.10-43-1-147.0/jf.43885
[10-43-1-147:1347859] top: /tmp/ompi.10-43-1-147.0
[10-43-1-147:1347859] tmp: /tmp
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 195
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 195 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 195 BYTES ON SOCKET 18
[10-43-1-147:1347858] mca: base: components_register: registering framework btl components
[10-43-1-147:1347858] mca: base: components_register: found loaded component self
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 195
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 195 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 120
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 120 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347859] mca: base: components_register: registering framework btl components
[10-43-1-147:1347859] mca: base: components_register: found loaded component self
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 395
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 395 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347858] mca: base: components_register: component self register function successful
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 195 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 120 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 395 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 493 BYTES ON SOCKET 18
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 493
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 493 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347858] mca: base: components_open: opening btl components
[10-43-1-147:1347858] mca: base: components_open: found loaded component self
[10-43-1-147:1347858] mca: base: components_open: component self open function successful
[10-43-1-147:1347858] select: initializing btl component self
[10-43-1-147:1347858] select: init of component self returned success
[10-43-1-147:1347859] mca: base: components_register: component self register function successful
[10-43-1-147:1347859] mca: base: components_open: opening btl components
[10-43-1-147:1347859] mca: base: components_open: found loaded component self
[10-43-1-147:1347859] mca: base: components_open: component self open function successful
[10-43-1-147:1347859] select: initializing btl component self
[10-43-1-147:1347859] select: init of component self returned success
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 5504
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 5504 BYTES FOR DEST [[43885,0],0] TAG 33
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 33 seq_num = -1
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:33 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:33 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 5504 BYTES ON SOCKET 18
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] OOB_SEND: rml_oob_send.c:265
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send to target [[43885,0],1] - attempt 0
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send known transport for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:tcp:send_nb to peer [[43885,0],1]:15 seq = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:198] processing send to peer [[43885,0],1]:15 seq_num = -1 via [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_nb: already connected to [[43885,0],1] - queueing for send
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:208] queue send to [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler called to send to peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler SENDING TO [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] oob:tcp:send_handler SENDING MSG
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] MESSAGE SEND COMPLETE TO [[43885,0],1] OF 1827 BYTES ON SOCKET 23
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler called for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler CONNECTED
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate new recv msg
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler read hdr
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate data region of size 1827
[10-43-1-147:1347854] [[43885,0],1] RECVD COMPLETE MESSAGE FROM [[43885,0],0] (ORIGIN [[43885,0],0]) OF 1827 BYTES FOR DEST [[43885,0],1] TAG 15
[10-43-1-147:1347854] [[43885,0],1] DELIVERING TO RML tag = 15 seq_num = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 29
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 29 BYTES FOR DEST [[43885,0],0] TAG 33
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 33 seq_num = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] OOB_SEND: rml_oob_send.c:265
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send to target [[43885,0],1] - attempt 0
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send known transport for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:tcp:send_nb to peer [[43885,0],1]:15 seq = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:198] processing send to peer [[43885,0],1]:15 seq_num = -1 via [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_nb: already connected to [[43885,0],1] - queueing for send
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:208] queue send to [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler called to send to peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler SENDING TO [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] oob:tcp:send_handler SENDING MSG
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] MESSAGE SEND COMPLETE TO [[43885,0],1] OF 80 BYTES ON SOCKET 23
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:33 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:33 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 29 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler called for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler CONNECTED
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate new recv msg
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler read hdr
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate data region of size 80
[10-43-1-147:1347854] [[43885,0],1] RECVD COMPLETE MESSAGE FROM [[43885,0],0] (ORIGIN [[43885,0],0]) OF 80 BYTES FOR DEST [[43885,0],1] TAG 15
[10-43-1-147:1347854] [[43885,0],1] DELIVERING TO RML tag = 15 seq_num = -1
So I checked the Open MPI code and found that the loop responsible for the hang is in orterun.c:
while (orte_event_base_active && completest.active) { // orte_event_base_active: always true, never becomes false
    opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
}
ORTE_ACQUIRE_OBJECT(orte_event_base_active);
The variable orte_event_base_active stays true and is never set to false, but I don't understand why this happens.
Finally, my environment is: CUDA 12.3, mpirun (Open MPI) 4.1.7a1, NCCL 2.20.3.
Please help me resolve this issue.
I have also submitted the same issue to the NCCL group: https://github.com/NVIDIA/nccl-tests/issues/26
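Since the hang only appears when the local node's own IP is in the hostfile, one thing worth checking is how the local hostname resolves: a loopback entry in /etc/hosts is a classic cause of Open MPI daemons treating the local node as unreachable. A diagnostic sketch (an assumption to verify, not a confirmed root cause):

```shell
# Compare the address the hostname resolves to against loopback; a 127.x
# or ::1 mapping here often needs fixing in /etc/hosts for multi-node runs.
addr=$(getent hosts "$(hostname)" | awk '{print $1; exit}')
case "$addr" in
    127.*|::1) echo "hostname resolves to loopback ($addr): check /etc/hosts" ;;
    "")        echo "hostname does not resolve: add it to /etc/hosts or DNS" ;;
    *)         echo "hostname resolves to $addr" ;;
esac
```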
Just realized that ~~this~~ the original issue happened on AWS. I will take a look.
@913871734 Sorry for the delayed response. To provide more information, could you kindly modify the issue following Open MPI's template? https://github.com/open-mpi/ompi/issues/new?assignees=&labels=&projects=&template=bug_report.md&title=
Is this output complete? According to it, only two processes (out of 8) are opening their BTLs. So either the output is incomplete, the other processes are blocked before that point, or they are using a different communication PML. Add '--mca pml ob1 --mca pml_base_verbose 50' to the mix to see what you get.
I followed your advice and got the following output:
root@ly-node-ip1-ip2-ip3-131:~# mpirun --mca pml ob1 --mca pml_base_verbose 50 --allow-run-as-root --hostfile hostfile -n 2 -N 1 -x NCCL_DEBUG=INFO all_reduce_perf -b 8 -e 128M -f 2 -g 1
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_register: registering framework pml components
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_register: found loaded component ob1
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_register: component ob1 register function successful
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_open: opening pml components
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_open: found loaded component ob1
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_open: component ob1 open function successful
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_register: registering framework pml components
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_register: found loaded component ob1
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_register: component ob1 register function successful
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_open: opening pml components
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_open: found loaded component ob1
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_open: component ob1 open function successful
[ly-node-ip1-ip2-ip3-132:3046324] select: initializing pml component ob1
[ly-node-ip1-ip2-ip3-132:3046324] select: init returned priority 20
[ly-node-ip1-ip2-ip3-132:3046324] selected ob1 best priority 20
[ly-node-ip1-ip2-ip3-132:3046324] select: component ob1 selected
[ly-node-ip1-ip2-ip3-131:3679492] select: initializing pml component ob1
[ly-node-ip1-ip2-ip3-131:3679492] select: init returned priority 20
[ly-node-ip1-ip2-ip3-131:3679492] selected ob1 best priority 20
[ly-node-ip1-ip2-ip3-131:3679492] select: component ob1 selected
[ly-node-ip1-ip2-ip3-132:3046324] check:select: checking my pml ob1 against process [[4120,1],0] pml ob1
[ly-node-ip1-ip2-ip3-131:3679492] check:select: PML check not necessary on self
[ly-node-ip1-ip2-ip3-132:3046324] *** An error occurred in MPI_Allgather
[ly-node-ip1-ip2-ip3-132:3046324] *** reported by process [270008321,1]
[ly-node-ip1-ip2-ip3-132:3046324] *** on communicator MPI_COMM_WORLD
[ly-node-ip1-ip2-ip3-132:3046324] *** MPI_ERR_INTERN: internal error
[ly-node-ip1-ip2-ip3-132:3046324] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ly-node-ip1-ip2-ip3-132:3046324] *** and potentially your MPI job)
[ly-node-ip1-ip2-ip3-131:3679486] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 0
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[ly-node-ip1-ip2-ip3-131:3679492] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[ly-node-ip1-ip2-ip3-131:3679486] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[ly-node-ip1-ip2-ip3-131:3679486] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
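The "Hcol library init failed" line suggests the HCOLL collective component is involved in the failure. As a diagnostic (an assumption worth testing, not a confirmed fix), hcoll can be excluded via Open MPI's standard MCA parameter file so the run falls back to the built-in collectives:

```
# ~/.openmpi/mca-params.conf - per-user Open MPI defaults.
# Exclude the hcoll collective component; if the MPI_Allgather internal
# error disappears, the problem lies in the HCOLL/UCX stack.
coll = ^hcoll
```

The same exclusion can be passed per-run with --mca coll ^hcoll on the mpirun command line.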
AWS provides instructions for running the NCCL tests here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl.html
In "Step 9: Test your EFA and NCCL configuration" it provides example commands for the Open MPI mpirun CLI. For example:
/opt/amazon/openmpi/bin/mpirun \
-x FI_EFA_USE_DEVICE_RDMA=1 \
-x LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH \
-x NCCL_DEBUG=INFO \
--hostfile my-hosts -n 8 -N 8 \
--mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
$HOME/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
You might need to change the NCCL/nccl-tests paths depending on your installation.
Could you give that a try?
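Before running the NCCL test itself, the same AWS guide includes a step for confirming that the EFA provider is visible to libfabric; a quick check along those lines (assuming the EFA software is installed under the default paths):

```shell
# List the EFA libfabric provider; if this prints nothing or errors out,
# the EFA stack is not set up and NCCL will fall back to TCP (or hang).
if command -v fi_info >/dev/null 2>&1; then
    fi_info -p efa -t FI_EP_RDM
else
    echo "fi_info not found: install the EFA software first"
fi
```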
It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.