
mpirun nccl-test hang

913871734 opened this issue 1 year ago · 1 comment

I compiled nccl-tests with MPI=1, but it hangs when I run the following:

ubuntu@ip-172-32-45-72:~/latest-drivers/nccl-tests$ /opt/amazon/openmpi/bin/mpirun -np 2 -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

It hangs at this point. From `top` I can see that both processes stay busy the whole time:

     12405 ubuntu    20   0  540652  18376  10436 R 100.5   0.0   0:04.27 all_reduce_perf
     12404 ubuntu    20   0  540652  18292  10352 R 100.0   0.0   0:04.27 all_reduce_perf

It does not hang if I run:

ubuntu@ip-172-32-45-72:~/latest-drivers/nccl-tests$ /opt/amazon/openmpi/bin/mpirun -np 1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
 nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1

 Using devices
   Rank  0 Pid  12275 on ip-172-32-45-72 device  0 [0x00] Tesla V100-SXM2-32GB

                                                     out-of-place                       in-place
       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum     4.43    0.00    0.00  0e+00     0.31    0.03    0.00  0e+00
          16             4   float     sum     4.34    0.00    0.00  0e+00     0.31    0.05    0.00  0e+00
          32             8   float     sum     4.30    0.01    0.00  0e+00     0.30    0.11    0.00  0e+00
          64            16   float     sum     4.36    0.01    0.00  0e+00     0.30    0.21    0.00  0e+00
         128            32   float     sum     4.27    0.03    0.00  0e+00     0.30    0.42    0.00  0e+00
         256            64   float     sum     4.26    0.06    0.00  0e+00     0.30    0.85    0.00  0e+00
         512           128   float     sum     4.36    0.12    0.00  0e+00     0.30    1.70    0.00  0e+00
        1024           256   float     sum     4.33    0.24    0.00  0e+00     0.30    3.38    0.00  0e+00
        2048           512   float     sum     4.32    0.47    0.00  0e+00     0.31    6.66    0.00  0e+00
        4096          1024   float     sum     4.27    0.96    0.00  0e+00     0.30   13.47    0.00  0e+00
        8192          2048   float     sum     4.36    1.88    0.00  0e+00     0.30   26.96    0.00  0e+00
       16384          4096   float     sum     4.30    3.81    0.00  0e+00     0.30   53.84    0.00  0e+00
       32768          8192   float     sum     4.27    7.67    0.00  0e+00     0.30  108.23    0.00  0e+00
       65536         16384   float     sum     4.36   15.04    0.00  0e+00     0.30  215.47    0.00  0e+00
      131072         32768   float     sum     4.35   30.13    0.00  0e+00     0.30  430.24    0.00  0e+00
      262144         65536   float     sum     4.37   59.96    0.00  0e+00     0.30  864.73    0.00  0e+00
      524288        131072   float     sum     4.34  120.92    0.00  0e+00     0.30  1728.04    0.00  0e+00
     1048576        262144   float     sum     6.10  171.98    0.00  0e+00     0.30  3444.73    0.00  0e+00
     2097152        524288   float     sum     8.78  238.96    0.00  0e+00     0.31  6866.90    0.00  0e+00
     4194304       1048576   float     sum    14.64  286.48    0.00  0e+00     0.30  13751.82    0.00  0e+00
     8388608       2097152   float     sum    25.98  322.90    0.00  0e+00     0.30  27786.05    0.00  0e+00
    16777216       4194304   float     sum    47.48  353.35    0.00  0e+00     0.30  55942.70    0.00  0e+00
    33554432       8388608   float     sum    90.58  370.45    0.00  0e+00     0.30  111107.39    0.00  0e+00
    67108864      16777216   float     sum    176.7  379.73    0.00  0e+00     0.30  221517.95    0.00  0e+00
   134217728      33554432   float     sum    349.2  384.30    0.00  0e+00     0.30  443841.69    0.00  0e+00
 Out of bounds values : 0 OK
 Avg bus bandwidth    : 0

I am able to ssh to localhost. I found that when I remove the localhost IP from the hostfile (replacing it with another device's IP), the problem does not appear. But if I add the localhost's IP to the hostfile, the mpirun nccl-test does not work: it hangs and CPU usage stays high. I have tried several approaches to narrow down the problem:

  1. First, I ran a simple test with mpirun: `mpirun -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 hostname`. It succeeded and printed: 10.xx.3.239 10.x.1.147

  2. Then I ran: `mpirun -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x NCCL_DEBUG=INFO --mca btl tcp,self --mca oob_tcp_if_include eth0 all_reduce_perf -b 8 -e 8G -g 1 -f 2`. This failed: it hangs without printing anything.

  3. Since there was no debug output, I guessed the problem happens before all_reduce_perf starts, so I enabled more verbose debugging for mpirun: `mpirun -d --mca plm_base_verbose 5 --mca oob_base_verbose 10 --mca btl_base_verbose 30 -v --allow-run-as-root --mca orte_base_help_aggregate 0 -n 8 -N 4 --hostfile hostfile -x NCCL_IB_GID_INDEX=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -x NCCL_DEBUG=INFO all_reduce_perf -b 8 -e 8G -g 1 -f 2`. The debug output is:

[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:5 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:5 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 45 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 298
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 298 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 298 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 298
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 298 BYTES ON SOCKET 18
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 298 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347858] procdir: /tmp/ompi.10-43-1-147.0/jf.43885/1/4
[10-43-1-147:1347858] jobdir: /tmp/ompi.10-43-1-147.0/jf.43885/1
[10-43-1-147:1347858] top: /tmp/ompi.10-43-1-147.0/jf.43885
[10-43-1-147:1347858] top: /tmp/ompi.10-43-1-147.0
[10-43-1-147:1347858] tmp: /tmp
[10-43-1-147:1347859] procdir: /tmp/ompi.10-43-1-147.0/jf.43885/1/5
[10-43-1-147:1347859] jobdir: /tmp/ompi.10-43-1-147.0/jf.43885/1
[10-43-1-147:1347859] top: /tmp/ompi.10-43-1-147.0/jf.43885
[10-43-1-147:1347859] top: /tmp/ompi.10-43-1-147.0
[10-43-1-147:1347859] tmp: /tmp
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 195
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 195 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 195 BYTES ON SOCKET 18
[10-43-1-147:1347858] mca: base: components_register: registering framework btl components
[10-43-1-147:1347858] mca: base: components_register: found loaded component self
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 195
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 195 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 120
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 120 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347859] mca: base: components_register: registering framework btl components
[10-43-1-147:1347859] mca: base: components_register: found loaded component self
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 395
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 395 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347858] mca: base: components_register: component self register function successful
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 195 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 120 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 395 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:2 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:2 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 493 BYTES ON SOCKET 18
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 493
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 493 BYTES FOR DEST [[43885,0],0] TAG 2
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 2 seq_num = -1
[10-43-1-147:1347858] mca: base: components_open: opening btl components
[10-43-1-147:1347858] mca: base: components_open: found loaded component self
[10-43-1-147:1347858] mca: base: components_open: component self open function successful
[10-43-1-147:1347858] select: initializing btl component self
[10-43-1-147:1347858] select: init of component self returned success
[10-43-1-147:1347859] mca: base: components_register: component self register function successful
[10-43-1-147:1347859] mca: base: components_open: opening btl components
[10-43-1-147:1347859] mca: base: components_open: found loaded component self
[10-43-1-147:1347859] mca: base: components_open: component self open function successful
[10-43-1-147:1347859] select: initializing btl component self
[10-43-1-147:1347859] select: init of component self returned success
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 5504
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 5504 BYTES FOR DEST [[43885,0],0] TAG 33
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 33 seq_num = -1
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:33 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:33 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 5504 BYTES ON SOCKET 18
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] OOB_SEND: rml_oob_send.c:265
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send to target [[43885,0],1] - attempt 0
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send known transport for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:tcp:send_nb to peer [[43885,0],1]:15 seq = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:198] processing send to peer [[43885,0],1]:15 seq_num = -1 via [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_nb: already connected to [[43885,0],1] - queueing for send
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:208] queue send to [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler called to send to peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler SENDING TO [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] oob:tcp:send_handler SENDING MSG
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] MESSAGE SEND COMPLETE TO [[43885,0],1] OF 1827 BYTES ON SOCKET 23
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler called for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler CONNECTED
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate new recv msg
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler read hdr
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate data region of size 1827
[10-43-1-147:1347854] [[43885,0],1] RECVD COMPLETE MESSAGE FROM [[43885,0],0] (ORIGIN [[43885,0],0]) OF 1827 BYTES FOR DEST [[43885,0],1] TAG 15
[10-43-1-147:1347854] [[43885,0],1] DELIVERING TO RML tag = 15 seq_num = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler called for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler CONNECTED
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate new recv msg
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler read hdr
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:tcp:recv:handler allocate data region of size 29
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] RECVD COMPLETE MESSAGE FROM [[43885,0],1] (ORIGIN [[43885,0],1]) OF 29 BYTES FOR DEST [[43885,0],0] TAG 33
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] DELIVERING TO RML tag = 33 seq_num = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] OOB_SEND: rml_oob_send.c:265
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send to target [[43885,0],1] - attempt 0
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:base:send known transport for peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] oob:tcp:send_nb to peer [[43885,0],1]:15 seq = -1
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:198] processing send to peer [[43885,0],1]:15 seq_num = -1 via [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_nb: already connected to [[43885,0],1] - queueing for send
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0]:[oob_tcp.c:208] queue send to [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler called to send to peer [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] tcp:send_handler SENDING TO [[43885,0],1]
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] oob:tcp:send_handler SENDING MSG
[dev-infra-rsjx02-h100-gpu8-10-43-3-239:3899938] [[43885,0],0] MESSAGE SEND COMPLETE TO [[43885,0],1] OF 80 BYTES ON SOCKET 23
[10-43-1-147:1347854] [[43885,0],1] OOB_SEND: rml_oob_send.c:265
[10-43-1-147:1347854] [[43885,0],1] oob:base:send to target [[43885,0],0] - attempt 0
[10-43-1-147:1347854] [[43885,0],1] oob:base:send known transport for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] oob:tcp:send_nb to peer [[43885,0],0]:33 seq = -1
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:198] processing send to peer [[43885,0],0]:33 seq_num = -1 via [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_nb: already connected to [[43885,0],0] - queueing for send
[10-43-1-147:1347854] [[43885,0],1]:[oob_tcp.c:208] queue send to [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler called to send to peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1] tcp:send_handler SENDING TO [[43885,0],0]
[10-43-1-147:1347854] oob:tcp:send_handler SENDING MSG
[10-43-1-147:1347854] [[43885,0],1] MESSAGE SEND COMPLETE TO [[43885,0],0] OF 29 BYTES ON SOCKET 18
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler called for peer [[43885,0],0]
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler CONNECTED
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate new recv msg
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler read hdr
[10-43-1-147:1347854] [[43885,0],1]:tcp:recv:handler allocate data region of size 80
[10-43-1-147:1347854] [[43885,0],1] RECVD COMPLETE MESSAGE FROM [[43885,0],0] (ORIGIN [[43885,0],0]) OF 80 BYTES FOR DEST [[43885,0],1] TAG 15
[10-43-1-147:1347854] [[43885,0],1] DELIVERING TO RML tag = 15 seq_num = -1

So I checked the Open MPI code and found that the condition causing the hang is in orterun.c:

    while (orte_event_base_active && completest.active) {   // orte_event_base_active is always true, never set to false
        opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
    }
    ORTE_ACQUIRE_OBJECT(orte_event_base_active);

The variable orte_event_base_active is always true; it is never changed to false. But I could not understand why this happens.

Finally, my environment is: CUDA 12.3, mpirun (Open MPI) 4.1.7a1, NCCL 2.20.3.
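Since the hang appears only when the local machine's own IP is in the hostfile, one low-level sanity check (an editor's suggestion, not part of the original report) is to confirm what the resolver returns for the local hostname. If the hostname maps only to a loopback address, remote daemons can be handed an unreachable callback address for the OOB TCP channel, which would be consistent with the symptom. A minimal probe:

```c
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Print every IPv4 address the local resolver returns for a hostname.
 * Run it with the machine's own hostname: if it resolves only to
 * 127.0.0.1 / 127.0.1.1, that loopback mapping (e.g. in /etc/hosts)
 * can break multi-node mpirun launches that include this host. */
int main(int argc, char **argv) {
    const char *name = (argc > 1) ? argv[1] : "localhost";
    struct addrinfo hints, *res, *p;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;       /* IPv4 only, to keep output simple */
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo(name, NULL, &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo(%s): %s\n", name, gai_strerror(rc));
        return 1;
    }
    for (p = res; p != NULL; p = p->ai_next) {
        char buf[INET_ADDRSTRLEN];
        struct sockaddr_in *sin = (struct sockaddr_in *)p->ai_addr;
        inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf));
        printf("%s -> %s\n", name, buf);
    }
    freeaddrinfo(res);
    return 0;
}
```

Compile with `cc probe.c -o probe` and run both `./probe localhost` and `./probe $(hostname)`; comparing the two outputs against the IPs listed in the hostfile shows whether the "localhost IP in hostfile" case is really resolving where mpirun expects.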

Please help me resolve this issue.

913871734 avatar Jun 14 '24 06:06 913871734

the link of same issue I have submit to nccl group is: https://github.com/NVIDIA/nccl-tests/issues/26

913871734 avatar Jun 14 '24 06:06 913871734

Just realized that ~~this~~ the original issue happened on AWS. I will take a look.

wenduwan avatar Jul 09 '24 16:07 wenduwan

@913871734 Sorry for the delayed response. To provide more information, could you kindly modify the issue following Open MPI's template? https://github.com/open-mpi/ompi/issues/new?assignees=&labels=&projects=&template=bug_report.md&title=

wenduwan avatar Jul 09 '24 16:07 wenduwan

Is this output complete? According to it, only two processes (out of 8) are opening their BTLs. So either the output is incomplete, or the other processes are blocked before that point, or they are using a different communication PML. Add `--mca pml ob1 --mca pml_base_verbose 50` to the mix to see what you get.

bosilca avatar Jul 09 '24 22:07 bosilca

> Is this output complete ? According to it only two processes (out of 8) are opening their BTLs. So either it is incomplete or the other processes are blocked before that, or using a different communication PML. Add '--mca pml ob1 --mca pml_base_verbose 50 ' to the mix to see what you get.

I followed your advice and got the following output:

root@ly-node-ip1-ip2-ip3-131:~# mpirun --mca pml ob1 --mca pml_base_verbose 50 --allow-run-as-root --hostfile hostfile -n 2 -N 1 -x NCCL_DEBUG=INFO all_reduce_perf -b 8 -e 128M -f 2 -g 1
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_register: registering framework pml components
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_register: found loaded component ob1
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_register: component ob1 register function successful
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_open: opening pml components
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_open: found loaded component ob1
[ly-node-ip1-ip2-ip3-131:3679492] mca: base: components_open: component ob1 open function successful
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_register: registering framework pml components
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_register: found loaded component ob1
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_register: component ob1 register function successful
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_open: opening pml components
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_open: found loaded component ob1
[ly-node-ip1-ip2-ip3-132:3046324] mca: base: components_open: component ob1 open function successful
[ly-node-ip1-ip2-ip3-132:3046324] select: initializing pml component ob1
[ly-node-ip1-ip2-ip3-132:3046324] select: init returned priority 20
[ly-node-ip1-ip2-ip3-132:3046324] selected ob1 best priority 20
[ly-node-ip1-ip2-ip3-132:3046324] select: component ob1 selected
[ly-node-ip1-ip2-ip3-131:3679492] select: initializing pml component ob1
[ly-node-ip1-ip2-ip3-131:3679492] select: init returned priority 20
[ly-node-ip1-ip2-ip3-131:3679492] selected ob1 best priority 20
[ly-node-ip1-ip2-ip3-131:3679492] select: component ob1 selected
[ly-node-ip1-ip2-ip3-132:3046324] check:select: checking my pml ob1 against process [[4120,1],0] pml ob1
[ly-node-ip1-ip2-ip3-131:3679492] check:select: PML check not necessary on self
[ly-node-ip1-ip2-ip3-132:3046324] *** An error occurred in MPI_Allgather
[ly-node-ip1-ip2-ip3-132:3046324] *** reported by process [270008321,1]
[ly-node-ip1-ip2-ip3-132:3046324] *** on communicator MPI_COMM_WORLD
[ly-node-ip1-ip2-ip3-132:3046324] *** MPI_ERR_INTERN: internal error
[ly-node-ip1-ip2-ip3-132:3046324] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ly-node-ip1-ip2-ip3-132:3046324] ***    and potentially your MPI job)
[ly-node-ip1-ip2-ip3-131:3679486] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[LOG_CAT_COMMPATTERNS]   isend failed in  comm_allreduce_pml at iterations 0

[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[ly-node-ip1-ip2-ip3-131:3679492] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[ly-node-ip1-ip2-ip3-131:3679486] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[ly-node-ip1-ip2-ip3-131:3679486] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

913871734 avatar Jul 10 '24 06:07 913871734

AWS provides instructions here to run NCCL tests https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl.html

In Step 9: Test your EFA and NCCL configuration, it provides example commands using the Open MPI mpirun CLI. For example:

/opt/amazon/openmpi/bin/mpirun \
-x FI_EFA_USE_DEVICE_RDMA=1 \
-x LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH \
-x NCCL_DEBUG=INFO \
--hostfile my-hosts -n 8 -N 8 \
--mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
$HOME/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

You might need to change NCCL/nccl-tests paths depending on your installation.

Could you give that a try?

wenduwan avatar Jul 10 '24 13:07 wenduwan

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions[bot] avatar Aug 08 '24 17:08 github-actions[bot]