
Transport retry count exceeded on mlx5_0:1/IB -- uct_ib_mlx5_completion_with_err()

Open weiguangcui opened this issue 4 years ago • 8 comments

Describe the bug

A large GIZMO simulation running on 2048 nodes with Open MPI 4.0.4rc3 (HPC-X 2.7.0 with hcoll) over UCX 1.9.0 aborts with "Transport retry count exceeded" fatal errors on mlx5_0:1/IB during MPI_Allreduce:

> [g07r1n14:27251:0:27251] ib_mlx5_log.c:143  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
> [g07r1n14:27251:0:27251] ib_mlx5_log.c:143  DCI QP 0x10f3a wqe[446]: SEND s-e [rqpn 0x32ca rlid 19352] [inl len 20]
> [i12r2n04:30536:0:30536] ib_mlx5_log.c:143  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
> [i12r2n04:30536:0:30536] ib_mlx5_log.c:143  DCI QP 0x17c58 wqe[384]: SEND s-e [rqpn 0x32ca rlid 19352] [inl len 20]
> 
> /public/home/weiguang/software/ucx-1.9.0/src/uct/ib/mlx5/ib_mlx5_log.c: [ uct_ib_mlx5_completion_with_err() ]
> 
>       133     }
>       134 
>       135     ucs_log(log_level,
> ==>   136             "%s on "UCT_IB_IFACE_FMT"/%s (synd 0x%x vend 0x%x hw_synd %d/%d)\n"
>       137             "%s QP 0x%x wqe[%d]: %s",
>       138             err_info, UCT_IB_IFACE_ARG(iface),
>       139             uct_ib_iface_is_roce(iface) ? "RoCE" : "IB",
> ==== backtrace (tid:  25928) ====
>  0 0x00000000000576d3 ucs_debug_print_backtrace()  /public/home/weiguang/software/ucx-1.9.0/src/ucs/debug/debug.c:656
>  1 0x000000000002152c uct_ib_mlx5_completion_with_err()  /public/home/weiguang/software/ucx-1.9.0/src/uct/ib/mlx5/ib_mlx5_log.c:136
>  2 0x0000000000061686 uct_ib_mlx5_poll_cq()  /public/home/weiguang/software/ucx-1.9.0/src/uct/ib/mlx5/ib_mlx5.inl:81
>  3 0x0000000000061686 uct_dc_mlx5_iface_progress_tm()  /public/home/weiguang/software/ucx-1.9.0/src/uct/ib/dc/dc_mlx5.c:261
>  4 0x000000000002ab6a ucs_callbackq_dispatch()  /public/home/weiguang/software/ucx-1.9.0/src/ucs/datastruct/callbackq.h:211
>  5 0x000000000002ab6a uct_worker_progress()  /public/home/weiguang/software/ucx-1.9.0/src/uct/api/uct.h:2346
>  6 0x000000000002ab6a ucp_worker_progress()  /public/home/weiguang/software/ucx-1.9.0/src/ucp/core/ucp_worker.c:2040
>  7 0x0000000000016246 hmca_bcol_ucx_p2p_progress_fast()  bcol_ucx_p2p_component.c:0
>  8 0x000000000008cfd9 hmca_bcol_ucx_p2p_allreduce_knomial_progress()  ???:0
>  9 0x0000000000022478 _coll_ml_allreduce()  coll_ml_allreduce.c:0
> 10 0x0000000000008343 mca_coll_hcoll_allreduce()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64/ompi-v4.0.4/ompi/mca/coll/hcoll/coll_hcoll_ops.c:228
> 11 0x000000000005ee4c PMPI_Allreduce()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64/ompi-v4.0.4/ompi/mpi/c/profile/pallreduce.c:113
> 12 0x000000000005ee4c opal_obj_update()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64/ompi-v4.0.4/ompi/mpi/c/profile/../../../../opal/class/opal_object.h:513
> 13 0x000000000005ee4c PMPI_Allreduce()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64/ompi-v4.0.4/ompi/mpi/c/profile/pallreduce.c:116
> 14 0x000000000049453f fofrad_slab()  ???:0
> 15 0x000000000041632d calculate_non_standard_physics()  ???:0
> 16 0x0000000000419bb5 run()  ???:0
> 17 0x0000000000403ad1 main()  ???:0
> 18 0x00000000000223d5 __libc_start_main()  ???:0
> 
> [g07r1n14:27251] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2aecf85c05d0]
> [g07r1n14:27251] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aecf8803207]
> [g07r1n14:27251] [ 2] /lib64/libc.so.6(abort+0x148)[0x2aecf88048f8]
> [g07r1n14:27251] [ 3] /public/home/weiguang/.local/lib/libucs.so.0(ucs_fatal_error_message+0x55)[0x2aed09993405]
> [g07r1n14:27251] [ 4] /public/home/weiguang/.local/lib/libucs.so.0(+0x5b094)[0x2aed09998094]
> [g07r1n14:27251] [ 5] /public/home/weiguang/.local/lib/libucs.so.0(ucs_log_dispatch+0xe1)[0x2aed099981e1]
> [g07r1n14:27251] [ 6] /public/home/weiguang/.local/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x24c)[0x2aed0a50e52c]
> [g07r1n14:27251] [ 7] /public/home/weiguang/.local/lib/ucx/libuct_ib.so.0(+0x61686)[0x2aed0a54e686]
> [g07r1n14:27251] [ 8] /public/home/weiguang/.local/lib/libucp.so.0(ucp_worker_progress+0x3a)[0x2aed0948cb6a]
> [g07r1n14:27251] [ 9] /public/home/weiguang/software/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64/hcoll/lib/hcoll/hmca_bcol_ucx_p2p.so(+0x16246)[0x2aed46c27246]
> [g07r1n14:27251] [10] /public/home/weiguang/software/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64/hcoll/lib/hcoll/hmca_bcol_ucx_p2p.so(hmca_bcol_ucx_p2p_allreduce_knomial_progress+0x519)[0x2aed46c9dfd9]
> [g07r1n14:27251] [11] /public/home/weiguang/software/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64/hcoll/lib/libhcoll.so.1(+0x22478)[0x2aed0c9b9478]
> [g07r1n14:27251] [12] /public/home/weiguang/software/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64/ompi/lib/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_allreduce+0x123)[0x2aed0c790343]
> [g07r1n14:27251] [13] /public/home/weiguang/software/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64/ompi/lib/libmpi.so.40(PMPI_Allreduce+0x6c)[0x2aecf80f5e4c]
> [g07r1n14:27251] [14] /public/home/weiguang/HELUCI/M4096_z0.1/./GIZMO[0x49453f]
> [g07r1n14:27251] [15] /public/home/weiguang/HELUCI/M4096_z0.1/./GIZMO[0x41632d]
> [g07r1n14:27251] [16] /public/home/weiguang/HELUCI/M4096_z0.1/./GIZMO[0x419bb5]
> [g07r1n14:27251] [17] /public/home/weiguang/HELUCI/M4096_z0.1/./GIZMO[0x403ad1]
> [g07r1n14:27251] [18] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aecf87ef3d5]
> [g07r1n14:27251] [19] /public/home/weiguang/HELUCI/M4096_z0.1/./GIZMO[0x403bec]
> [g07r1n14:27251] *** End of error message ***

Steps to Reproduce

  • Command line: srun
  • UCX version (from ucx_info -v): 1.9.0, revision fcd1255, configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --with-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.0 --with-gdrcopy --prefix=/build-result/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64/ucx
  • UCX environment variables: none set in the job submission script

Setup and versions

  • OS version + CPU architecture: CentOS Linux release 7.6.1810 (Core), x86_64

  • For RDMA/IB/RoCE related issues:
    • Driver version (rpm -q libibverbs): libibverbs-41mlnx1-OFED.4.7.0.0.2.47100.x86_64
    • HW information (ibv_devinfo -vv):

> hca_id: mlx5_0  transport: InfiniBand (0)  fw_ver: 20.27.2008  node_guid: b859:9f03:0022:678a  sys_image_guid: b859:9f03:0022:678a
> vendor_id: 0x02c9  vendor_part_id: 4123  hw_ver: 0x0  board_id: MT_0000000222  phys_port_cnt: 1
> max_mr_size: 0xffffffffffffffff  page_size_cap: 0xfffffffffffff000  max_qp: 262144  max_qp_wr: 32768
> device_cap_flags: 0xe17e1c36  BAD_PKEY_CNTR BAD_QKEY_CNTR AUTO_PATH_MIG CHANGE_PHY_PORT PORT_ACTIVE_EVENT SYS_IMAGE_GUID RC_RNR_NAK_GEN XRC  Unknown flags: 0xe16e0000
> device_cap_exp_flags: 0x5648F8F100000000  EXP_DC_TRANSPORT EXP_CROSS_CHANNEL EXP_MR_ALLOCATE EXT_ATOMICS EXT_SEND NOP EXP_UMR EXP_ODP EXP_RX_CSUM_TCP_UDP_PKT EXP_RX_CSUM_IP_PKT EXP_DC_INFO EXP_MASKED_ATOMICS EXP_RX_TCP_UDP_PKT_TYPE EXP_PHYSICAL_RANGE_MR EXP_UMR_FIXED_SIZE EXP_PACKET_BASED_CREDIT_MODE  Unknown flags: 0x200000000000
> max_sge: 30  max_sge_rd: 30  max_cq: 16777216  max_cqe: 4194303  max_mr: 16777216  max_pd: 16777216
> max_qp_rd_atom: 16  max_ee_rd_atom: 0  max_res_rd_atom: 4194304  max_qp_init_rd_atom: 16  max_ee_init_rd_atom: 0
> atomic_cap: ATOMIC_HCA (1)  log atomic arg sizes (mask) 0x8  masked_log_atomic_arg_sizes (mask) 0x3c  masked_log_atomic_arg_sizes_network_endianness (mask) 0x34  max fetch and add bit boundary 64  log max atomic inline 5
> max_ee: 0  max_rdd: 0  max_mw: 16777216  max_raw_ipv6_qp: 0  max_raw_ethy_qp: 0  max_mcast_grp: 2097152  max_mcast_qp_attach: 240  max_total_mcast_qp_attach: 503316480  max_ah: 2147483647  max_fmr: 0
> max_srq: 8388608  max_srq_wr: 32767  max_srq_sge: 31  max_pkeys: 128  local_ca_ack_delay: 16  hca_core_clock: 156250
> max_klm_list_size: 65536  max_send_wqe_inline_klms: 20  max_umr_recursion_depth: 4  max_umr_stride_dimension: 1
> general_odp_caps: ODP_SUPPORT ODP_SUPPORT_IMPLICIT  max_size: 0xFFFFFFFFFFFFFFFF
> rc_odp_caps: SUPPORT_SEND SUPPORT_RECV SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ_RECV  uc_odp_caps: NO SUPPORT  ud_odp_caps: SUPPORT_SEND  dc_odp_caps: SUPPORT_SEND SUPPORT_WRITE SUPPORT_READ  xrc_odp_caps: NO SUPPORT  raw_eth_odp_caps: NO SUPPORT
> max_dct: 262144  max_device_ctx: 1020
> Multi-Packet RQ supported  Supported for objects type: IBV_EXP_MP_RQ_SUP_TYPE_SRQ_TM IBV_EXP_MP_RQ_SUP_TYPE_WQ_RQ  Supported payload shifts: 2 bytes  Log number of strides for single WQE: 3 - 16  Log number of bytes in single stride: 6 - 13
> rx_pad_end_addr_align: 64  tso_caps: max_tso: 0
> packet_pacing_caps: qp_rate_limit_min: 0kbps  qp_rate_limit_max: 0kbps
> ooo_caps: ooo_rc_caps = 0x1  ooo_xrc_caps = 0x1  ooo_dc_caps = 0x1  ooo_ud_caps = 0x0  SUPPORT_RC_RW_DATA_PLACEMENT SUPPORT_XRC_RW_DATA_PLACEMENT SUPPORT_DC_RW_DATA_PLACEMENT
> sw_parsing_caps: supported_qp:  max_rndv_hdr_size: 0x40  max_num_tags: 0x7f  max_ops: 0x8000  max_sge: 0x1  capability_flags: IBV_EXP_TM_CAP_RC IBV_EXP_TM_CAP_DC
> tunnel_offloads_caps:  UMR fixed size: max entity size: 2147483648
> Device ports:
> port: 1  state: PORT_ACTIVE (4)  max_mtu: 4096 (5)  active_mtu: 4096 (5)  sm_lid: 8  port_lid: 9457  port_lmc: 0x00  link_layer: InfiniBand
> max_msg_sz: 0x40000000  port_cap_flags: 0x2251e848  max_vl_num: 4 (3)  bad_pkey_cntr: 0x0  qkey_viol_cntr: 0x0  sm_sl: 0  pkey_tbl_len: 128  gid_tbl_len: 8  subnet_timeout: 18  init_type_reply: 0
> active_width: invalid widthX (16)  active_speed: 50.0 Gbps (64)  phys_state: LINK_UP (5)
> GID[ 0]: fe80:0000:0000:0000:b859:9f03:0022:678a

Additional information (depending on the issue)

  • Open MPI version: 4.0.4rc3

  • Output of ucx_info -d (transports and devices recognized by UCX):

> #
> # Memory domain: posix
> #     Component: posix
> #             allocate: unlimited
> #           remote key: 24 bytes
> #           rkey_ptr is supported
> #
> #   Transport: posix
> #      Device: memory
> #
> #      capabilities:
> #            bandwidth: 0.00/ppn + 12179.00 MB/sec
> #              latency: 80 nsec
> #             overhead: 10 nsec
> #            put_short: <= 4294967295
> #            put_bcopy: unlimited
> #            get_bcopy: unlimited
> #             am_short: <= 100
> #             am_bcopy: <= 8256
> #               domain: cpu
> #           atomic_add: 32, 64 bit
> #           atomic_and: 32, 64 bit
> #            atomic_or: 32, 64 bit
> #           atomic_xor: 32, 64 bit
> #          atomic_fadd: 32, 64 bit
> #          atomic_fand: 32, 64 bit
> #           atomic_for: 32, 64 bit
> #          atomic_fxor: 32, 64 bit
> #          atomic_swap: 32, 64 bit
> #         atomic_cswap: 32, 64 bit
> #           connection: to iface
> #      device priority: 0
> #     device num paths: 1
> #              max eps: inf
> #       device address: 8 bytes
> #        iface address: 8 bytes
> #       error handling: none
> #
> #
> # Memory domain: sysv
> #     Component: sysv
> #             allocate: unlimited
> #           remote key: 12 bytes
> #           rkey_ptr is supported
> #
> #   Transport: sysv
> #      Device: memory
> #
> #      capabilities:
> #            bandwidth: 0.00/ppn + 12179.00 MB/sec
> #              latency: 80 nsec
> #             overhead: 10 nsec
> #            put_short: <= 4294967295
> #            put_bcopy: unlimited
> #            get_bcopy: unlimited
> #             am_short: <= 100
> #             am_bcopy: <= 8256
> #               domain: cpu
> #           atomic_add: 32, 64 bit
> #           atomic_and: 32, 64 bit
> #            atomic_or: 32, 64 bit
> #           atomic_xor: 32, 64 bit
> #          atomic_fadd: 32, 64 bit
> #          atomic_fand: 32, 64 bit
> #           atomic_for: 32, 64 bit
> #          atomic_fxor: 32, 64 bit
> #          atomic_swap: 32, 64 bit
> #         atomic_cswap: 32, 64 bit
> #           connection: to iface
> #      device priority: 0
> #     device num paths: 1
> #              max eps: inf
> #       device address: 8 bytes
> #        iface address: 8 bytes
> #       error handling: none
> #
> #
> # Memory domain: self
> #     Component: self
> #             register: unlimited, cost: 0 nsec
> #           remote key: 0 bytes
> #
> #   Transport: self
> #      Device: memory
> #
> #      capabilities:
> #            bandwidth: 0.00/ppn + 6911.00 MB/sec
> #              latency: 0 nsec
> #             overhead: 10 nsec
> #            put_short: <= 4294967295
> #            put_bcopy: unlimited
> #            get_bcopy: unlimited
> #             am_short: <= 8K
> #             am_bcopy: <= 8K
> #               domain: cpu
> #           atomic_add: 32, 64 bit
> #           atomic_and: 32, 64 bit
> #            atomic_or: 32, 64 bit
> #           atomic_xor: 32, 64 bit
> #          atomic_fadd: 32, 64 bit
> #          atomic_fand: 32, 64 bit
> #           atomic_for: 32, 64 bit
> #          atomic_fxor: 32, 64 bit
> #          atomic_swap: 32, 64 bit
> #         atomic_cswap: 32, 64 bit
> #           connection: to iface
> #      device priority: 0
> #     device num paths: 1
> #              max eps: inf
> #       device address: 0 bytes
> #        iface address: 8 bytes
> #       error handling: none
> #
> #
> # Memory domain: tcp
> #     Component: tcp
> #             register: unlimited, cost: 0 nsec
> #           remote key: 0 bytes
> #
> #   Transport: tcp
> #      Device: enp97s0f0
> #
> #      capabilities:
> #            bandwidth: 1131.64/ppn + 0.00 MB/sec
> #              latency: 5258 nsec
> #             overhead: 50000 nsec
> #            put_zcopy: <= 18446744073709551590, up to 6 iov
> #  put_opt_zcopy_align: <= 1
> #        put_align_mtu: <= 0
> #             am_short: <= 8K
> #             am_bcopy: <= 8K
> #             am_zcopy: <= 64K, up to 6 iov
> #   am_opt_zcopy_align: <= 1
> #         am_align_mtu: <= 0
> #            am header: <= 8037
> #           connection: to iface
> #      device priority: 1
> #     device num paths: 1
> #              max eps: 256
> #       device address: 4 bytes
> #        iface address: 2 bytes
> #       error handling: none
> #
> #   Transport: tcp
> #      Device: enp3s0f0
> #
> #      capabilities:
> #            bandwidth: 113.16/ppn + 0.00 MB/sec
> #              latency: 5776 nsec
> #             overhead: 50000 nsec
> #            put_zcopy: <= 18446744073709551590, up to 6 iov
> #  put_opt_zcopy_align: <= 1
> #        put_align_mtu: <= 0
> #             am_short: <= 8K
> #             am_bcopy: <= 8K
> #             am_zcopy: <= 64K, up to 6 iov
> #   am_opt_zcopy_align: <= 1
> #         am_align_mtu: <= 0
> #            am header: <= 8037
> #           connection: to iface
> #      device priority: 1
> #     device num paths: 1
> #              max eps: 256
> #       device address: 4 bytes
> #        iface address: 2 bytes
> #       error handling: none
> #
> #   Transport: tcp
> #      Device: ib0
> #
> #      capabilities:
> #            bandwidth: 11142.51/ppn + 0.00 MB/sec
> #              latency: 5206 nsec
> #             overhead: 50000 nsec
> #            put_zcopy: <= 18446744073709551590, up to 6 iov
> #  put_opt_zcopy_align: <= 1
> #        put_align_mtu: <= 0
> #             am_short: <= 8K
> #             am_bcopy: <= 8K
> #             am_zcopy: <= 64K, up to 6 iov
> #   am_opt_zcopy_align: <= 1
> #         am_align_mtu: <= 0
> #            am header: <= 8037
> #           connection: to iface
> #      device priority: 1
> #     device num paths: 1
> #              max eps: 256
> #       device address: 4 bytes
> #        iface address: 2 bytes
> #       error handling: none
> #
> #
> # Connection manager: tcp
> #      max_conn_priv: 2032 bytes
> #
> # Memory domain: sockcm
> #     Component: sockcm
> #           supports client-server connection establishment via sockaddr
> #   < no supported devices found >
> #
> # Memory domain: mlx5_0
> #     Component: ib
> #             register: unlimited, cost: 180 nsec
> #           remote key: 8 bytes
> #           local memory handle is required for zcopy
> #
> #   Transport: rc_verbs
> #      Device: mlx5_0:1
> #
> #      capabilities:
> #            bandwidth: 3480.93/ppn + 0.00 MB/sec
> #              latency: 600 + 1.000 * N nsec
> #             overhead: 75 nsec
> #            put_short: <= 124
> #            put_bcopy: <= 8256
> #            put_zcopy: <= 1G, up to 8 iov
> #  put_opt_zcopy_align: <= 512
> #        put_align_mtu: <= 4K
> #            get_bcopy: <= 8256
> #            get_zcopy: 65..1G, up to 8 iov
> #  get_opt_zcopy_align: <= 512
> #        get_align_mtu: <= 4K
> #             am_short: <= 123
> #             am_bcopy: <= 8255
> #             am_zcopy: <= 8255, up to 7 iov
> #   am_opt_zcopy_align: <= 512
> #         am_align_mtu: <= 4K
> #            am header: <= 127
> #               domain: device
> #           atomic_add: 64 bit
> #          atomic_fadd: 64 bit
> #         atomic_cswap: 64 bit
> #           connection: to ep
> #      device priority: 50
> #     device num paths: 1
> #              max eps: 256
> #       device address: 3 bytes
> #           ep address: 5 bytes
> #       error handling: peer failure
> #
> #
> #   Transport: rc_mlx5
> #      Device: mlx5_0:1
> #
> #      capabilities:
> #            bandwidth: 3480.93/ppn + 0.00 MB/sec
> #              latency: 600 + 1.000 * N nsec
> #             overhead: 40 nsec
> #            put_short: <= 2K
> #            put_bcopy: <= 8256
> #            put_zcopy: <= 1G, up to 14 iov
> #  put_opt_zcopy_align: <= 512
> #        put_align_mtu: <= 4K
> #            get_bcopy: <= 8256
> #            get_zcopy: 65..1G, up to 14 iov
> #  get_opt_zcopy_align: <= 512
> #        get_align_mtu: <= 4K
> #             am_short: <= 2046
> #             am_bcopy: <= 8254
> #             am_zcopy: <= 8254, up to 3 iov
> #   am_opt_zcopy_align: <= 512
> #         am_align_mtu: <= 4K
> #            am header: <= 186
> #               domain: device
> #           atomic_add: 32, 64 bit
> #           atomic_and: 32, 64 bit
> #            atomic_or: 32, 64 bit
> #           atomic_xor: 32, 64 bit
> #          atomic_fadd: 32, 64 bit
> #          atomic_fand: 32, 64 bit
> #           atomic_for: 32, 64 bit
> #          atomic_fxor: 32, 64 bit
> #          atomic_swap: 32, 64 bit
> #         atomic_cswap: 32, 64 bit
> #           connection: to ep
> #      device priority: 50
> #     device num paths: 1
> #              max eps: 256
> #       device address: 3 bytes
> #           ep address: 7 bytes
> #       error handling: buffer (zcopy), remote access, peer failure
> #
> #
> #   Transport: dc_mlx5
> #      Device: mlx5_0:1
> #
> #      capabilities:
> #            bandwidth: 3480.93/ppn + 0.00 MB/sec
> #              latency: 660 nsec
> #             overhead: 40 nsec
> #            put_short: <= 2K
> #            put_bcopy: <= 8256
> #            put_zcopy: <= 1G, up to 11 iov
> #  put_opt_zcopy_align: <= 512
> #        put_align_mtu: <= 4K
> #            get_bcopy: <= 8256
> #            get_zcopy: 65..1G, up to 11 iov
> #  get_opt_zcopy_align: <= 512
> #        get_align_mtu: <= 4K
> #             am_short: <= 2046
> #             am_bcopy: <= 8254
> #             am_zcopy: <= 8254, up to 3 iov
> #   am_opt_zcopy_align: <= 512
> #         am_align_mtu: <= 4K
> #            am header: <= 138
> #               domain: device
> #           atomic_add: 32, 64 bit
> #           atomic_and: 32, 64 bit
> #            atomic_or: 32, 64 bit
> #           atomic_xor: 32, 64 bit
> #          atomic_fadd: 32, 64 bit
> #          atomic_fand: 32, 64 bit
> #           atomic_for: 32, 64 bit
> #          atomic_fxor: 32, 64 bit
> #          atomic_swap: 32, 64 bit
> #         atomic_cswap: 32, 64 bit
> #           connection: to iface
> #      device priority: 50
> #     device num paths: 1
> #              max eps: inf
> #       device address: 3 bytes
> #        iface address: 5 bytes
> #       error handling: buffer (zcopy), remote access, peer failure
> #
> #
> #   Transport: ud_verbs
> #      Device: mlx5_0:1
> #
> #      capabilities:
> #            bandwidth: 3480.93/ppn + 0.00 MB/sec
> #              latency: 630 nsec
> #             overhead: 105 nsec
> #             am_short: <= 116
> #             am_bcopy: <= 4088
> #             am_zcopy: <= 4088, up to 7 iov
> #   am_opt_zcopy_align: <= 512
> #         am_align_mtu: <= 4K
> #            am header: <= 3952
> #           connection: to ep, to iface
> #      device priority: 50
> #     device num paths: 1
> #              max eps: inf
> #       device address: 3 bytes
> #        iface address: 3 bytes
> #           ep address: 6 bytes
> #       error handling: peer failure
> #
> #
> #   Transport: ud_mlx5
> #      Device: mlx5_0:1
> #
> #      capabilities:
> #            bandwidth: 3480.93/ppn + 0.00 MB/sec
> #              latency: 630 nsec
> #             overhead: 80 nsec
> #             am_short: <= 180
> #             am_bcopy: <= 4088
> #             am_zcopy: <= 4088, up to 3 iov
> #   am_opt_zcopy_align: <= 512
> #         am_align_mtu: <= 4K
> #            am header: <= 132
> #           connection: to ep, to iface
> #      device priority: 50
> #     device num paths: 1
> #              max eps: inf
> #       device address: 3 bytes
> #        iface address: 3 bytes
> #           ep address: 6 bytes
> #       error handling: peer failure
> #
> #
> #   Transport: cm
> #      Device: mlx5_0:1
> #
> #      capabilities:
> #            bandwidth: 3480.93/ppn + 0.00 MB/sec
> #              latency: 600 nsec
> #             overhead: 1200 nsec
> #             am_bcopy: <= 214
> #           connection: to iface
> #      device priority: 50
> #     device num paths: 1
> #              max eps: inf
> #       device address: 3 bytes
> #        iface address: 4 bytes
> #       error handling: none
> #
> #
> # Memory domain: rdmacm
> #     Component: rdmacm
> #           supports client-server connection establishment via sockaddr
> #   < no supported devices found >
> #
> # Connection manager: rdmacm
> #      max_conn_priv: 54 bytes
> #
> # Memory domain: cma
> #     Component: cma
> #             register: unlimited, cost: 9 nsec
> #
> #   Transport: cma
> #      Device: memory
> #
> #      capabilities:
> #            bandwidth: 0.00/ppn + 11145.00 MB/sec
> #              latency: 80 nsec
> #             overhead: 400 nsec
> #            put_zcopy: unlimited, up to 16 iov
> #  put_opt_zcopy_align: <= 1
> #        put_align_mtu: <= 1
> #            get_zcopy: unlimited, up to 16 iov
> #  get_opt_zcopy_align: <= 1
> #        get_align_mtu: <= 1
> #           connection: to iface
> #      device priority: 0
> #     device num paths: 1
> #              max eps: inf
> #       device address: 8 bytes
> #        iface address: 4 bytes
> #       error handling: none
> #
> #
> # Memory domain: knem
> #     Component: knem
> #             register: unlimited, cost: 180 nsec
> #           remote key: 16 bytes
> #
> #   Transport: knem
> #      Device: memory
> #
> #      capabilities:
> #            bandwidth: 13862.00/ppn + 0.00 MB/sec
> #              latency: 80 nsec
> #             overhead: 250 nsec
> #            put_zcopy: unlimited, up to 16 iov
> #  put_opt_zcopy_align: <= 1
> #        put_align_mtu: <= 1
> #            get_zcopy: unlimited, up to 16 iov
> #  get_opt_zcopy_align: <= 1
> #        get_align_mtu: <= 1
> #           connection: to iface
> #      device priority: 0
> #     device num paths: 1
> #              max eps: inf
> #       device address: 8 bytes
> #        iface address: 0 bytes
> #       error handling: none
> #

weiguangcui avatar Apr 15 '21 15:04 weiguangcui

@weiguangcui

  1. Are you setting any UCX environment variables? Specifically, UCX_RC_MLX5_TM_ENABLE=y or UCX_DC_MLX5_TM_ENABLE=y?
  2. Can you please try adding "-mca coll ^hcoll" to the MPI command line?
  3. The FW, MLNX_OFED, and HPC-X versions are quite old. Would it be possible to upgrade to MLNX_OFED 5.2-2.2.0.0 and HPC-X 2.8.1? This would also upgrade the FW (and requires a reboot).
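For reference, a minimal sketch of how the suggestion in item 2 would look on an mpirun command line; the process count and the "./app" binary name are placeholders, not taken from this issue:

```shell
# '^hcoll' excludes the hcoll collectives component from Open MPI's MCA
# component selection, so the built-in collectives are used instead and
# hcoll is ruled out as the source of the error.
CMD="mpirun -np 64 -mca coll ^hcoll ./app"
echo "$CMD"
```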

yosefe avatar Apr 16 '21 11:04 yosefe

  1. No, not in my job submission bash script.
  2. Not possible; I am restricted to using srun, which doesn't accept that parameter.
  3. I have asked the admin; it seems unlikely this can be done in a short time.

weiguangcui avatar Apr 16 '21 13:04 weiguangcui

Can you please try setting the OMPI_MCA_coll=^hcoll environment variable for srun?
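A minimal sketch of this approach: Open MPI reads any OMPI_MCA_&lt;param&gt; environment variable as if it were passed with "-mca &lt;param&gt;", which works under launchers like srun that take no MCA flags. The srun line is a placeholder for the real job command:

```shell
# Exclude the hcoll collectives component via the environment; single quotes
# keep the shell from treating '^' specially.
export OMPI_MCA_coll='^hcoll'
echo "$OMPI_MCA_coll"
# then launch as usual, e.g.: srun ./app   (placeholder command)
```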

yosefe avatar Apr 16 '21 13:04 yosefe

I just got confirmation from the admin: both UCX_RC_MLX5_TM_ENABLE and UCX_DC_MLX5_TM_ENABLE are set to n by default. Does that matter? Should I set them to y?

weiguangcui avatar Apr 16 '21 14:04 weiguangcui

> I just got confirmation from the admin: both UCX_RC_MLX5_TM_ENABLE and UCX_DC_MLX5_TM_ENABLE are set to n by default. Does that matter? Should I set them to y?

No need to set them to y; I am just trying to narrow down the issue. According to the backtrace, it seems HW tag matching was enabled:

>  2 0x0000000000061686 uct_ib_mlx5_poll_cq()  /public/home/weiguang/software/ucx-1.9.0/src/uct/ib/mlx5/ib_mlx5.inl:81
>  3 0x0000000000061686 uct_dc_mlx5_iface_progress_tm()  /public/home/weiguang/software/ucx-1.9.0/src/uct/ib/dc/dc_mlx5.c:261
>  4 0x000000000002ab6a ucs_callbackq_dispatch()  /public/home/weiguang/software/ucx-1.9.0/src/ucs/datastruct/callbackq.h:211

Also, what is the scale of the job? Is it possible to try adding these env vars:

UCX_RC_MLX5_TX_NUM_GET_BYTES=256k
UCX_RC_MLX5_MAX_GET_ZCOPY=32k
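A sketch of setting these in the job script before the srun launch (srun passes the submitting shell's environment to the job by default); as these variables are documented, the first caps the total bytes of outstanding GET operations and the second caps the size of a single GET zcopy fragment:

```shell
# Suggested UCX tuning for the RC/mlx5-based transports: limit outstanding
# GET bytes and the maximum GET zcopy size before launching the job.
export UCX_RC_MLX5_TX_NUM_GET_BYTES=256k
export UCX_RC_MLX5_MAX_GET_ZCOPY=32k
echo "$UCX_RC_MLX5_TX_NUM_GET_BYTES $UCX_RC_MLX5_MAX_GET_ZCOPY"
# then: srun ./app   (placeholder command)
```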

yosefe avatar Apr 16 '21 14:04 yosefe

The simulation is very large, running on 2048 nodes (32 CPUs per node). I will try adding these parameters.

weiguangcui avatar Apr 16 '21 14:04 weiguangcui

Was there a solution for this?

ildar avatar May 15 '23 13:05 ildar

Was there a solution for this?

vikaskurapati avatar Sep 05 '25 13:09 vikaskurapati