
Core dumps from ucx_info -d

Open cniethammer opened this issue 2 years ago • 3 comments

Describe the bug

Segfaults/core dumps when running ucx_info -d

Steps to Reproduce

  • UCX 1.13.0 release, configured with ./configure --prefix=$HOME/bin/ucx-1.13.0
ucx_info -v
# Version 1.13.0
# Git branch '', revision 6765970
# Configured with: --prefix=$HOME/bin/ucx-1.13.0

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • CentOS Linux release 8.4.2105, Linux XXX 4.18.0-305.25.1.el8_4.x86_64 x86_64 x86_64 x86_64 GNU/Linux
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • libibverbs-41mlnx1-OFED.4.9.3.0.0.49417.x86_64
      • MLNX_OFED_LINUX-4.9-4.1.7.0
    • HW information from ibstat command
CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.32.1010
        Hardware version: 0
        Node GUID: 0xb88303ffff8dbf40
        System image GUID: 0xb88303ffff8dbf40
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 337
                LMC: 0
                SM lid: 12170
                Capability mask: 0x2651e848
                Port GUID: 0xb88303ffff8dbf40
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.32.1010
        Hardware version: 0
        Node GUID: 0xb88303ffff8dbf42
        System image GUID: 0xb88303ffff8dbf40
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 18
                LMC: 0
                SM lid: 12170
                Capability mask: 0x2641e848
                Port GUID: 0xb88303ffff8dbf42
                Link layer: InfiniBand
CA 'mlx5_2'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.28.1002
        Hardware version: 0
        Node GUID: 0xb88303ffffa3e29c
        System image GUID: 0xb88303ffffa3e29c
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xba8303fffea3e29c
                Link layer: Ethernet
CA 'mlx5_3'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.28.1002
        Hardware version: 0
        Node GUID: 0xb88303ffffa3e29d
        System image GUID: 0xb88303ffffa3e29c
        Port 1:
                State: Down
                Physical state: Disabled
                Rate: 10
                Base lid: 65535
                LMC: 0
                SM lid: 0
                Capability mask: 0x2651e848
                Port GUID: 0xb88303ffffa3e29d
                Link layer: InfiniBand

Additional information (depending on the issue)

  • Output of ucx_info -d and error message:
ib_exp_md.c:700  Assertion `!md->super.relaxed_order' failed

$HOME/src/ucx-1.13.0/src/uct/ib/mlx5/exp/ib_exp_md.c: [ uct_ib_mlx5_exp_md_open() ]
      ...
      697     md->super.ops = &uct_ib_mlx5_md_ops;
      698
      699     uct_ib_md_parse_relaxed_order(&md->super, md_config);
==>   700     ucs_assert(!md->super.relaxed_order);
      701     md->super.memh_struct_size = sizeof(uct_ib_mlx5_mem_t),
      702
      703     status = uct_ib_md_open_common(&md->super, ibv_device, md_config);

==== backtrace (tid:2160133) ====
 0 0x000000000002baa8 uct_ib_mlx5_exp_md_open()  $HOME/src/ucx-1.13.0/src/uct/ib/mlx5/exp/ib_exp_md.c:700
 1 0x0000000000025c2e uct_ib_md_open()  $HOME/src/ucx-1.13.0/src/uct/ib/base/ib_md.c:1547
 2 0x00000000000128ab uct_md_open()  $HOME/src/ucx-1.13.0/src/uct/base/uct_md.c:81
 3 0x0000000000406605 print_md_info()  $HOME/src/ucx-1.13.0/src/tools/info/tl_info.c:403
 4 0x0000000000406ce2 print_uct_component_info()  $HOME/src/ucx-1.13.0/src/tools/info/tl_info.c:591
 5 0x0000000000406e0a print_uct_info()  $HOME/src/ucx-1.13.0/src/tools/info/tl_info.c:617
 6 0x0000000000402a5c main()  $HOME/src/ucx-1.13.0/src/tools/info/ucx_info.c:261
 7 0x0000000000023493 __libc_start_main()  ???:0
 8 0x0000000000402b8e _start()  ???:0
=================================

While trying to debug this, I ran the tool under valgrind. In that case it works fine and prints the IB device information:

valgrind $HOME/bin/ucx_info -d
...
# Memory domain: mlx5_1
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#
#      Transport: dc_mlx5
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (0)
#
#      capabilities:
#            bandwidth: 11794.23/ppn + 0.00 MB/sec
#              latency: 660 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 50
#     device num paths: 1
#              max eps: inf
#       device address: 3 bytes
#        iface address: 5 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: rc_verbs
#         Device: mlx5_1:1
#           Type: network
...

cniethammer, Jul 26 '22 15:07

Hi, could you please post the output of cat /proc/cpuinfo and ucx_info -b? Also, could you please set the environment variable UCX_IB_PCI_RELAXED_ORDERING=no and run ucx_info -d again? This is supposed to be a workaround for the issue.

Artemy-Mellanox, Jul 27 '22 12:07

Thanks for the response.

First, setting UCX_IB_PCI_RELAXED_ORDERING=no helped and prevents the segfault.
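
For anyone else hitting this, here is a minimal shell sketch of applying the workaround to a single command instead of the whole session. The helper name `run_without_relaxed_ordering` is made up for illustration and is not part of UCX; only the setting `UCX_IB_PCI_RELAXED_ORDERING=no` comes from the suggestion above.

```shell
# Hypothetical helper (not part of UCX): run one command with
# UCX_IB_PCI_RELAXED_ORDERING=no in its environment, leaving the
# calling shell's environment untouched.
run_without_relaxed_ordering() {
    env UCX_IB_PCI_RELAXED_ORDERING=no "$@"
}

# Example: run_without_relaxed_ordering ucx_info -d
```

Using `env` keeps the variable scoped to the wrapped command, so other tools started from the same shell are unaffected.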

And here is the requested output:

Output from ucx_info -b

#define UCX_CONFIG_H              
#define ENABLE_ASSERT             1
#define ENABLE_BUILTIN_MEMCPY     1
#define ENABLE_DEBUG_DATA         0
#define ENABLE_MT                 0
#define ENABLE_PARAMS_CHECK       1
#define HAVE_1_ARG_BFD_SECTION_SIZE 0
#define HAVE_ALLOCA               1
#define HAVE_ALLOCA_H             1
#define HAVE_ATTRIBUTE_NOOPTIMIZE 1
#define HAVE_CLEARENV             1
#define HAVE_CPLUS_DEMANGLE       1
#define HAVE_CPU_SET_T            1
#define HAVE_DC_EXP               1
#define HAVE_DECL_ASPRINTF        1
#define HAVE_DECL_BASENAME        1
#define HAVE_DECL_BFD_GET_SECTION_FLAGS 1
#define HAVE_DECL_BFD_GET_SECTION_VMA 1
#define HAVE_DECL_BFD_SECTION_FLAGS 0
#define HAVE_DECL_BFD_SECTION_VMA 1
#define HAVE_DECL_CPU_ISSET       1
#define HAVE_DECL_CPU_ZERO        1
#define HAVE_DECL_ETHTOOL_CMD_SPEED 1
#define HAVE_DECL_FMEMOPEN        1
#define HAVE_DECL_FUSE_MOUNT      0
#define HAVE_DECL_FUSE_OPEN_CHANNEL 0
#define HAVE_DECL_FUSE_UNMOUNT    0
#define HAVE_DECL_F_SETOWN_EX     1
#define HAVE_DECL_GETAUXVAL       1
#define HAVE_DECL_IBV_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_ACCESS_RELAXED_ORDERING 0
#define HAVE_DECL_IBV_ADVISE_MR   0
#define HAVE_DECL_IBV_ALLOC_DM    0
#define HAVE_DECL_IBV_CMD_MODIFY_QP 1
#define HAVE_DECL_IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN 0
#define HAVE_DECL_IBV_CREATE_CQ_EX 0
#define HAVE_DECL_IBV_CREATE_QP_EX 1
#define HAVE_DECL_IBV_CREATE_SRQ  1
#define HAVE_DECL_IBV_CREATE_SRQ_EX 1
#define HAVE_DECL_IBV_EVENT_GID_CHANGE 1
#define HAVE_DECL_IBV_EVENT_TYPE_STR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ALLOCATE_MR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_EXP_ALLOC_DM 1
#define HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE 1
#define HAVE_DECL_IBV_EXP_CQ_IGNORE_OVERRUN 1
#define HAVE_DECL_IBV_EXP_CQ_MODERATION 1
#define HAVE_DECL_IBV_EXP_CREATE_QP 1
#define HAVE_DECL_IBV_EXP_CREATE_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_CREATE_SRQ 1
#define HAVE_DECL_IBV_EXP_DCT_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_DESTROY_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_PCI_ATOMIC_CAPS 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_RESERVED_2 1
#define HAVE_DECL_IBV_EXP_DEVICE_DC_TRANSPORT 1
#define HAVE_DECL_IBV_EXP_DEVICE_MR_ALLOCATE 1
#define HAVE_DECL_IBV_EXP_MR_FIXED_BUFFER_SIZE 1
#define HAVE_DECL_IBV_EXP_MR_INDIRECT_KLMS 1
#define HAVE_DECL_IBV_EXP_ODP_SUPPORT_IMPLICIT 1
#define HAVE_DECL_IBV_EXP_POST_SEND 1
#define HAVE_DECL_IBV_EXP_PREFETCH_MR 1
#define HAVE_DECL_IBV_EXP_PREFETCH_WRITE_ACCESS 1
#define HAVE_DECL_IBV_EXP_QPT_DC_INI 1
#define HAVE_DECL_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG 1
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_QP_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_QUERY_DEVICE 1
#define HAVE_DECL_IBV_EXP_QUERY_GID_ATTR 1
#define HAVE_DECL_IBV_EXP_REG_MR  1
#define HAVE_DECL_IBV_EXP_RES_DOMAIN_THREAD_MODEL 1
#define HAVE_DECL_IBV_EXP_SEND_EXT_ATOMIC_INLINE 1
#define HAVE_DECL_IBV_EXP_SETENV  1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP 1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD 1
#define HAVE_DECL_IBV_EXP_WR_NOP  1
#define HAVE_DECL_IBV_GET_ASYNC_EVENT 1
#define HAVE_DECL_IBV_GET_DEVICE_NAME 1
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#define HAVE_DECL_IBV_LINK_LAYER_INFINIBAND 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_CQ_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_QP_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_SRQ_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_UPDATE_CQ_CI 1
#define HAVE_DECL_IBV_ODP_SUPPORT_IMPLICIT 0
#define HAVE_DECL_IBV_QPF_GRH_REQUIRED 0
#define HAVE_DECL_IBV_QUERY_DEVICE_EX 1
#define HAVE_DECL_IBV_QUERY_GID   1
#define HAVE_DECL_IBV_WC_STATUS_STR 1
#define HAVE_DECL_INOTIFY_ADD_WATCH 1
#define HAVE_DECL_INOTIFY_INIT    1
#define HAVE_DECL_IN_ATTRIB       1
#define HAVE_DECL_IPPROTO_TCP     1
#define HAVE_DECL_MADV_FREE       1
#define HAVE_DECL_MADV_REMOVE     1
#define HAVE_DECL_MLX5_WQE_CTRL_SOLICITED 1
#define HAVE_DECL_POSIX_MADV_DONTNEED 1
#define HAVE_DECL_PR_SET_PTRACER  1
#define HAVE_DECL_SOL_SOCKET      1
#define HAVE_DECL_SO_KEEPALIVE    1
#define HAVE_DECL_SPEED_UNKNOWN   1
#define HAVE_DECL_STRERROR_R      1
#define HAVE_DECL_SYS_BRK         1
#define HAVE_DECL_SYS_IPC         0
#define HAVE_DECL_SYS_MADVISE     1
#define HAVE_DECL_SYS_MMAP        1
#define HAVE_DECL_SYS_MREMAP      1
#define HAVE_DECL_SYS_MUNMAP      1
#define HAVE_DECL_SYS_SHMAT       1
#define HAVE_DECL_SYS_SHMDT       1
#define HAVE_DECL_TCP_KEEPCNT     1
#define HAVE_DECL_TCP_KEEPIDLE    1
#define HAVE_DECL_TCP_KEEPINTVL   1
#define HAVE_DECL___PPC_GET_TIMEBASE 0
#define HAVE_DECL___PPC_GET_TIMEBASE_FREQ 0
#define HAVE_DETAILED_BACKTRACE   1
#define HAVE_DLFCN_H              1
#define HAVE_EXP_UMR              1
#define HAVE_EXP_UMR_KSM          1
#define HAVE_HW_TIMER             1
#define HAVE_IB                   1
#define HAVE_IBV_DM               1
#define HAVE_IBV_EXP_DM           1
#define HAVE_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_IBV_EXP_RES_DOMAIN   1
#define HAVE_IB_EXT_ATOMICS       1
#define HAVE_IN6_ADDR_S6_ADDR32   1
#define HAVE_INFINIBAND_MLX5_HW_H 1
#define HAVE_INOTIFY              1
#define HAVE_INTTYPES_H           1
#define HAVE_IP_IP_DST            1
#define HAVE_LIBBFD               1
#define HAVE_LIBGEN_H             1
#define HAVE_LIBRT                1
#define HAVE_LINUX_FUTEX_H        1
#define HAVE_LINUX_IP_H           1
#define HAVE_LINUX_MMAN_H         1
#define HAVE_MALLOC_H             1
#define HAVE_MALLOC_HOOK          1
#define HAVE_MALLOC_TRIM          1
#define HAVE_MASKED_ATOMICS_ENDIANNESS 1
#define HAVE_MEMALIGN             1
#define HAVE_MEMORY_H             1
#define HAVE_MLX5_HW              1
#define HAVE_MLX5_HW_UD           1
#define HAVE_MREMAP               1
#define HAVE_NETINET_IP_H         1
#define HAVE_NET_ETHERNET_H       1
#define HAVE_NUMA                 1
#define HAVE_NUMAIF_H             1
#define HAVE_NUMA_H               1
#define HAVE_ODP                  1
#define HAVE_ODP_IMPLICIT         1
#define HAVE_POSIX_MEMALIGN       1
#define HAVE_PREFETCH             1
#define HAVE_SCHED_GETAFFINITY    1
#define HAVE_SCHED_SETAFFINITY    1
#define HAVE_SIGACTION_SA_RESTORER 1
#define HAVE_SIGEVENT_SIGEV_UN_TID 1
#define HAVE_SIGHANDLER_T         1
#define HAVE_STDINT_H             1
#define HAVE_STDLIB_H             1
#define HAVE_STRERROR_R           1
#define HAVE_STRINGS_H            1
#define HAVE_STRING_H             1
#define HAVE_STRUCT_BITMASK       1
#define HAVE_STRUCT_DL_PHDR_INFO  1
#define HAVE_STRUCT_IBV_ASYNC_EVENT_ELEMENT_DCT 1
#define HAVE_STRUCT_IBV_EXP_CREATE_SRQ_ATTR_DC_OFFLOAD_PARAMS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_EXP_DEVICE_CAP_FLAGS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS_PER_TRANSPORT_CAPS_DC_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_MR_MAX_SIZE 1
#define HAVE_STRUCT_IBV_EXP_QP_INIT_ATTR_MAX_INL_RECV 1
#define HAVE_STRUCT_IBV_MLX5_QP_INFO_BF_NEED_LOCK 1
#define HAVE_STRUCT_MLX5_AH_IBV_AH 1
#define HAVE_STRUCT_MLX5_CQE64_IB_STRIDE_INDEX 1
#define HAVE_STRUCT_MLX5_GRH_AV_RMAC 1
#define HAVE_STRUCT_MLX5_SRQ_CMD_QP 1
#define HAVE_STRUCT_MLX5_WQE_AV_BASE 1
#define HAVE_SYS_EPOLL_H          1
#define HAVE_SYS_EVENTFD_H        1
#define HAVE_SYS_STAT_H           1
#define HAVE_SYS_TYPES_H          1
#define HAVE_SYS_UIO_H            1
#define HAVE_TL_DC                1
#define HAVE_TL_RC                1
#define HAVE_TL_UD                1
#define HAVE_UCM_PTMALLOC286      1
#define HAVE_UNISTD_H             1
#define HAVE_VERBS_EXP_H          1
#define HAVE___CLEAR_CACHE        1
#define HAVE___CURBRK             1
#define HAVE___SIGHANDLER_T       1
#define IBV_HW_TM                 1
#define LT_OBJDIR                 ".libs/"
#define NVALGRIND                 1
#define PACKAGE                   "ucx"
#define PACKAGE_BUGREPORT         ""
#define PACKAGE_NAME              "ucx"
#define PACKAGE_STRING            "ucx 1.13"
#define PACKAGE_TARNAME           "ucx"
#define PACKAGE_URL               ""
#define PACKAGE_VERSION           "1.13"
#define STDC_HEADERS              1
#define STRERROR_R_CHAR_P         1
#define UCM_BISTRO_HOOKS          1
#define UCS_MAX_LOG_LEVEL         UCS_LOG_LEVEL_TRACE_POLL
#define UCT_TCP_EP_KEEPALIVE      1
#define UCT_UD_EP_DEBUG_HOOKS     0
#define UCX_CONFIGURE_FLAGS       "--prefix=$HOME/bin/ucx-1.13.0"
#define UCX_MODULE_SUBDIR         "ucx"
#define VERSION                   "1.13"
#define restrict                  __restrict
#define test_MODULES              ":module"
#define ucm_MODULES               ""
#define ucs_MODULES               ""
#define uct_MODULES               ":ib:rdmacm:cma:knem"
#define uct_cuda_MODULES          ""
#define uct_ib_MODULES            ""
#define uct_rocm_MODULES          ""
#define ucx_perftest_MODULES      ""

Output from cat /proc/cpuinfo:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 49
model name	: AMD EPYC 7702 64-Core Processor
stepping	: 0
microcode	: 0x830104d
cpu MHz		: 1796.051
cache size	: 512 KB
physical id	: 0
siblings	: 128
core id		: 0
cpu cores	: 64
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 3992.82
TLB size	: 3072 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 49
model name	: AMD EPYC 7702 64-Core Processor
stepping	: 0
microcode	: 0x830104d
cpu MHz		: 1794.426
cache size	: 512 KB
physical id	: 0
siblings	: 128
core id		: 1
cpu cores	: 64
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 3992.82
TLB size	: 3072 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
...

cniethammer, Jul 27 '22 13:07

Seems related to #8216?

tonycurtis, Jul 27 '22 14:07

Judging from the changelog, I am not sure whether this was addressed. However, I would like to report that I cannot reproduce the issue with UCX 1.14.0.
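
If someone wants to keep the workaround only on older installations, a rough sketch of a version comparison could look like the following. This rests on assumptions: the 1.14.0 threshold reflects only my own observation above, not a confirmed fix; the helper name `version_lt` is made up; and it relies on `sort -V` from GNU coreutils. The version string is passed in by hand because the layout of the `ucx_info -v` output may differ between releases.

```shell
# Hypothetical helper: succeed if version $1 is strictly older than version $2.
# Relies on `sort -V` (GNU coreutils version sort).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example (assumed threshold, see the note above):
#   if version_lt "1.13.0" "1.14.0"; then
#       export UCX_IB_PCI_RELAXED_ORDERING=no
#   fi
```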

cniethammer, Jun 01 '23 11:06