ucx hangs when compiling with nvhpc and anything above -O1, assertion triggered

Open haampie opened this issue 2 years ago • 1 comment

Describe the bug

When ucx is compiled with the NVHPC compilers v22.3 and run over mlx5_0:1, ucx_perftest hangs, and pressing ^C makes the server trigger an assertion. The exact same setup built with GCC works fine. The NVHPC build also works fine with --enable-compiler-opt=1, but not with --enable-compiler-opt=2 or above.

The assertion is:

[1650460671.381413] [nid020048:12945:0]        perftest.c:129  UCX  ERROR recv() failed: Connection reset by peer
[nid020048:12945:0:12945]    perftest.c:273  Assertion `size <= max' failed

Steps to Reproduce

I've built ucx with Spack, using the following environment:

Spack environment
spack:
  concretization: separately
  specs:
  # gcc version
  - osu-micro-benchmarks%gcc +cuda ^[email protected]:4 +cuda +cxx schedulers=slurm fabrics=ucx
    ^ucx +rdmacm +cma +verbs +xpmem +ib-hw-tm +mlx5-dv +dc +ud +rc +dm +optimizations
    +gdrcopy ~assertions ~debug ^cuda@:11.0

  # nvhpc version
  - osu-micro-benchmarks%nvhpc +cuda ^[email protected]:4%nvhpc +cuda +cxx schedulers=slurm
    fabrics=ucx ^ucx%nvhpc +rdmacm +cma +verbs +xpmem +ib-hw-tm +mlx5-dv +dc +ud +rc
    +dm +optimizations +gdrcopy +assertions ~debug
  view: false
  config:
    install_tree:
      root: /spack
  packages:
    openssl:
      variants: [certs=mozilla]
    libtool:
      externals:
      - spec: [email protected]
        prefix: /usr
    m4:
      externals:
      - spec: [email protected]
        prefix: /usr
    autoconf:
      externals:
      - spec: [email protected]
        prefix: /usr
    automake:
      externals:
      - spec: [email protected]
        prefix: /usr
    perl:
      externals:
      - spec: [email protected]~cpanm+shared+threads
        prefix: /usr
    slurm:
      externals:
      - spec: slurm@20-11-8-1
        prefix: /usr
    rdma-core:
      externals:
      - spec: [email protected]
        prefix: /usr
    xpmem:
      externals:
      - spec: [email protected]
        prefix: /opt/cray/xpmem/2.2.40-2.1_2.56__g3cf3325.shasta
  compilers:
  - compiler:
      spec: [email protected]
      paths:
        cc: /spack/linux-sles15-zen/gcc-7.5.0/gcc-9.4.0-fl2gp6kxlqfoydt3jogtr5pcus5loyx7/bin/gcc
        cxx: /spack/linux-sles15-zen/gcc-7.5.0/gcc-9.4.0-fl2gp6kxlqfoydt3jogtr5pcus5loyx7/bin/g++
        f77: /spack/linux-sles15-zen/gcc-7.5.0/gcc-9.4.0-fl2gp6kxlqfoydt3jogtr5pcus5loyx7/bin/gfortran
        fc: /spack/linux-sles15-zen/gcc-7.5.0/gcc-9.4.0-fl2gp6kxlqfoydt3jogtr5pcus5loyx7/bin/gfortran
      flags: {}
      operating_system: sles15
      target: x86_64
      modules: []
      environment: {}
      extra_rpaths: []
  - compiler:
      spec: [email protected]
      paths:
        cc: /spack/linux-sles15-zen2/gcc-9.4.0/nvhpc-22.3-m43i2j7uke6pwzxfkoytue7gordmtatg/Linux_x86_64/22.3/compilers/bin/nvc
        cxx: /spack/linux-sles15-zen2/gcc-9.4.0/nvhpc-22.3-m43i2j7uke6pwzxfkoytue7gordmtatg/Linux_x86_64/22.3/compilers/bin/nvc++
        f77: /spack/linux-sles15-zen2/gcc-9.4.0/nvhpc-22.3-m43i2j7uke6pwzxfkoytue7gordmtatg/Linux_x86_64/22.3/compilers/bin/nvfortran
        fc: /spack/linux-sles15-zen2/gcc-9.4.0/nvhpc-22.3-m43i2j7uke6pwzxfkoytue7gordmtatg/Linux_x86_64/22.3/compilers/bin/nvfortran
      flags: {}
      operating_system: sles15
      target: x86_64
      modules: []
      environment: {}
      extra_rpaths: []
Concretized environment:
Input spec
--------------------------------
osu-micro-benchmarks%nvhpc+cuda
    ^[email protected]:4%nvhpc+cuda+cxx fabrics=ucx schedulers=slurm
    ^ucx%nvhpc+assertions+cma+dc~debug+dm+gdrcopy+ib-hw-tm+mlx5-dv+optimizations+rc+rdmacm+ud+verbs+xpmem

Concretized
--------------------------------
[email protected]%[email protected]+cuda arch=linux-sles15-zen2
    ^[email protected]%[email protected]~allow-unsupported-compilers~dev arch=linux-sles15-zen2
        ^[email protected]%[email protected]~python patches=05ff238,10a88ad arch=linux-sles15-zen2
            ^[email protected]%[email protected] libs=shared,static arch=linux-sles15-zen2
            ^[email protected]%[email protected] arch=linux-sles15-zen2
            ^[email protected]%[email protected]~pic libs=shared,static arch=linux-sles15-zen2
            ^[email protected]%[email protected]+optimize+pic+shared patches=0d38234 arch=linux-sles15-zen2
    ^[email protected]%[email protected]~atomics+cuda+cxx~cxx_exceptions~gpfs~internal-hwloc~java~legacylaunchers~lustre~memchecker+pmi~pmix+romio~singularity~sqlite3+static~thread_multiple+vt+wrapper-rpath cuda_arch=none fabrics=ucx patches=fba0d3a schedulers=slurm arch=linux-sles15-zen2
        ^[email protected]%[email protected]~cairo+cuda~gl~libudev+libxml2~netloc~nvml~opencl+pci~rocm+shared arch=linux-sles15-zen2
            ^[email protected]%[email protected] patches=6e08dc4 arch=linux-sles15-zen2
                ^[email protected]%[email protected] arch=linux-sles15-zen2
                ^[email protected]%[email protected] arch=linux-sles15-zen2
            ^[email protected]%[email protected]~symlinks+termlib abi=none patches=933af9e arch=linux-sles15-zen2
        ^[email protected]%[email protected]+openssl arch=linux-sles15-zen2
            ^[email protected]%[email protected]~docs~shared certs=mozilla arch=linux-sles15-zen2
                ^ca-certificates-mozilla@2022-03-29%[email protected] arch=linux-sles15-zen2
                ^[email protected]%[email protected]~cpanm+shared+threads patches=0eac10e,8cf4302 arch=linux-sles15-zen2
        ^[email protected]%[email protected] patches=4e1d78c,62fc8a8,ff37630 arch=linux-sles15-zen2
            ^[email protected]%[email protected] patches=7793209 arch=linux-sles15-zen2
            ^[email protected]%[email protected] arch=linux-sles15-zen2
            ^[email protected]%[email protected]+sigsegv patches=3877ab5,5746cf5,fc9b616 arch=linux-sles15-zen2
        ^[email protected]%[email protected] arch=linux-sles15-zen2
            ^[email protected]%[email protected] arch=linux-sles15-zen2
        ^slurm@20-11-8-1%[email protected]~gtk~hdf5~hwloc~mariadb~pmix+readline~restd sysconfdir=PREFIX/etc arch=linux-sles15-zen2
        ^[email protected]%[email protected]+assertions~backtrace-detail+cma+cuda+dc~debug+dm+doc+examples+gdrcopy+ib-hw-tm~java~knem~logging+mlx5-dv+openmp+optimizations~parameter_checking+pic+rc+rdmacm~rocm+shared~static+thread_multiple~ucg+ud+verbs+xpmem cuda_arch=none opt=3 simd=auto arch=linux-sles15-zen2
            ^[email protected]%[email protected] patches=c5efec1 arch=linux-sles15-zen2
            ^[email protected]%[email protected]~ipo build_type=RelWithDebInfo arch=linux-sles15-zen2
            ^[email protected]%[email protected]+kernel-module arch=linux-sles15-zen2
  • spack -e [env] install
  • server: UCX_NET_DEVICES=mlx5_0:1 srun -N1 -n1 --pty /spack/linux-sles15-zen2/nvhpc-22.3/ucx-1.12.1-4n2fet6aun5ilyfy4rxt2c247e4rajku/bin/ucx_perftest
  • client: UCX_NET_DEVICES=mlx5_0:1 srun -N1 -n1 --pty /spack/linux-sles15-zen2/nvhpc-22.3/ucx-1.12.1-4n2fet6aun5ilyfy4rxt2c247e4rajku/bin/ucx_perftest nid020048 -t ucp_put_lat
  • UCX configure flags:
$ /spack/linux-sles15-zen2/nvhpc-22.3/ucx-1.12.1-4n2fet6aun5ilyfy4rxt2c247e4rajku/bin/ucx_info -v
# UCT version=1.12.1 revision dc92435
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/spack/linux-sles15-zen2/nvhpc-22.3/ucx-1.12.1-4n2fet6aun5ilyfy4rxt2c247e4rajku --enable-mt --enable-cma --disable-params-check --enable-optimizations --enable-compiler-opt=3 --enable-assertions --disable-logging --disable-backtrace-detail --with-pic --with-rc --with-ud --with-dc --with-mlx5-dv --with-ib-hw-tm --with-dm --without-rocm --without-java --with-cuda=/apps/manali/UES/store/linux-sles15-zen2/nvhpc-22.3/cuda-11.6.2-a3layevrzuvgkl2anzqg3qpyrrevcrtz --with-gdrcopy=/spack/linux-sles15-zen2/nvhpc-22.3/gdrcopy-2.3-bm5yhjui33p3bpovoahya3f4dmzbuwh7 --without-knem --with-xpmem=/opt/cray/xpmem/2.2.40-2.1_2.56__g3cf3325.shasta --with-rdmacm=/usr --disable-static --enable-shared --disable-static --with-openmp --with-avx --with-verbs=/usr
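
Note that the configure line above repeats some options with opposite values (for example --disable-assertions early on and --enable-assertions later). A minimal sketch for reading such a line, assuming the usual autoconf behavior that the last occurrence of an option wins; the helper name `effective_configure_flags` is hypothetical and not part of UCX or Spack, and the sample string is a shortened excerpt of the line above:

```python
import re


def effective_configure_flags(configured_with):
    """Resolve repeated configure options, assuming the last occurrence wins.

    Returns a dict keyed by (group, name), where group is "enable" or "with".
    Values are False for --disable-X/--without-X, True for a bare
    --enable-X/--with-X, or the string after "=" when one is given.
    """
    flags = {}
    pattern = r"--(enable|disable|with|without)-([^=]+)(?:=(.*))?"
    for token in configured_with.split():
        match = re.fullmatch(pattern, token)
        if match is None:
            continue  # not an enable/with style option, e.g. --prefix=...
        kind, name, value = match.groups()
        group = "enable" if kind in ("enable", "disable") else "with"
        if kind in ("disable", "without"):
            flags[(group, name)] = False
        else:
            flags[(group, name)] = value if value is not None else True
    return flags


# Shortened excerpt of the "configured with:" line from the issue.
sample = (
    "--disable-logging --disable-debug --disable-assertions "
    "--enable-compiler-opt=3 --enable-assertions --without-rocm "
    "--with-verbs=/usr"
)

for (group, name), value in sorted(effective_configure_flags(sample).items()):
    print(f"{group}-{name}: {value}")
```

Read this way, the build reported here would have assertions enabled and compiler-opt set to 3, which matches the +assertions and opt=3 variants in the concretized spec.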

Setup and versions

$ cat /etc/os-release 
NAME="SLES"
VERSION="15-SP2"
VERSION_ID="15.2"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP2"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp2"
$ uname -a
Linux nid020000 5.3.18-24.75_10.0.189-cray_shasta_c #1 SMP Sun Sep 26 14:27:04 UTC 2021 (0388af5) x86_64 x86_64 x86_64 GNU/Linux
$ rpm -q libibverbs
libibverbs-51mlnx1-1.51258.060.x86_64
$ ofed_info -s
OFED-internal-5.1-2.5.8.0.60:
$ ibv_devinfo -vv
hca_id:	mlx5_0
	transport:			InfiniBand (0)
	fw_ver:				16.28.2006
	node_guid:			0040:a684:abf3:0000
	sys_image_guid:			0040:a684:abf3:0000
	vendor_id:			0x02c9
	vendor_part_id:			4119
	hw_ver:				0x0
	board_id:			CRAY000000001
	phys_port_cnt:			1
	max_mr_size:			0xffffffffffffffff
	page_size_cap:			0xfffffffffffff000
	max_qp:				262144
	max_qp_wr:			32768
	device_cap_flags:		0xed721c36
					BAD_PKEY_CNTR
					BAD_QKEY_CNTR
					AUTO_PATH_MIG
					CHANGE_PHY_PORT
					PORT_ACTIVE_EVENT
					SYS_IMAGE_GUID
					RC_RNR_NAK_GEN
					MEM_WINDOW
					XRC
					MEM_MGT_EXTENSIONS
					MEM_WINDOW_TYPE_2B
					RAW_IP_CSUM
					MANAGED_FLOW_STEERING
					Unknown flags: 0xC8400000
	max_sge:			30
	max_sge_rd:			30
	max_cq:				16777216
	max_cqe:			4194303
	max_mr:				16777216
	max_pd:				16777216
	max_qp_rd_atom:			16
	max_ee_rd_atom:			0
	max_res_rd_atom:		4194304
	max_qp_init_rd_atom:		16
	max_ee_init_rd_atom:		0
	atomic_cap:			ATOMIC_HCA (1)
	max_ee:				0
	max_rdd:			0
	max_mw:				16777216
	max_raw_ipv6_qp:		0
	max_raw_ethy_qp:		0
	max_mcast_grp:			2097152
	max_mcast_qp_attach:		240
	max_total_mcast_qp_attach:	503316480
	max_ah:				2147483647
	max_fmr:			0
	max_srq:			8388608
	max_srq_wr:			32767
	max_srq_sge:			31
	max_pkeys:			128
	local_ca_ack_delay:		16
	general_odp_caps:
					ODP_SUPPORT
					ODP_SUPPORT_IMPLICIT
	rc_odp_caps:
					SUPPORT_SEND
					SUPPORT_RECV
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_SRQ
	uc_odp_caps:
					NO SUPPORT
	ud_odp_caps:
					SUPPORT_SEND
	xrc_odp_caps:
					SUPPORT_SEND
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_SRQ
	completion timestamp_mask:			0x7fffffffffffffff
	hca_core_clock:			156250kHZ
	raw packet caps:
					C-VLAN stripping offload
					Scatter FCS offload
					IP csum offload
					Delay drop
	device_cap_flags_ex:		0x30000055ED721C36
					RAW_SCATTER_FCS
					PCI_WRITE_END_PADDING
					Unknown flags: 0x3000004100000000
	tso_caps:
	max_tso:			262144
	supported_qp:
					SUPPORT_RAW_PACKET
	rss_caps:
		max_rwq_indirection_tables:			65536
		max_rwq_indirection_table_size:			2048
		rx_hash_function:				0x1
		rx_hash_fields_mask:				0x800000FF
		supported_qp:
					SUPPORT_RAW_PACKET
	max_wq_type_rq:			8388608
	packet_pacing_caps:
		qp_rate_limit_min:	1kbps
		qp_rate_limit_max:	100000000kbps
		supported_qp:
					SUPPORT_RAW_PACKET
	tag matching not supported

	cq moderation caps:
		max_cq_count:	65535
		max_cq_period:	4095 us

	maximum available device memory:	262144Bytes

		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet
			max_msg_sz:		0x40000000
			port_cap_flags:		0x04010000
			port_cap_flags2:	0x0000
			max_vl_num:		invalid value (0)
			bad_pkey_cntr:		0x0
			qkey_viol_cntr:		0x0
			sm_sl:			0
			pkey_tbl_len:		1
			gid_tbl_len:		256
			subnet_timeout:		0
			init_type_reply:	0
			active_width:		4X (2)
			active_speed:		25.0 Gbps (32)
			phys_state:		LINK_UP (5)
			GID[  0]:		fe80:0000:0000:0000:0000:00ff:fe00:30b3, RoCE v1
			GID[  1]:		fe80::ff:fe00:30b3, RoCE v2
			GID[  2]:		0000:0000:0000:0000:0000:ffff:94bb:7454, RoCE v1
			GID[  3]:		::ffff:148.187.116.84, RoCE v2

hca_id:	mlx5_1
	transport:			InfiniBand (0)
	fw_ver:				16.28.2006
	node_guid:			0040:a684:abe1:0000
	sys_image_guid:			0040:a684:abe1:0000
	vendor_id:			0x02c9
	vendor_part_id:			4119
	hw_ver:				0x0
	board_id:			CRAY000000001
	phys_port_cnt:			1
	max_mr_size:			0xffffffffffffffff
	page_size_cap:			0xfffffffffffff000
	max_qp:				262144
	max_qp_wr:			32768
	device_cap_flags:		0xed721c36
					BAD_PKEY_CNTR
					BAD_QKEY_CNTR
					AUTO_PATH_MIG
					CHANGE_PHY_PORT
					PORT_ACTIVE_EVENT
					SYS_IMAGE_GUID
					RC_RNR_NAK_GEN
					MEM_WINDOW
					XRC
					MEM_MGT_EXTENSIONS
					MEM_WINDOW_TYPE_2B
					RAW_IP_CSUM
					MANAGED_FLOW_STEERING
					Unknown flags: 0xC8400000
	max_sge:			30
	max_sge_rd:			30
	max_cq:				16777216
	max_cqe:			4194303
	max_mr:				16777216
	max_pd:				16777216
	max_qp_rd_atom:			16
	max_ee_rd_atom:			0
	max_res_rd_atom:		4194304
	max_qp_init_rd_atom:		16
	max_ee_init_rd_atom:		0
	atomic_cap:			ATOMIC_HCA (1)
	max_ee:				0
	max_rdd:			0
	max_mw:				16777216
	max_raw_ipv6_qp:		0
	max_raw_ethy_qp:		0
	max_mcast_grp:			2097152
	max_mcast_qp_attach:		240
	max_total_mcast_qp_attach:	503316480
	max_ah:				2147483647
	max_fmr:			0
	max_srq:			8388608
	max_srq_wr:			32767
	max_srq_sge:			31
	max_pkeys:			128
	local_ca_ack_delay:		16
	general_odp_caps:
					ODP_SUPPORT
					ODP_SUPPORT_IMPLICIT
	rc_odp_caps:
					SUPPORT_SEND
					SUPPORT_RECV
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_SRQ
	uc_odp_caps:
					NO SUPPORT
	ud_odp_caps:
					SUPPORT_SEND
	xrc_odp_caps:
					SUPPORT_SEND
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_SRQ
	completion timestamp_mask:			0x7fffffffffffffff
	hca_core_clock:			156250kHZ
	raw packet caps:
					C-VLAN stripping offload
					Scatter FCS offload
					IP csum offload
					Delay drop
	device_cap_flags_ex:		0x30000055ED721C36
					RAW_SCATTER_FCS
					PCI_WRITE_END_PADDING
					Unknown flags: 0x3000004100000000
	tso_caps:
	max_tso:			262144
	supported_qp:
					SUPPORT_RAW_PACKET
	rss_caps:
		max_rwq_indirection_tables:			65536
		max_rwq_indirection_table_size:			2048
		rx_hash_function:				0x1
		rx_hash_fields_mask:				0x800000FF
		supported_qp:
					SUPPORT_RAW_PACKET
	max_wq_type_rq:			8388608
	packet_pacing_caps:
		qp_rate_limit_min:	1kbps
		qp_rate_limit_max:	100000000kbps
		supported_qp:
					SUPPORT_RAW_PACKET
	tag matching not supported

	cq moderation caps:
		max_cq_count:	65535
		max_cq_period:	4095 us

	maximum available device memory:	262144Bytes

		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet
			max_msg_sz:		0x40000000
			port_cap_flags:		0x04010000
			port_cap_flags2:	0x0000
			max_vl_num:		invalid value (0)
			bad_pkey_cntr:		0x0
			qkey_viol_cntr:		0x0
			sm_sl:			0
			pkey_tbl_len:		1
			gid_tbl_len:		256
			subnet_timeout:		0
			init_type_reply:	0
			active_width:		4X (2)
			active_speed:		25.0 Gbps (32)
			phys_state:		LINK_UP (5)
			GID[  0]:		fe80:0000:0000:0000:0000:00ff:fe00:3033, RoCE v1
			GID[  1]:		fe80::ff:fe00:3033, RoCE v2
			GID[  2]:		0000:0000:0000:0000:0000:ffff:94bb:7434, RoCE v1
			GID[  3]:		::ffff:148.187.116.52, RoCE v2

hca_id:	mlx5_2
	transport:			InfiniBand (0)
	fw_ver:				16.28.2006
	node_guid:			0040:a684:abe2:0000
	sys_image_guid:			0040:a684:abe2:0000
	vendor_id:			0x02c9
	vendor_part_id:			4119
	hw_ver:				0x0
	board_id:			CRAY000000001
	phys_port_cnt:			1
	max_mr_size:			0xffffffffffffffff
	page_size_cap:			0xfffffffffffff000
	max_qp:				262144
	max_qp_wr:			32768
	device_cap_flags:		0xed721c36
					BAD_PKEY_CNTR
					BAD_QKEY_CNTR
					AUTO_PATH_MIG
					CHANGE_PHY_PORT
					PORT_ACTIVE_EVENT
					SYS_IMAGE_GUID
					RC_RNR_NAK_GEN
					MEM_WINDOW
					XRC
					MEM_MGT_EXTENSIONS
					MEM_WINDOW_TYPE_2B
					RAW_IP_CSUM
					MANAGED_FLOW_STEERING
					Unknown flags: 0xC8400000
	max_sge:			30
	max_sge_rd:			30
	max_cq:				16777216
	max_cqe:			4194303
	max_mr:				16777216
	max_pd:				16777216
	max_qp_rd_atom:			16
	max_ee_rd_atom:			0
	max_res_rd_atom:		4194304
	max_qp_init_rd_atom:		16
	max_ee_init_rd_atom:		0
	atomic_cap:			ATOMIC_HCA (1)
	max_ee:				0
	max_rdd:			0
	max_mw:				16777216
	max_raw_ipv6_qp:		0
	max_raw_ethy_qp:		0
	max_mcast_grp:			2097152
	max_mcast_qp_attach:		240
	max_total_mcast_qp_attach:	503316480
	max_ah:				2147483647
	max_fmr:			0
	max_srq:			8388608
	max_srq_wr:			32767
	max_srq_sge:			31
	max_pkeys:			128
	local_ca_ack_delay:		16
	general_odp_caps:
					ODP_SUPPORT
					ODP_SUPPORT_IMPLICIT
	rc_odp_caps:
					SUPPORT_SEND
					SUPPORT_RECV
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_SRQ
	uc_odp_caps:
					NO SUPPORT
	ud_odp_caps:
					SUPPORT_SEND
	xrc_odp_caps:
					SUPPORT_SEND
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_SRQ
	completion timestamp_mask:			0x7fffffffffffffff
	hca_core_clock:			156250kHZ
	raw packet caps:
					C-VLAN stripping offload
					Scatter FCS offload
					IP csum offload
					Delay drop
	device_cap_flags_ex:		0x30000055ED721C36
					RAW_SCATTER_FCS
					PCI_WRITE_END_PADDING
					Unknown flags: 0x3000004100000000
	tso_caps:
	max_tso:			262144
	supported_qp:
					SUPPORT_RAW_PACKET
	rss_caps:
		max_rwq_indirection_tables:			65536
		max_rwq_indirection_table_size:			2048
		rx_hash_function:				0x1
		rx_hash_fields_mask:				0x800000FF
		supported_qp:
					SUPPORT_RAW_PACKET
	max_wq_type_rq:			8388608
	packet_pacing_caps:
		qp_rate_limit_min:	1kbps
		qp_rate_limit_max:	100000000kbps
		supported_qp:
					SUPPORT_RAW_PACKET
	tag matching not supported

	cq moderation caps:
		max_cq_count:	65535
		max_cq_period:	4095 us

	maximum available device memory:	262144Bytes

		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet
			max_msg_sz:		0x40000000
			port_cap_flags:		0x04010000
			port_cap_flags2:	0x0000
			max_vl_num:		invalid value (0)
			bad_pkey_cntr:		0x0
			qkey_viol_cntr:		0x0
			sm_sl:			0
			pkey_tbl_len:		1
			gid_tbl_len:		256
			subnet_timeout:		0
			init_type_reply:	0
			active_width:		4X (2)
			active_speed:		25.0 Gbps (32)
			phys_state:		LINK_UP (5)
			GID[  0]:		fe80:0000:0000:0000:0000:00ff:fe00:3032, RoCE v1
			GID[  1]:		fe80::ff:fe00:3032, RoCE v2
			GID[  2]:		0000:0000:0000:0000:0000:ffff:94bb:7433, RoCE v1
			GID[  3]:		::ffff:148.187.116.51, RoCE v2

hca_id:	mlx5_3
	transport:			InfiniBand (0)
	fw_ver:				16.28.2006
	node_guid:			0040:a684:abf4:0000
	sys_image_guid:			0040:a684:abf4:0000
	vendor_id:			0x02c9
	vendor_part_id:			4119
	hw_ver:				0x0
	board_id:			CRAY000000001
	phys_port_cnt:			1
	max_mr_size:			0xffffffffffffffff
	page_size_cap:			0xfffffffffffff000
	max_qp:				262144
	max_qp_wr:			32768
	device_cap_flags:		0xed721c36
					BAD_PKEY_CNTR
					BAD_QKEY_CNTR
					AUTO_PATH_MIG
					CHANGE_PHY_PORT
					PORT_ACTIVE_EVENT
					SYS_IMAGE_GUID
					RC_RNR_NAK_GEN
					MEM_WINDOW
					XRC
					MEM_MGT_EXTENSIONS
					MEM_WINDOW_TYPE_2B
					RAW_IP_CSUM
					MANAGED_FLOW_STEERING
					Unknown flags: 0xC8400000
	max_sge:			30
	max_sge_rd:			30
	max_cq:				16777216
	max_cqe:			4194303
	max_mr:				16777216
	max_pd:				16777216
	max_qp_rd_atom:			16
	max_ee_rd_atom:			0
	max_res_rd_atom:		4194304
	max_qp_init_rd_atom:		16
	max_ee_init_rd_atom:		0
	atomic_cap:			ATOMIC_HCA (1)
	max_ee:				0
	max_rdd:			0
	max_mw:				16777216
	max_raw_ipv6_qp:		0
	max_raw_ethy_qp:		0
	max_mcast_grp:			2097152
	max_mcast_qp_attach:		240
	max_total_mcast_qp_attach:	503316480
	max_ah:				2147483647
	max_fmr:			0
	max_srq:			8388608
	max_srq_wr:			32767
	max_srq_sge:			31
	max_pkeys:			128
	local_ca_ack_delay:		16
	general_odp_caps:
					ODP_SUPPORT
					ODP_SUPPORT_IMPLICIT
	rc_odp_caps:
					SUPPORT_SEND
					SUPPORT_RECV
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_SRQ
	uc_odp_caps:
					NO SUPPORT
	ud_odp_caps:
					SUPPORT_SEND
	xrc_odp_caps:
					SUPPORT_SEND
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_SRQ
	completion timestamp_mask:			0x7fffffffffffffff
	hca_core_clock:			156250kHZ
	raw packet caps:
					C-VLAN stripping offload
					Scatter FCS offload
					IP csum offload
					Delay drop
	device_cap_flags_ex:		0x30000055ED721C36
					RAW_SCATTER_FCS
					PCI_WRITE_END_PADDING
					Unknown flags: 0x3000004100000000
	tso_caps:
	max_tso:			262144
	supported_qp:
					SUPPORT_RAW_PACKET
	rss_caps:
		max_rwq_indirection_tables:			65536
		max_rwq_indirection_table_size:			2048
		rx_hash_function:				0x1
		rx_hash_fields_mask:				0x800000FF
		supported_qp:
					SUPPORT_RAW_PACKET
	max_wq_type_rq:			8388608
	packet_pacing_caps:
		qp_rate_limit_min:	1kbps
		qp_rate_limit_max:	100000000kbps
		supported_qp:
					SUPPORT_RAW_PACKET
	tag matching not supported

	cq moderation caps:
		max_cq_count:	65535
		max_cq_period:	4095 us

	maximum available device memory:	262144Bytes

		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet
			max_msg_sz:		0x40000000
			port_cap_flags:		0x04010000
			port_cap_flags2:	0x0000
			max_vl_num:		invalid value (0)
			bad_pkey_cntr:		0x0
			qkey_viol_cntr:		0x0
			sm_sl:			0
			pkey_tbl_len:		1
			gid_tbl_len:		256
			subnet_timeout:		0
			init_type_reply:	0
			active_width:		4X (2)
			active_speed:		25.0 Gbps (32)
			phys_state:		LINK_UP (5)
			GID[  0]:		fe80:0000:0000:0000:0000:00ff:fe00:30b2, RoCE v1
			GID[  1]:		fe80::ff:fe00:30b2, RoCE v2
			GID[  2]:		0000:0000:0000:0000:0000:ffff:94bb:7453, RoCE v1
			GID[  3]:		::ffff:148.187.116.83, RoCE v2

haampie, Apr 20 '22 13:04

To complicate things, it does not always hang or trigger the assertion :(. Maybe undefined behavior / compiler bug?

haampie, Apr 20 '22 13:04
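
Since the hang is intermittent, it may help to put a number on it by running the client many times under a timeout and counting how often it hangs or fails, once per --enable-compiler-opt level. A minimal sketch, not a definitive reproducer: the helper `count_hangs` and the timeout value are assumptions, and the server side still has to be started (and restarted after a hang) separately.

```python
import subprocess
import sys


def count_hangs(cmd, runs=10, timeout_s=60.0):
    """Run cmd `runs` times and return (hangs, failures).

    A run counts as a hang when it exceeds timeout_s seconds, and as a
    failure when it finishes in time with a non-zero exit code.
    """
    hangs = 0
    failures = 0
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process before raising.
            hangs += 1
            continue
        if result.returncode != 0:
            failures += 1
    return hangs, failures


# Hypothetical usage with the client command from the issue
# (UCX_NET_DEVICES=mlx5_0:1 would be set in the environment beforehand):
#   count_hangs(["srun", "-N1", "-n1", ".../bin/ucx_perftest",
#                "nid020048", "-t", "ucp_put_lat"], runs=20, timeout_s=120)
```

Comparing the counts between the opt=1 and opt=2 builds would show whether the failure rate really changes with the optimization level or whether the opt=1 build just fails less often.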