
GPU->GPU performance issue

Open Yiltan opened this issue 2 years ago • 2 comments

Describe the bug

I get around ~12000 MB/s for inter-node GPU->GPU data transfers over a ConnectX-6 (200 Gbit/s) link, and around ~24000 MB/s for the same test using host memory. Should the difference be this large?

+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0]               209      0.360  4951.277  4951.277    12925.96   12925.96         202         202
[thread 0]               401      0.370  5385.354  5159.115    11884.08   12405.23         186         194
[thread 0]               577      0.370  5691.551  5321.522    11244.74   12026.64         176         188
[thread 0]               753      0.370  5683.977  5406.239    11259.72   11838.17         176         185
[thread 0]               929      0.370  5684.204  5458.900    11259.27   11723.97         176         183
[thread 0]              1121      0.370  5383.027  5445.905    11889.22   11751.95         186         184
[thread 0]              1313      0.370  5390.026  5437.733    11873.78   11769.61         186         184
............

Steps to Reproduce

  • Command line (a NUMA-pinned variant is sketched after this list)

    • Node A: UCX_NET_DEVICES=mlx5_0:1 ucx_perftest -t tag_bw -m cuda -s $((64 * 1024 * 1024))
    • Node B: UCX_NET_DEVICES=mlx5_0:1 ucx_perftest nodeA -t tag_bw -m cuda -s $((64 * 1024 * 1024))
  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)

# Version 1.13.1
# Git branch '', revision 09f27c0
# Configured with: --prefix=/home/username/build --with-cuda=/opt/cuda/11.7.0/ --enable-mt --without-knem --with-verbs
  • Any UCX environment variables used
    • UCX_NET_DEVICES=mlx5_0:1 (set on the command line above)
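
As an aside on the reproduction commands above: a rerun that pins the process to the NUMA node local to mlx5_0, using the GPU on the same socket, can help rule out cross-socket effects. This is only a sketch; it assumes numactl is installed and that mlx5_0 reports NUMA node 0 in sysfs (check with the first command before picking the node and GPU index):

# Locality of the HCA (prints the NUMA node number, or -1 if unknown)
cat /sys/class/infiniband/mlx5_0/device/numa_node

# Node A (server): pin CPU and memory to the NIC-local node, use the GPU on the same socket
CUDA_VISIBLE_DEVICES=0 UCX_NET_DEVICES=mlx5_0:1 \
    numactl --cpunodebind=0 --membind=0 \
    ucx_perftest -t tag_bw -m cuda -s $((64 * 1024 * 1024))

# Node B (client): same pinning
CUDA_VISIBLE_DEVICES=0 UCX_NET_DEVICES=mlx5_0:1 \
    numactl --cpunodebind=0 --membind=0 \
    ucx_perftest nodeA -t tag_bw -m cuda -s $((64 * 1024 * 1024))

# Repeat both sides with "-m host" to get the ~24000 MB/s host-memory baseline under the same pinning

If the cuda-memory number improves with pinning, the gap is mostly placement; if it does not, the PCIe peer-to-peer path itself is the more likely limit.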

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • Rocky Linux 9.3
    • 5.14.0-362.13.1.el9_3.x86_64
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • 23.10-1.1.9.0
    • HW information from ibstat or ibv_devinfo -vv command
[root@daisy00 ~]# ibv_devinfo -vv
hca_id:	mlx5_0
	transport:			InfiniBand (0)
	fw_ver:				20.39.2048
	node_guid:			b83f:d203:00a6:d0cc
	sys_image_guid:			b83f:d203:00a6:d0cc
	vendor_id:			0x02c9
	vendor_part_id:			4123
	hw_ver:				0x0
	board_id:			MT_0000000225
	phys_port_cnt:			1
	max_mr_size:			0xffffffffffffffff
	page_size_cap:			0xfffffffffffff000
	max_qp:				131072
	max_qp_wr:			32768
	device_cap_flags:		0x21361c36
					BAD_PKEY_CNTR
					BAD_QKEY_CNTR
					AUTO_PATH_MIG
					CHANGE_PHY_PORT
					PORT_ACTIVE_EVENT
					SYS_IMAGE_GUID
					RC_RNR_NAK_GEN
					MEM_WINDOW
					UD_IP_CSUM
					XRC
					MEM_MGT_EXTENSIONS
					MEM_WINDOW_TYPE_2B
					MANAGED_FLOW_STEERING
	max_sge:			30
	max_sge_rd:			30
	max_cq:				16777216
	max_cqe:			4194303
	max_mr:				16777216
	max_pd:				8388608
	max_qp_rd_atom:			16
	max_ee_rd_atom:			0
	max_res_rd_atom:		2097152
	max_qp_init_rd_atom:		16
	max_ee_init_rd_atom:		0
	atomic_cap:			ATOMIC_HCA (1)
	max_ee:				0
	max_rdd:			0
	max_mw:				16777216
	max_raw_ipv6_qp:		0
	max_raw_ethy_qp:		0
	max_mcast_grp:			2097152
	max_mcast_qp_attach:		240
	max_total_mcast_qp_attach:	503316480
	max_ah:				2147483647
	max_fmr:			0
	max_srq:			8388608
	max_srq_wr:			32767
	max_srq_sge:			31
	max_pkeys:			128
	local_ca_ack_delay:		16
	general_odp_caps:
					ODP_SUPPORT
					ODP_SUPPORT_IMPLICIT
	rc_odp_caps:
					SUPPORT_SEND
					SUPPORT_RECV
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_ATOMIC
					SUPPORT_SRQ
	uc_odp_caps:
					NO SUPPORT
	ud_odp_caps:
					SUPPORT_SEND
	xrc_odp_caps:
					SUPPORT_SEND
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_ATOMIC
					SUPPORT_SRQ
	completion timestamp_mask:			0x7fffffffffffffff
	hca_core_clock:			156250kHZ
	device_cap_flags_ex:		0x3000005021361C36
					PCI_WRITE_END_PADDING
					Unknown flags: 0x3000004000000000
	tso_caps:
		max_tso:			0
	rss_caps:
		max_rwq_indirection_tables:			0
		max_rwq_indirection_table_size:			0
		rx_hash_function:				0x0
		rx_hash_fields_mask:				0x0
	max_wq_type_rq:			0
	packet_pacing_caps:
		qp_rate_limit_min:	0kbps
		qp_rate_limit_max:	0kbps
	max_rndv_hdr_size:		64
	max_num_tags:			127
	max_ops:			32768
	max_sge:			1
	flags:
					IBV_TM_CAP_RC

	cq moderation caps:
		max_cq_count:	65535
		max_cq_period:	4095 us

	maximum available device memory:	131072Bytes

	num_comp_vectors:		63
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			1
			port_lid:		8
			port_lmc:		0x00
			link_layer:		InfiniBand
			max_msg_sz:		0x40000000
			port_cap_flags:		0xa251e84a
			port_cap_flags2:	0x0032
			max_vl_num:		4 (3)
			bad_pkey_cntr:		0x0
			qkey_viol_cntr:		0x0
			sm_sl:			0
			pkey_tbl_len:		128
			gid_tbl_len:		8
			subnet_timeout:		18
			init_type_reply:	0
			active_width:		4X (2)
			active_speed:		50.0 Gbps (64)
			phys_state:		LINK_UP (5)
			GID[  0]:		fe80:0000:0000:0000:b83f:d203:00a6:d0cc

hca_id:	mlx5_1
	transport:			InfiniBand (0)
	fw_ver:				20.39.2048
	node_guid:			b83f:d203:00a6:d0cd
	sys_image_guid:			b83f:d203:00a6:d0cc
	vendor_id:			0x02c9
	vendor_part_id:			4123
	hw_ver:				0x0
	board_id:			MT_0000000225
	phys_port_cnt:			1
	max_mr_size:			0xffffffffffffffff
	page_size_cap:			0xfffffffffffff000
	max_qp:				131072
	max_qp_wr:			32768
	device_cap_flags:		0x21361c36
					BAD_PKEY_CNTR
					BAD_QKEY_CNTR
					AUTO_PATH_MIG
					CHANGE_PHY_PORT
					PORT_ACTIVE_EVENT
					SYS_IMAGE_GUID
					RC_RNR_NAK_GEN
					MEM_WINDOW
					UD_IP_CSUM
					XRC
					MEM_MGT_EXTENSIONS
					MEM_WINDOW_TYPE_2B
					MANAGED_FLOW_STEERING
	max_sge:			30
	max_sge_rd:			30
	max_cq:				16777216
	max_cqe:			4194303
	max_mr:				16777216
	max_pd:				8388608
	max_qp_rd_atom:			16
	max_ee_rd_atom:			0
	max_res_rd_atom:		2097152
	max_qp_init_rd_atom:		16
	max_ee_init_rd_atom:		0
	atomic_cap:			ATOMIC_HCA (1)
	max_ee:				0
	max_rdd:			0
	max_mw:				16777216
	max_raw_ipv6_qp:		0
	max_raw_ethy_qp:		0
	max_mcast_grp:			2097152
	max_mcast_qp_attach:		240
	max_total_mcast_qp_attach:	503316480
	max_ah:				2147483647
	max_fmr:			0
	max_srq:			8388608
	max_srq_wr:			32767
	max_srq_sge:			31
	max_pkeys:			128
	local_ca_ack_delay:		16
	general_odp_caps:
					ODP_SUPPORT
					ODP_SUPPORT_IMPLICIT
	rc_odp_caps:
					SUPPORT_SEND
					SUPPORT_RECV
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_ATOMIC
					SUPPORT_SRQ
	uc_odp_caps:
					NO SUPPORT
	ud_odp_caps:
					SUPPORT_SEND
	xrc_odp_caps:
					SUPPORT_SEND
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_ATOMIC
					SUPPORT_SRQ
	completion timestamp_mask:			0x7fffffffffffffff
	hca_core_clock:			156250kHZ
	device_cap_flags_ex:		0x3000005021361C36
					PCI_WRITE_END_PADDING
					Unknown flags: 0x3000004000000000
	tso_caps:
		max_tso:			0
	rss_caps:
		max_rwq_indirection_tables:			0
		max_rwq_indirection_table_size:			0
		rx_hash_function:				0x0
		rx_hash_fields_mask:				0x0
	max_wq_type_rq:			0
	packet_pacing_caps:
		qp_rate_limit_min:	0kbps
		qp_rate_limit_max:	0kbps
	max_rndv_hdr_size:		64
	max_num_tags:			127
	max_ops:			32768
	max_sge:			1
	flags:
					IBV_TM_CAP_RC

	cq moderation caps:
		max_cq_count:	65535
		max_cq_period:	4095 us

	maximum available device memory:	131072Bytes

	num_comp_vectors:		63
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			1
			port_lid:		9
			port_lmc:		0x00
			link_layer:		InfiniBand
			max_msg_sz:		0x40000000
			port_cap_flags:		0xa251e848
			port_cap_flags2:	0x0032
			max_vl_num:		4 (3)
			bad_pkey_cntr:		0x0
			qkey_viol_cntr:		0x0
			sm_sl:			0
			pkey_tbl_len:		128
			gid_tbl_len:		8
			subnet_timeout:		18
			init_type_reply:	0
			active_width:		4X (2)
			active_speed:		50.0 Gbps (64)
			phys_state:		LINK_UP (5)
			GID[  0]:		fe80:0000:0000:0000:b83f:d203:00a6:d0cd

hca_id:	mlx5_2
	transport:			InfiniBand (0)
	fw_ver:				20.39.2048
	node_guid:			b83f:d203:00a6:c784
	sys_image_guid:			b83f:d203:00a6:c784
	vendor_id:			0x02c9
	vendor_part_id:			4123
	hw_ver:				0x0
	board_id:			MT_0000000225
	phys_port_cnt:			1
	max_mr_size:			0xffffffffffffffff
	page_size_cap:			0xfffffffffffff000
	max_qp:				131072
	max_qp_wr:			32768
	device_cap_flags:		0x21361c36
					BAD_PKEY_CNTR
					BAD_QKEY_CNTR
					AUTO_PATH_MIG
					CHANGE_PHY_PORT
					PORT_ACTIVE_EVENT
					SYS_IMAGE_GUID
					RC_RNR_NAK_GEN
					MEM_WINDOW
					UD_IP_CSUM
					XRC
					MEM_MGT_EXTENSIONS
					MEM_WINDOW_TYPE_2B
					MANAGED_FLOW_STEERING
	max_sge:			30
	max_sge_rd:			30
	max_cq:				16777216
	max_cqe:			4194303
	max_mr:				16777216
	max_pd:				8388608
	max_qp_rd_atom:			16
	max_ee_rd_atom:			0
	max_res_rd_atom:		2097152
	max_qp_init_rd_atom:		16
	max_ee_init_rd_atom:		0
	atomic_cap:			ATOMIC_HCA (1)
	max_ee:				0
	max_rdd:			0
	max_mw:				16777216
	max_raw_ipv6_qp:		0
	max_raw_ethy_qp:		0
	max_mcast_grp:			2097152
	max_mcast_qp_attach:		240
	max_total_mcast_qp_attach:	503316480
	max_ah:				2147483647
	max_fmr:			0
	max_srq:			8388608
	max_srq_wr:			32767
	max_srq_sge:			31
	max_pkeys:			128
	local_ca_ack_delay:		16
	general_odp_caps:
					ODP_SUPPORT
					ODP_SUPPORT_IMPLICIT
	rc_odp_caps:
					SUPPORT_SEND
					SUPPORT_RECV
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_ATOMIC
					SUPPORT_SRQ
	uc_odp_caps:
					NO SUPPORT
	ud_odp_caps:
					SUPPORT_SEND
	xrc_odp_caps:
					SUPPORT_SEND
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_ATOMIC
					SUPPORT_SRQ
	completion timestamp_mask:			0x7fffffffffffffff
	hca_core_clock:			156250kHZ
	device_cap_flags_ex:		0x3000005021361C36
					PCI_WRITE_END_PADDING
					Unknown flags: 0x3000004000000000
	tso_caps:
		max_tso:			0
	rss_caps:
		max_rwq_indirection_tables:			0
		max_rwq_indirection_table_size:			0
		rx_hash_function:				0x0
		rx_hash_fields_mask:				0x0
	max_wq_type_rq:			0
	packet_pacing_caps:
		qp_rate_limit_min:	0kbps
		qp_rate_limit_max:	0kbps
	max_rndv_hdr_size:		64
	max_num_tags:			127
	max_ops:			32768
	max_sge:			1
	flags:
					IBV_TM_CAP_RC

	cq moderation caps:
		max_cq_count:	65535
		max_cq_period:	4095 us

	maximum available device memory:	131072Bytes

	num_comp_vectors:		63
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			1
			port_lid:		6
			port_lmc:		0x00
			link_layer:		InfiniBand
			max_msg_sz:		0x40000000
			port_cap_flags:		0xa251e848
			port_cap_flags2:	0x0032
			max_vl_num:		4 (3)
			bad_pkey_cntr:		0x0
			qkey_viol_cntr:		0x0
			sm_sl:			0
			pkey_tbl_len:		128
			gid_tbl_len:		8
			subnet_timeout:		18
			init_type_reply:	0
			active_width:		4X (2)
			active_speed:		50.0 Gbps (64)
			phys_state:		LINK_UP (5)
			GID[  0]:		fe80:0000:0000:0000:b83f:d203:00a6:c784

hca_id:	mlx5_3
	transport:			InfiniBand (0)
	fw_ver:				20.39.2048
	node_guid:			b83f:d203:00a6:c785
	sys_image_guid:			b83f:d203:00a6:c784
	vendor_id:			0x02c9
	vendor_part_id:			4123
	hw_ver:				0x0
	board_id:			MT_0000000225
	phys_port_cnt:			1
	max_mr_size:			0xffffffffffffffff
	page_size_cap:			0xfffffffffffff000
	max_qp:				131072
	max_qp_wr:			32768
	device_cap_flags:		0x21361c36
					BAD_PKEY_CNTR
					BAD_QKEY_CNTR
					AUTO_PATH_MIG
					CHANGE_PHY_PORT
					PORT_ACTIVE_EVENT
					SYS_IMAGE_GUID
					RC_RNR_NAK_GEN
					MEM_WINDOW
					UD_IP_CSUM
					XRC
					MEM_MGT_EXTENSIONS
					MEM_WINDOW_TYPE_2B
					MANAGED_FLOW_STEERING
	max_sge:			30
	max_sge_rd:			30
	max_cq:				16777216
	max_cqe:			4194303
	max_mr:				16777216
	max_pd:				8388608
	max_qp_rd_atom:			16
	max_ee_rd_atom:			0
	max_res_rd_atom:		2097152
	max_qp_init_rd_atom:		16
	max_ee_init_rd_atom:		0
	atomic_cap:			ATOMIC_HCA (1)
	max_ee:				0
	max_rdd:			0
	max_mw:				16777216
	max_raw_ipv6_qp:		0
	max_raw_ethy_qp:		0
	max_mcast_grp:			2097152
	max_mcast_qp_attach:		240
	max_total_mcast_qp_attach:	503316480
	max_ah:				2147483647
	max_fmr:			0
	max_srq:			8388608
	max_srq_wr:			32767
	max_srq_sge:			31
	max_pkeys:			128
	local_ca_ack_delay:		16
	general_odp_caps:
					ODP_SUPPORT
					ODP_SUPPORT_IMPLICIT
	rc_odp_caps:
					SUPPORT_SEND
					SUPPORT_RECV
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_ATOMIC
					SUPPORT_SRQ
	uc_odp_caps:
					NO SUPPORT
	ud_odp_caps:
					SUPPORT_SEND
	xrc_odp_caps:
					SUPPORT_SEND
					SUPPORT_WRITE
					SUPPORT_READ
					SUPPORT_ATOMIC
					SUPPORT_SRQ
	completion timestamp_mask:			0x7fffffffffffffff
	hca_core_clock:			156250kHZ
	device_cap_flags_ex:		0x3000005021361C36
					PCI_WRITE_END_PADDING
					Unknown flags: 0x3000004000000000
	tso_caps:
		max_tso:			0
	rss_caps:
		max_rwq_indirection_tables:			0
		max_rwq_indirection_table_size:			0
		rx_hash_function:				0x0
		rx_hash_fields_mask:				0x0
	max_wq_type_rq:			0
	packet_pacing_caps:
		qp_rate_limit_min:	0kbps
		qp_rate_limit_max:	0kbps
	max_rndv_hdr_size:		64
	max_num_tags:			127
	max_ops:			32768
	max_sge:			1
	flags:
					IBV_TM_CAP_RC

	cq moderation caps:
		max_cq_count:	65535
		max_cq_period:	4095 us

	maximum available device memory:	131072Bytes

	num_comp_vectors:		63
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			1
			port_lid:		7
			port_lmc:		0x00
			link_layer:		InfiniBand
			max_msg_sz:		0x40000000
			port_cap_flags:		0xa251e848
			port_cap_flags2:	0x0032
			max_vl_num:		4 (3)
			bad_pkey_cntr:		0x0
			qkey_viol_cntr:		0x0
			sm_sl:			0
			pkey_tbl_len:		128
			gid_tbl_len:		8
			subnet_timeout:		18
			init_type_reply:	0
			active_width:		4X (2)
			active_speed:		50.0 Gbps (64)
			phys_state:		LINK_UP (5)
			GID[  0]:		fe80:0000:0000:0000:b83f:d203:00a6:c785
  • For GPU related issues:
    • GPU type
[root@daisy00 ~]# nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	NIC0	NIC1	NIC2	NIC3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV4	SYS	SYS	SYS	SYS	SYS	SYS	0,2,4,6,8,10	0		N/A
GPU1	NV4	 X 	SYS	SYS	SYS	SYS	SYS	SYS	0,2,4,6,8,10	0		N/A
GPU2	SYS	SYS	 X 	NV4	SYS	SYS	SYS	SYS	1,3,5,7,9,11	1		N/A
GPU3	SYS	SYS	NV4	 X 	SYS	SYS	SYS	SYS	1,3,5,7,9,11	1		N/A
NIC0	SYS	SYS	SYS	SYS	 X 	PIX	SYS	SYS				
NIC1	SYS	SYS	SYS	SYS	PIX	 X 	SYS	SYS				
NIC2	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX				
NIC3	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
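
One thing that stands out in the matrix above: every GPU<->NIC pair is reported as SYS, i.e., GPUDirect RDMA traffic has to cross the inter-socket interconnect rather than a shared PCIe switch, which often limits the achievable peer-to-peer bandwidth. The negotiated PCIe link of the GPU and HCA can be double-checked with lspci; the GPU bus ID below is taken from the nvidia-smi output further down, and the NIC bus ID has to be looked up first (placeholder shown):

# PCIe bus ID of the HCA
readlink -f /sys/class/infiniband/mlx5_0/device

# Negotiated link speed/width (expect "16GT/s, Width x16" for PCIe Gen4 x16)
sudo lspci -s 17:00.0 -vv | grep -E 'LnkCap:|LnkSta:'        # GPU0
sudo lspci -s <nic_bus_id> -vv | grep -E 'LnkCap:|LnkSta:'   # mlx5_0, bus ID from the readlink above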

  • Cuda:
    • Drivers version
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A30                     On  | 00000000:17:00.0 Off |                    0 |
| N/A   27C    P0              29W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A30                     On  | 00000000:65:00.0 Off |                    0 |
| N/A   26C    P0              30W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A30                     On  | 00000000:CA:00.0 Off |                    0 |
| N/A   27C    P0              30W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A30                     On  | 00000000:E3:00.0 Off |                    0 |
| N/A   28C    P0              30W / 165W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

  • Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
nvidia_peermem         24576  0
ib_core               573440  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia               7794688  20 nvidia_uvm,nvidia_peermem,nvidia_modeset
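
nvidia_peermem being loaded is necessary but not sufficient, so it may also be worth confirming that UCX actually selects the CUDA transports and the GPUDirect RDMA path for this run. Two quick checks (standard UCX knobs; the exact output will vary by build):

# List the transports/memory types UCX recognizes (look for cuda_copy / cuda_ipc / gdr_copy)
ucx_info -d | grep -i cuda

# Require GPUDirect RDMA instead of letting UCX silently fall back, then rerun the benchmark
UCX_IB_GPU_DIRECT_RDMA=yes UCX_NET_DEVICES=mlx5_0:1 \
    ucx_perftest -t tag_bw -m cuda -s $((64 * 1024 * 1024))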

Additional information (depending on the issue)

  • OpenMPI version
  • Output of ucx_info -d to show transports and devices recognized by UCX
  • Configure result - config.log
  • Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"

Yiltan avatar Jul 12 '23 22:07 Yiltan

@Yiltan What is the CPU model and the PCIe generation? Maybe the PCIe P2P bandwidth is limited.
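
One way to check that independently of UCX is the verbs-level benchmark from the perftest package, run directly on GPU memory (this assumes ib_write_bw was built with CUDA support; depending on the version, --use_cuda may or may not take a GPU index):

# Node A (server): RDMA-write bandwidth sweep with the buffer in GPU 0 memory
ib_write_bw -d mlx5_0 -a --use_cuda=0

# Node B (client)
ib_write_bw -d mlx5_0 -a --use_cuda=0 nodeA

# Same sweep with host memory as the baseline: drop --use_cuda on both sides

If the GPU-memory numbers are capped around 12 GB/s there as well, the limit sits below UCX (on the PCIe peer-to-peer path) rather than in the UCX configuration.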

yosefe avatar Jul 13 '23 11:07 yosefe

The CPU is an Intel(R) Xeon(R) Gold 6338, and the PCIe is Gen4 (confirmed with lspci that the slots are x16 and that each lane runs at 16 GT/s).

IOMMU is disabled, and the PCI Access Control Services settings are as follows (all ACS redirection bits are off, so ACS should not be forcing peer-to-peer traffic through the root complex):

[admin@daisy01 ~]$ sudo lspci -vvv | grep ACSCtl
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

If there is anything else I can provide, I'd be happy to do so

Yiltan avatar Jan 09 '24 17:01 Yiltan

Hi @Yiltan, I've run into the same issue. Have you solved it? If so, could you kindly share how?

Hunter1016 avatar Jun 17 '24 13:06 Hunter1016

@Hunter1016, it was never resolved; it ended up being a hardware limitation of the platform (check the nvidia-smi topo output above: every GPU<->NIC path is SYS, i.e., it crosses the inter-socket interconnect).

Yiltan avatar Jun 17 '24 17:06 Yiltan

> @Hunter1016, it was never resolved; it ended up being a hardware limitation of the platform (check the nvidia-smi topo output).

Sorry, the hardware-limitation conclusion from the nvidia-smi output is a bit confusing to me. I also tried running all_reduce_perf between two ranks, one per node, and got about 20 GB/s algbw and busbw. I'm afraid it's not limited by hardware; it more likely comes from software or configuration.
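
(A caveat on that comparison: all_reduce_perf pipelines through the GPUs and may spread traffic over more than one HCA, so its busbw number is not directly comparable with a single-rail ucx_perftest stream. A rough apples-to-apples check, sketched here assuming the UCX build honors UCX_MAX_RNDV_RAILS and using the device names from the system above, would be a two-rail run:)

# Node A: stripe the rendezvous transfer over two HCAs (adjust device names to the local topology)
UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 UCX_MAX_RNDV_RAILS=2 \
    ucx_perftest -t tag_bw -m cuda -s $((64 * 1024 * 1024))

# Node B
UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 UCX_MAX_RNDV_RAILS=2 \
    ucx_perftest nodeA -t tag_bw -m cuda -s $((64 * 1024 * 1024))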

Hunter1016 avatar Jun 18 '24 03:06 Hunter1016