Failing unit tests with ucx 1.14.0 and ROCm 5.1.0
Describe the bug
ROCm-related unit tests fail; see rocm.log.
Setup and versions
- CentOS Stream 8
- rdma-core-55mlnx-37-1.55103.x86_64
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 20.32.1010
node_guid: 043f:7203:00da:6a3e
sys_image_guid: 043f:7203:00da:6a3e
vendor_id: 0x02c9
vendor_part_id: 4123
hw_ver: 0x0
board_id: MT_0000000223
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 2
port_lmc: 0x00
link_layer: InfiniBand
- For GPU related issues: see rocminfo.log
Additional information (depending on the issue)
- OpenMPI version
ucx_info -v
# Library version: 1.14.0
# Library path: /home/greole/.local/lib/libucs.so.0
# API headers version: 1.14.0
# Git branch 'master', revision 130c572
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/greole/.local --with-rocm=/opt/rocm-5.1.0 --with-rdmacm
- Configure result: see config.log
Issue https://github.com/openucx/ucx/issues/5485 seems to be related.
@greole Unfortunately I cannot reproduce the bug; all ROCm gtests pass on my setup with ROCm 5.1.1. The issue you pointed to as related should be fixed by PR https://github.com/openucx/ucx/pull/8197 and PR https://github.com/openucx/ucx/pull/8315.
Could you try the most recent stable UCX release, 1.13.0 (released just a couple of days ago), and double-check that you are not accidentally picking up an old UCX version through LD_LIBRARY_PATH or similar?
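One way to look for a stale UCX build is sketched below. It is only a rough check, not an official UCX procedure: `find_ucx_dirs` is a hypothetical helper name, and the sketch assumes the libraries are named `libucs.so*`, as in the library path reported above. It lists every directory on the search path that contains a UCX library, then, if `ucx_info` is on `PATH`, shows which `libucs` the dynamic loader resolves.

```shell
# Hypothetical helper: print each directory of a colon-separated search path
# (default: LD_LIBRARY_PATH) that contains a libucs shared object.
# More than one hit suggests an older UCX build may shadow the intended one.
find_ucx_dirs() {
    printf '%s\n' "${1:-${LD_LIBRARY_PATH:-}}" | tr ':' '\n' |
    while IFS= read -r dir; do
        [ -n "$dir" ] || continue
        if ls "$dir"/libucs.so* >/dev/null 2>&1; then
            printf '%s\n' "$dir"
        fi
    done
}

find_ucx_dirs

# If ucx_info is installed, show which libucs it is actually linked against
# at run time and the version it reports.
if command -v ucx_info >/dev/null 2>&1; then
    ldd "$(command -v ucx_info)" | grep libucs || true
    ucx_info -v
fi
```

If the path printed by `ldd` differs from the install prefix used at configure time, an older installation is probably being picked up.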
Could you also confirm that the large-bar test detailed here works for you? If it does not, that could point to a permission problem (e.g. you are not a member of the video or render group), which can manifest itself in different ways.
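The group membership mentioned above can be checked quickly from a shell. This is a minimal sketch: `in_group` is a hypothetical helper name, and the `/dev/kfd` check assumes the usual ROCm setup in which that device node must be readable and writable by the user running the tests.

```shell
# Hypothetical helper: succeeds if the current user belongs to the named group.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx -- "$1"
}

for grp in video render; do
    if in_group "$grp"; then
        echo "member of $grp"
    else
        echo "NOT a member of $grp"
    fi
done

# Assumption: /dev/kfd is the ROCm compute device node on this system.
if [ -e /dev/kfd ]; then
    if [ -r /dev/kfd ] && [ -w /dev/kfd ]; then
        echo "/dev/kfd: accessible"
    else
        echo "/dev/kfd: no read/write access"
    fi
fi
```

Group changes usually take effect only after logging out and back in, so a fresh session may be needed before re-running the tests.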
Thanks for having a look at this.
> Could you maybe also confirm that the large-bar test detailed here is working for you? Otherwise this could hint at a permission problem (e.g. that you are not part of the video or render group) which could manifest itself in different ways.
The large-bar test works:
[greole ] ./check_large_bar
address buf 0x7fc8c0600000
Buf[0] = -1094795586
Buf[0] = 1
I'll do some further tests to see where the problem arises.