
Failing unit tests with ucx 1.14.0 and ROCm 5.1.0

Open greole opened this issue 2 years ago • 4 comments

Describe the bug

The ROCm-related unit tests fail; see rocm.log.

Setup and versions

  • CentOS Stream 8
  • rdma-core-55mlnx-37-1.55103.x86_64
ibv_devinfo 
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         20.32.1010
        node_guid:                      043f:7203:00da:6a3e
        sys_image_guid:                 043f:7203:00da:6a3e
        vendor_id:                      0x02c9
        vendor_part_id:                 4123
        hw_ver:                         0x0
        board_id:                       MT_0000000223
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               2
                        port_lmc:               0x00
                        link_layer:             InfiniBand

Additional information (depending on the issue)

  • OpenMPI version
ucx_info -v
# Library version: 1.14.0
# Library path: /home/greole/.local/lib/libucs.so.0
# API headers version: 1.14.0
# Git branch 'master', revision 130c572
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/home/greole/.local --with-rocm=/opt/rocm-5.1.0 --with-rdmacm

greole avatar Jul 09 '22 12:07 greole

Issue https://github.com/openucx/ucx/issues/5485 seems to be related.

greole avatar Jul 12 '22 07:07 greole

@greole Unfortunately, I cannot reproduce the bug; all gtests for ROCm pass on my setup with ROCm 5.1.1. The bug that you pointed to as a related issue should be fixed by PR https://github.com/openucx/ucx/pull/8197 and PR https://github.com/openucx/ucx/pull/8315 .

Could you maybe try the most recent stable ucx version 1.13.0 (released just a couple of days ago), and double check that you are not accidentally pulling in an old ucx version through LD_LIBRARY_PATH or similar?
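One way to check the LD_LIBRARY_PATH part is sketched below. It only lists UCX libraries visible through that variable; the helper name `find_ucx_libs` and the choice of `libucs`/`libucp` as the file patterns to look for are assumptions made for illustration, and libraries picked up through other mechanisms (rpath, the system linker cache) are not covered.

```python
import glob
import os


def find_ucx_libs(search_path):
    """Return UCX shared libraries found in a colon-separated directory list.

    Hypothetical helper: it only globs for libucs/libucp files and does not
    emulate the dynamic linker's full lookup order.
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        for pattern in ("libucs.so*", "libucp.so*"):
            hits.extend(sorted(glob.glob(os.path.join(directory, pattern))))
    return hits


if __name__ == "__main__":
    for lib in find_ucx_libs(os.environ.get("LD_LIBRARY_PATH", "")):
        # realpath resolves symlinks, so the real versioned file is shown
        print(lib, "->", os.path.realpath(lib))
```

If more than one directory shows up, the first one in the list is normally the one that wins, so it is worth comparing it against the "Library path" line printed by ucx_info -v.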

edgargabriel avatar Jul 12 '22 12:07 edgargabriel

Could you maybe also confirm that the large-bar test detailed here is working for you? Otherwise this could hint at a permission problem (e.g. that you are not part of the video or render group), which could manifest itself in different ways.
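The group membership can be checked quickly with a few lines of Python. This is a minimal sketch: the group names video and render are taken from the comment above, the function names are made up for illustration, and whether both groups are actually required depends on the distribution and ROCm setup.

```python
import grp
import os


def missing_groups(member_of, required=("video", "render")):
    """Return the required group names that are absent from member_of."""
    return [name for name in required if name not in member_of]


def current_group_names():
    """Names of the groups the current process belongs to (Unix only)."""
    names = set()
    # getgroups() may omit the effective gid, so add it explicitly
    for gid in set(os.getgroups()) | {os.getegid()}:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            pass  # gid without an entry in the group database
    return names


if __name__ == "__main__":
    missing = missing_groups(current_group_names())
    print("missing groups:", ", ".join(missing) if missing else "none")
```

Note that a newly added group membership only takes effect in a fresh login session.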

edgargabriel avatar Jul 12 '22 14:07 edgargabriel

Thanks for having a look at this.

> Could you maybe also confirm that the large-bar test detailed here is working for you? Otherwise this could hint at a permission problem (e.g. that you are not part of the video or render group), which could manifest itself in different ways.

The large-bar test works:

[greole ] ./check_large_bar 
address buf 0x7fc8c0600000 
Buf[0] = -1094795586
Buf[0] = 1
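As a side note on reading this output: the first value, -1094795586, is what the byte pattern 0xBEBEBEBE looks like when printed as a signed 32-bit integer. The short sketch below only demonstrates that arithmetic. That the test fills the buffer with 0xBE bytes before the host writes 1 is an assumption, since the test source is not shown in this thread.

```python
import struct


def as_signed_int32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


if __name__ == "__main__":
    # Every byte set to 0xBE, read back as a signed int
    print(as_signed_int32(0xBEBEBEBE))
    # The reverse direction: mask the printed value back to 32 bits
    print(hex(-1094795586 & 0xFFFFFFFF))
```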

I'll do some further tests to see where the problem arises.

greole avatar Jul 15 '22 05:07 greole