rocm_ipc_md.c:69 UCX ERROR Failed to create ipc
Describe the bug
I've been following these instructions for ROCm-aware MPI on a Zen2 server node with a Radeon VII and ROCm-3.5.0. The large BAR test passed and the builds all went smoothly, but the OSU bandwidth test prints repeated UCX errors:
$ $OMPI_DIR/bin/mpirun -n 2 --mca btl '^openib' mpi/pt2pt/osu_bw -d rocm D D
[1595887875.771550] [noether:3621053:0] parser.c:1626 UCX WARN unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1595887875.771658] [noether:3621052:0] parser.c:1626 UCX WARN unused env variable: UCX_DIR (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
# OSU MPI-ROCM Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 0.67
2 1.32
4 2.66
8 5.28
16 10.73
32 11.04
64 11.86
128 12.51
256 19.71
512 38.49
1024 68.15
2048 121.26
4096 158.25
8192 167.94
[1595887877.188205] [noether:3621052:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188222] [noether:3621052:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188226] [noether:3621052:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188230] [noether:3621052:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188233] [noether:3621052:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188235] [noether:3621052:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188238] [noether:3621052:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7f99efe00000/4000
[...many screens of similar output...]
65536 205.33
[...]
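To see where in the size sweep the errors begin, the bandwidth rows can be separated from the UCX log lines with a small script. This is only a minimal sketch and is not part of UCX or the OSU benchmarks: the function name `summarize_osu_log` and both regular expressions are assumptions based solely on the output format shown above.

```python
import re
from collections import Counter

# Both patterns are assumptions derived from the log excerpt above.
ROW_RE = re.compile(r"^(\d+)\s+(\d+(?:\.\d+)?)$")   # "<size> <bandwidth>"
ERR_RE = re.compile(r"UCX\s+ERROR\s+(.*)$")          # text after "UCX ERROR"


def summarize_osu_log(text):
    """Return (rows, error_counts, last_size_before_first_error)."""
    rows = []
    errors = Counter()
    last_size_before_error = None
    for line in text.splitlines():
        line = line.strip()
        row = ROW_RE.match(line)
        if row:
            rows.append((int(row.group(1)), float(row.group(2))))
            continue
        err = ERR_RE.search(line)
        if err:
            # Remember the last completed row when the first error shows up.
            if not errors and rows:
                last_size_before_error = rows[-1][0]
            errors[err.group(1)] += 1
    return rows, errors, last_size_before_error


sample = """\
4096 158.25
8192 167.94
[1595887877.188205] [noether:3621052:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7f99efe00000/4000
[1595887877.188222] [noether:3621052:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7f99efe00000/4000
65536 205.33
"""
rows, errors, last_size = summarize_osu_log(sample)
print(last_size, dict(errors))
```

On the excerpt above this reports that the first error follows the 8192-byte row, and it collapses the repeated messages into a single counted entry.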
The relevant function is
static hsa_status_t uct_rocm_ipc_pack_key(void *address, size_t length,
                                          uct_rocm_ipc_key_t *key)
{
    hsa_status_t status;
    hsa_agent_t agent;
    void *base_ptr;
    size_t size;

    status = uct_rocm_base_get_ptr_info(address, length, &base_ptr, &size, &agent);
    if (status != HSA_STATUS_SUCCESS) {
        ucs_error("pack none ROCM ptr %p/%lx", address, length);
        return status;
    }

    status = hsa_amd_ipc_memory_create(base_ptr, size, &key->ipc);
    if (status != HSA_STATUS_SUCCESS) {
        ucs_error("Failed to create ipc for %p/%lx", address, length);
        return status;
    }

    key->address = (uintptr_t)base_ptr;
    key->length = size;
    key->dev_num = uct_rocm_base_get_dev_num(agent);

    return HSA_STATUS_SUCCESS;
}
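Note that the `ucs_error` call formats `length` with `%lx`, so the `4000` after the slash in the log is hexadecimal. A minimal sketch of decoding one such line follows; the helper name `decode_ipc_error` is hypothetical, and the pattern assumes the message format used in the function above.

```python
import re

# Pattern assumed from the ucs_error format string "%p/%lx" quoted above.
IPC_ERR_RE = re.compile(
    r"Failed to create ipc for (0x[0-9a-fA-F]+)/([0-9a-fA-F]+)"
)


def decode_ipc_error(line):
    """Return (address, length_in_bytes) from an IPC error line, or None."""
    m = IPC_ERR_RE.search(line)
    if m is None:
        return None
    # Both fields are printed in hex; only the pointer carries a 0x prefix.
    return int(m.group(1), 16), int(m.group(2), 16)


line = "rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7f99efe00000/4000"
address, length = decode_ipc_error(line)
print(hex(address), length)
```

The decoded length is 16384 bytes. That would be consistent with the errors first appearing right after the 8192-byte row in the benchmark output, assuming the size sweep keeps doubling.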
Steps to Reproduce
Install latest versions per https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI and try the intranode test.
# UCT version=1.10.0 revision bae84af
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-optimizations --prefix=/projects/ucx/ucx --with-rocm=/opt/rocm --without-knem --without-cuda
Setup and versions
- OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
- Debian bullseye
Linux noether 5.7.0-1-amd64 #1 SMP Debian 5.7.6-1 (2020-06-24) x86_64 GNU/Linux
- For GPU related issues:
- GPU type: Radeon VII. From rocminfo:
*******
Agent 9
*******
Name: gfx906
Uuid: GPU-7772516172dc76ba
Marketing Name: Vega 20 [Radeon VII]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 8
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26287(0x66af)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1802
BDFID: 41728
Internal Node ID: 8
Compute Unit: 60
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16760832(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Additional information (depending on the issue)
Package: Open MPI jeka2967@noether Distribution
Open MPI: 5.0.0a1
Open MPI repo revision: v2.x-dev-7987-gc07d77fbf2
Open MPI release date: Unreleased developer copy
MPI API: 3.1.0
Ident string: 5.0.0a1
Prefix: /projects/ucx/ompi
Configured architecture: x86_64-pc-linux-gnu
Configured by: jeka2967
Configured on: Mon Jul 27 15:51:12 MDT 2020
Configure host: noether
Configure command line: '--prefix=/projects/ucx/ompi' '--with-ucx=/projects/ucx/ucx' '--without-verbs'
Hi @jedbrown, we fixed several IPC issues in the ROCm 3.7 release. Can you give it a try? It's recommended to uninstall any older ROCm version before installing 3.7.
I've been using ROCM-3.7 successfully for a few days, but unfortunately, this UCX error is still present with fresh rebuilds of ucx, ompi, and the osu benchmark.
$ $OMPI_DIR/bin/mpiexec -n 2 --mca btl '^openib' mpi/pt2pt/osu_bw -d rocm D D
# OSU MPI-ROCM Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 0.64
2 1.37
4 2.75
8 5.56
16 10.96
32 11.12
64 11.82
128 12.70
256 19.80
512 40.31
1024 69.53
2048 113.14
4096 155.10
8192 184.66
[1598594643.109801] [noether:2043685:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[1598594643.109821] [noether:2043685:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[1598594643.109826] [noether:2043685:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[1598594643.109829] [noether:2043685:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[1598594643.109834] [noether:2043685:0] rocm_ipc_md.c:69 UCX ERROR Failed to create ipc for 0x7fd9d7e00000/4000
[...]
I'm still seeing this issue with ROCm-4.0 and a fresh build of today's ucx (1d22f7486ef4202da30ee811a95ad394b862b9a1) and ompi (8ff2277b7e48b899341f69a9f3f9c9ee7cecf476).
$ $OMPI_DIR/bin/mpiexec -n 2 --mca btl '^openib' mpi/pt2pt/osu_bw -d rocm D D
[noether:1378737] pmix_mca_base_component_repository_open: unable to open mca_ptl_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378737] pmix_mca_base_component_repository_open: unable to open mca_ptl_usock: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378758] pmix_mca_base_component_repository_open: unable to open mca_ptl_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378759] pmix_mca_base_component_repository_open: unable to open mca_ptl_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378758] pmix_mca_base_component_repository_open: unable to open mca_ptl_usock: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[noether:1378759] pmix_mca_base_component_repository_open: unable to open mca_ptl_usock: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
# OSU MPI-ROCM Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 0.53
2 0.62
4 1.24
8 2.52
16 5.14
32 10.28
64 27.05
128 13.13
256 32.39
512 23.62
1024 23.42
2048 23.28
4096 23.27
8192 23.23
[1608394399.158674] [noether:1378758:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[1608394399.158695] [noether:1378758:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[1608394399.158699] [noether:1378758:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[1608394399.158702] [noether:1378758:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7fc8a7e00000/4000
[...]
The same issue is also present with today's MPICH 'main' (dac05cf7f9ec1a59e2d917f3da80fc943f378872):
$ $MPICH_DIR/bin/mpiexec -n 2 mpi/pt2pt/osu_bw -d rocm D D
# OSU MPI-ROCM Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 0.51
2 0.64
4 1.24
8 2.69
16 5.46
32 11.48
64 28.43
128 14.47
256 34.43
512 26.37
1024 26.14
2048 25.80
4096 25.81
8192 25.71
[1608692521.150727] [noether:1682552:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7f657e600000/4000
[1608692521.150754] [noether:1682552:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7f657e600000/4000
[1608692521.150758] [noether:1682552:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7f657e600000/4000
[1608692521.150760] [noether:1682552:0] rocm_ipc_md.c:70 UCX ERROR Failed to create ipc for 0x7f657e600000/4000
Hi @jedbrown, I believe it's a ROCR issue and not UCX. I would like to send you some test programs to debug it further. Is the email @jedbrown.org good to reach?
Yes, thanks.
The root cause was confirmed to be with ROCR support for Radeon VII and not UCX. An internal issue has been raised to resolve this.
@jedbrown, ROCr supports IPC on Radeon VII. The trouble here is that you seem to be using the upstream amdgpu driver, which does not support IPC on any device. For IPC support you will need to install our DKMS amdgpu driver package. Unfortunately, Debian is not a supported OS (Ubuntu is, however), so our DKMS package may not install against your kernel.
Thanks, is the support going upstream? For various reasons, I'm not going to switch distributions, but we're currently on Linux 5.10 and follow the usual upgrades.
Has there been any update on this?
@simonbyrne, can you please try the ucx 1.13.0-rc1 release? We fixed an issue with IPC creation. We didn't test on ROCm 3.x, but we did test on ROCm 4.5 and ROCm 5.x.
Thanks, 1.13-rc1 did fix my issue (I'm not sure which ROCm version it is using).