unable to open hip GPU device (gfx1030) with latest AOMP
I have built the latest AOMP (SHA: e2f40a73975ccc0f453e8767a8016e6deb849782) from the amd-stg-open branch. However, it is unable to enumerate the HIP GPU device, even though rocminfo shows both the CPU and the GPU. I have a 6900 XT (gfx1030) and am trying to get Tensile working on it.
(I have applied https://github.com/ROCm-Developer-Tools/HIP/pull/2219 locally to fix the clang_rt builtin issue on the host.)
See below:
I am running this code: https://gitlab.com/cscs-ci/ci-testing/ault-amdgpu/-/blob/master/helloworld.cpp. It fails with hipErrorNoDevice.
I verified I am in the video group, and sudo doesn't help.
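For reference, the membership check amounts to something like this (a sketch; the `video` and `render` group names are the usual ones on an Ubuntu/ROCm setup, and the helper name `in_gpu_groups` is just for illustration):

```shell
# in_gpu_groups takes the output of `id -nG` (space-separated group names)
# and succeeds only if both the video and render groups are present.
in_gpu_groups() {
  printf '%s\n' $1 | grep -qx 'video' && printf '%s\n' $1 | grep -qx 'render'
}

if in_gpu_groups "$(id -nG)"; then
  echo "user is in video and render groups"
else
  echo "missing video and/or render group membership"
fi
```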
5950x:~/github/aomp$ /opt/rocm/bin/rocminfo
ROCk module is loaded
Able to open /dev/kfd read-write
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 9 5950X 16-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 5950X 16-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3400
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 131896948(0x7dc9674) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 131896948(0x7dc9674) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
N/A
*******
Agent 2
*******
Name: gfx1030
Uuid: GPU-XX
Marketing Name: Device 73bf
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 29631(0x73bf)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2660
BDFID: 12544
Internal Node ID: 1
Compute Unit: 80
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 64(0x40)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16760832(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1030
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
5950x:~/github/aomp$ $HIP_PATH/bin/hipconfig --full
HIP version : 4.0.20496-4f163c68
== hipconfig
HIP_PATH : /home/foo/rocm/aomp
ROCM_PATH : /home/foo/rocm/aomp_13.0-2
HIP_COMPILER : clang
HIP_PLATFORM : hcc
HIP_RUNTIME : ROCclr
CPP_CONFIG : -D__HIP_PLATFORM_HCC__= -I/home/foo/rocm/aomp/include -I/home/foo/rocm/aomp/bin/../lib/clang/13.0.0 -I/home/foo/rocm/aomp_13.0-2/hsa/include -D__HIP_ROCclr__
== hip-clang
HSA_PATH : /home/foo/rocm/aomp_13.0-2/hsa
HIP_CLANG_PATH : /home/foo/rocm/aomp/bin
AOMP_STANDALONE_13.0-2 clang version 13.0.0 (https://github.com/ROCm-Developer-Tools/llvm-project 0e52e2879ab1bbfb75630b97aa25a28ec9e73a1e)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/foo/rocm/aomp/bin
AOMP-13.0-2 (http://github.com/ROCm-Developer-Tools/aomp):
Source ID:13.0-2-0e52e2879ab1bbfb75630b97aa25a28ec9e73a1e
LLVM version 13.0.0_AOMP_STANDALONE_13.0-2
Optimized build with assertions.
Default target: x86_64-unknown-linux-gnu
Host CPU: znver3
Registered Targets:
amdgcn - AMD GCN GPUs
r600 - AMD GPUs HD2XXX-HD6XXX
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : -D__HIP_ROCclr__ -std=c++11 -isystem /home/foo/rocm/aomp_13.0-2/lib/clang/13.0.0/include/.. -isystem /home/foo/rocm/aomp_13.0-2/hsa/include -D__HIP_ROCclr__ -isystem /home/foo/rocm/aomp/include -O3
hip-clang-ldflags : -L/home/foo/rocm/aomp/lib -O3 -lgcc_s -lgcc -lpthread -lm
=== Environment Variables
PATH=/home/foo/anaconda3/bin:/home/foo/bin:/home/foo/lokal/bin/:/usr/local/cuda/bin:/usr/local/TensorRT/bin:/home/foo/bin:/home/foo/anaconda3/bin:/home/foo/bin:/home/foo/lokal/bin/:/usr/local/cuda/bin:/usr/local/TensorRT/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/usr/local/TensorRT/bin:/home/foo/rocm/aomp/bin:/usr/local/cuda/bin:/usr/local/TensorRT/bin:/home/foo/rocm/aomp/bin
LD_LIBRARY_PATH=/home/foo/github/mmperf/b_cuda/tvm-install/lib/:/home/foo/rocm/aomp/lib:/usr/local/cuda/lib64/:/usr/local/cuda/extras/CUPTI/lib64
HIP_PATH=/home/foo/rocm/aomp
CUDA_ROOT=/usr/local/cuda
CUDA=/usr/local/cuda
== Linux Kernel
Hostname : 5950x
Linux 5950x 5.11.0-051100rc7-generic #202102072330 SMP Sun Feb 7 23:33:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal
I am in both video and render groups:
id uid=1000(foo) gid=1000(foo) groups=1000(foo),4(adm),24(cdrom),27(sudo),30(dip),44(video),46(plugdev),109(render),120(lpadmin),131(lxd),132(sambashare)
I suggested export ROCM_LLC_ARGS="--amdhsa-code-object-version=3" in a side channel. That tells amd-stg-open clang to use the same code object version that rocr expects. Unfortunately that doesn't appear to resolve this.
Yup. I tried just setting the env var before building my sample with hipcc, and that didn't help. Rebuilding all of AOMP with that env var set doesn't help either.
Happy to gather any other debug information that is relevant. To make sure there is nothing interfering in /opt/rocm, I only have rocm-smi there:
ls -ltr /opt/rocm-4.0.0/lib/*
lrwxrwxrwx 1 root root 34 Dec 14 02:49 /opt/rocm-4.0.0/lib/librocm_smi64.so.2 -> ../rocm_smi/lib/librocm_smi64.so.2
lrwxrwxrwx 1 root root 32 Dec 14 02:49 /opt/rocm-4.0.0/lib/librocm_smi64.so -> ../rocm_smi/lib/librocm_smi64.so
If required I can rebuild that too, but I doubt that is the issue.
Thanks for your quick responses.
Ah, I didn't notice you were using hipcc. When I try to run HIP code locally, I get a variant of 'no devices found', which seems to correlate with an invalid branch in the HIP runtime. Running the host application under valgrind blames libamdhip64.so, at least. Hopefully Greg has more information on that; I haven't tried to debug the HIP runtime.
Just getting familiar with the runtimes. What other runtime can I use? I am trying to get Tensile going on gfx1030, which seems to require hipcc.
Yeah gdb points to libamdhip64.so.
The bottom of the stack on Linux is kfd (in the kernel), then roct, which is roughly the userspace half of the kfd driver. On top of that is rocr, an implementation of the HSA spec. Those have all been robust under my testing; the OpenMP implementation on amdgpu builds directly on top of rocr for that reason.
Depending on your use case, C++ compiled for amdgcn as freestanding and launched using the functions in hsa.h works well. OpenCL has its own runtime, but it looks like it is now built on the same foundation as HIP, so it may have the same bug reported here. Windows does some different things, and so does the graphics stack.
libamdhip64.so contains, as far as I can tell, roct, rocr, rocclr, and hip. Something in there appears to be broken. There's a lot of code, though, so it's not an easy fix. HIP mostly tracks errors through an internal Jira system.
Is Tensile the rocm library with that name? If so, an issue suggests it worked on a gfx1010 in November. You might therefore be able to get a working HIP installation by rolling back to a release made around then. I've added Siu Chi to this issue as he is much closer to the HIP development than me.
Cool. Thanks for the clarity - there are so many rocXX libs that it was hard to understand the layering. I think C++ compiled for amdgcn and launched with hsa.h is best for us. I will look around for rocr samples as a starting point.
I was trying to get Tensile up and running on gfx1030 because it provides the "baseline" GEMM routines for rocblas, and I want to compare against that performance too. I filed a few issues about it: https://github.com/ROCmSoftwarePlatform/Tensile/issues/1282 https://github.com/RadeonOpenCompute/ROCm/issues/1376 https://github.com/ROCmSoftwarePlatform/rocBLAS/issues/1185
Unfortunately it looks like the last release of rocr was 3.1.x and there are no 4.x or later branches: https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/111
Are you able to test with the open-source rocr from https://github.com/RadeonOpenCompute/ROCR-Runtime? Any chance we can get an updated rocr, or is 3.1.x supposed to work for gfx10?
OK, so rocr seems to be working. I verified that with rocm_bandwidth_test (https://github.com/RadeonOpenCompute/rocm_bandwidth_test), since rocr-runtime doesn't ship any tests.
So something along the rocclr / HIP path is broken for gfx10.
./rocm-bandwidth-test ........ RocmBandwidthTest Version: 2.5.1
Launch Command is: ./rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)
Device: 0, AMD Ryzen 9 5950X 16-Core Processor
Device: 1, Device 73bf, GPU-XX, 31:0.0
Inter-Device Access
D/D 0 1
0 1 0
1 1 1
Inter-Device Numa Distance
D/D 0 1
0 0 N/A
1 20 0
Unidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 7.030
1 7.262 1259.078
Bidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 13.292
1 13.292 N/A
Thanks for the pointers.
@JonChesterfield do you have any examples / tests that do the "c++ compiled for amdgcn as freestanding and launched using the functions in hsa.h"? I am trying to follow https://github.com/RadeonOpenCompute/rocminfo as an example, but I don't see GCN binaries in the final ELF file that goes into the rocr / hsa runtime.
Update: found https://github.com/ROCm-Developer-Tools/LLVM-AMDGPU-Assembler-Extra to play around with.
Update 2: I have been able to run simple code after updating to code object version 3. Pushed a fork: https://github.com/Powderluv/LLVM-AMDGPU-Assembler-Extra
Hey. I missed the above comments but saw this while looking at the tangentially related #193. I'm not clear what the status of the gfx10 cards is - the 4.1 release notes don't seem to mention it. Unofficially some code does seem to run on them, and I believe rocr and the compiler backend are functional. OpenMP does not work on gfx10 yet, working on that at present.
The code object format is currently transitioning from 3 to 4. I think the status is rocm 3.10 needs v3, rocm 4.1 can use v4, llvm trunk is reviewing patches to bring v4 online.
Using raw C++ means trading the many conveniences of the high level languages for an increase in control. Documentation is sparse, your mileage may vary. Nevertheless, an example of going down that rabbit hole is https://github.com/jonChesterfield/hostrpc, which is a bare metal prototype that I'm hoping to implement libc on top of (thus getting away from freestanding for applications). You may find it interesting but it's not production code yet.
A freestanding compile invocation is along the lines of:
GFX=gfx906 clang -O2 -ffreestanding --target=amdgcn-amd-amdhsa -march=$GFX -mcpu=$GFX -nogpulib -emit-llvm
That will emit IR for gfx906: functions, data, and so forth.
To get something that can be launched, one currently needs to use OpenCL/HIP/OpenMP/IR/asm, as the kernel calling convention is not exposed to C++. That's somewhat annoying, but the 'kernel' function only needs to contain a call to something written in C. E.g.:
int __device_start_cast(int argc, __global void *argv);

// Lane id within the wavefront, computed with the mbcnt intrinsics.
static unsigned get_lane_id(void)
{
  return __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u));
}

// Thin kernel wrapper: all it does is call into code written in plain C.
kernel void __device_start(int argc, __global void *argv, __global int *res)
{
  res[get_lane_id()] = __device_start_cast(argc, argv);
}
Given some IR that contains one or more kernel functions, llc can emit a code object which the HSA loader can run on the GPU. The interface to that is RadeonOpenCompute/ROCR-Runtime/src/inc/hsa.h. It's verbose, but works broadly as the comments suggest.
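Concretely, the build side of that flow looks roughly like this (a sketch, not a tested recipe: the file names are placeholders, the code object version flag follows the v3 discussion earlier in this thread, and depending on the toolchain you may need ld.lld to turn the object into a loadable code object):

```shell
# Sketch: clang emits IR for the target, llc lowers it to an ELF object,
# and lld links the shared object ("hsaco") that the hsa.h loader accepts.
# gfx906, kernel.cpp, and kernel.hsaco are placeholders.
GFX=gfx906
clang -O2 -ffreestanding --target=amdgcn-amd-amdhsa -mcpu=$GFX -nogpulib \
      -emit-llvm -c kernel.cpp -o kernel.bc
llc -mtriple=amdgcn-amd-amdhsa -mcpu=$GFX --amdhsa-code-object-version=3 \
    -filetype=obj kernel.bc -o kernel.o
ld.lld -shared kernel.o -o kernel.hsaco
```

The resulting kernel.hsaco is what you hand to the ROCr loader on the host side (hsa_code_object_reader / executable APIs in hsa.h).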
Thank you for this. hostrpc seems very useful. We will give it a spin and post issues here or on the hostrpc repo.
Also, a libc would be fantastic, along with some utils for debugging and logging.
OpenMP team, what is the status of AOMP on gfx1030? Should we get a test machine in our AOMP lab?
FYI, https://github.com/RadeonOpenCompute/ROCm/issues/887#issuecomment-822222885. I hope that once the ROCm side enables RDNA, AOMP works out of the box. Right now, nailing down the software on GFX9 is really critical.
RocmBandwidthTest Version: 2.6.0 / rocm-5.1.2 gfx1030 / uname = 5.4.0-122-generic
Launch Command is: ./rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)
Device: 0, 11th Gen Intel(R) Core(TM) i5-11400F @ 2.60GHz
Device: 1, AMD Radeon RX 6800, GPU-XX, 03:0.0
Inter-Device Access
D/D 0 1
0 1 0
1 1 1
Inter-Device Numa Distance
D/D 0 1
0 0 N/A
1 20 0
Unidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 26.662
1 28.566 848.405
Bidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 31.417
1 31.417 N/A