`hipGetDeviceCount` causes segmentation fault with OpenMP offloading when called during OMPT init
Note: rocm_smi_lib is affected by this issue as well. However, there we get an error message instead of a segmentation fault: https://github.com/RadeonOpenCompute/rocm_smi_lib/issues/129
Description
I was testing how OpenMP target regions interact with HIP in Score-P when I ran into an issue where the program simply crashed with a segmentation fault. While investigating, I found that the crash occurs because of hipGetDeviceCount, which is called during the initialization of our HIP adapter.
The following code is a simplified version of our setup, but shows what's going wrong:
#include <stdio.h>
#include <omp-tools.h>
#include <hip_runtime_api.h>

/* unused here; left over from the related rocm_smi_lib reproducer */
#define PRINT_RSMI_ERR(RET) { \
    if (RET != RSMI_STATUS_SUCCESS) { \
        printf("[ERROR] RSMI call returned %d at line %d\n", (RET), __LINE__); \
        const char* error_string; \
        rsmi_status_string( (RET), &error_string ); \
        printf("[ERROR MESSAGE] %s\n", error_string); \
    } \
}

static int
initialize_tool( ompt_function_lookup_t lookup,
                 int initialDeviceNum,
                 ompt_data_t* toolData )
{
    return 1; /* non-zero indicates success */
}

static void
finalize_tool( ompt_data_t* toolData )
{}

ompt_start_tool_result_t*
ompt_start_tool( unsigned int omp_version, /* == _OPENMP */
                 const char* runtime_version )
{
    static ompt_start_tool_result_t tool = { &initialize_tool,
                                             &finalize_tool,
                                             ompt_data_none };
    int devices;
    hipGetDeviceCount(&devices);
    printf("Number of devices = %d\n", devices);
    return &tool;
}

int main( void )
{
}
It seems like multiple parts of the ROCm stack have issues when called before the OMPT interface has finished initializing. From what I've seen previously when discussing the OMPT function ompt_get_num_devices, this could be because the AMDGPU RTL is initialized long after the OMPT interface is initialized (comment)
What would I expect?
Ideally, I would get the correct number of devices. HIP seems to be the only adapter causing these issues; CUDA works completely fine. If that's not possible, at least an error message instead of crashing the application outright would be sufficient.
Stack trace
Here's the stack trace via GDB:
Program received signal SIGSEGV, Segmentation fault.
0x00007fffeccc65ce in std::_Rb_tree_decrement(std::_Rb_tree_node_base*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) bt
#0 0x00007fffeccc65ce in std::_Rb_tree_decrement(std::_Rb_tree_node_base*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x00007ffff675fd37 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#2 0x00007ffff675c3cd in ?? () from /opt/rocm/lib/libamdhip64.so.5
#3 0x00007ffff675c63e in ?? () from /opt/rocm/lib/libamdhip64.so.5
#4 0x00007ffff672be85 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#5 0x00007ffff64905a5 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#6 0x00007ffff6099f68 in __pthread_once_slow (once_control=0x7ffff7e47868, init_routine=0x7fffeccdadb0 <__once_proxy>)
at ./nptl/pthread_once.c:116
#7 0x00007ffff64af19e in hipGetDeviceCount () from /opt/rocm/lib/libamdhip64.so.5
#8 0x0000000000205088 in ompt_start_tool ()
#9 0x00007ffff7fa4723 in ompt_pre_init () from /opt/rocm/llvm/lib/libomp.so
#10 0x00007ffff7f232ed in __kmp_do_serial_initialize() () from /opt/rocm/llvm/lib/libomp.so
#11 0x00007ffff7f2dc2c in __kmp_serial_initialize () from /opt/rocm/llvm/lib/libomp.so
#12 0x00007ffff7fa534f in libomp_ompt_connect () from /opt/rocm/llvm/lib/libomp.so
#13 0x00007ffff63b2fe2 in ompt_init() () from /opt/rocm/llvm/lib/libomptarget.so.16git
#14 0x00007ffff63c0f05 in init() () from /opt/rocm/llvm/lib/libomptarget.so.16git
#15 0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffc728, env=env@entry=0x7fffffffc738)
at ./elf/dl-init.c:70
#16 0x00007ffff7fc9568 in call_init (env=0x7fffffffc738, argv=0x7fffffffc728, argc=1, l=<optimized out>) at ./elf/dl-init.c:33
#17 _dl_init (main_map=0x7ffff7ffe2e0, argc=1, argv=0x7fffffffc728, env=0x7fffffffc738) at ./elf/dl-init.c:117
#18 0x00007ffff7fe32ea in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#19 0x0000000000000001 in ?? ()
#20 0x00007fffffffcbd2 in ?? ()
#21 0x0000000000000000 in ?? ()
System configuration:
For testing, I ran the tool on the JURECA-DC AMDGPU Evaluation Node with 4 MI250X GPUs in an Ubuntu 22.04 LTS container. However, I was also able to reproduce the issue on other machines.
$ hipconfig --full
HIP version : 5.6.31061-8c743ae5d
== hipconfig
HIP_PATH : /opt/rocm-5.6.0
ROCM_PATH : /opt/rocm-5.6.0
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME : rocclr
CPP_CONFIG : -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/llvm/bin/../lib/clang/16.0.0
== hip-clang
HIP_CLANG_PATH : /opt/rocm-5.6.0/llvm/bin
AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.0 23243 be997b2f3651a41597d7a41441fff8ade4ac59ac)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.6.0/llvm/bin
AMD LLVM version 16.0.0git
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: znver3
Registered Targets:
amdgcn - AMD GCN GPUs
r600 - AMD GPUs HD2XXX-HD6XXX
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : -isystem "/opt/rocm-5.6.0/include" -O3
hip-clang-ldflags : -O3 --hip-link --rtlib=compiler-rt -unwindlib=libgcc
=== Environment Variables
PATH=/opt/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LD_LIBRARY_PATH=/opt/rocm/lib:/opt/rocm/llvm/lib::/.singularity.d/libs
== Linux Kernel
Hostname : jrc0850.jureca
Linux jrc0850.jureca 4.18.0-477.15.1.el8_8.x86_64 #1 SMP Wed Jun 28 15:04:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy
hipInfo:
$ ./hipInfo
--------------------------------------------------------------------------------
device# 0
Name: AMD Instinct MI250X/MI250
pciBusID: 41
pciDeviceID: 0
pciDomainID: 0
multiProcessorCount: 104
maxThreadsPerMultiProcessor: 2048
isMultiGpuBoard: 0
clockRate: 1700 Mhz
memoryClockRate: 1600 Mhz
memoryBusWidth: 4096
totalGlobalMem: 63.98 GB
totalConstMem: 2147483647
sharedMemPerBlock: 64.00 KB
canMapHostMemory: 1
regsPerBlock: 65536
warpSize: 64
l2CacheSize: 8388608
computeMode: 0
maxThreadsPerBlock: 1024
maxThreadsDim.x: 1024
maxThreadsDim.y: 1024
maxThreadsDim.z: 1024
maxGridSize.x: 2147483647
maxGridSize.y: 2147483647
maxGridSize.z: 2147483647
major: 9
minor: 0
concurrentKernels: 1
cooperativeLaunch: 1
cooperativeMultiDeviceLaunch: 1
isIntegrated: 0
maxTexture1D: 16384
maxTexture2D.width: 16384
maxTexture2D.height: 16384
maxTexture3D.width: 16384
maxTexture3D.height: 16384
maxTexture3D.depth: 8192
isLargeBar: 1
asicRevision: 1
maxSharedMemoryPerMultiProcessor: 64.00 KB
clockInstructionRate: 1000.00 Mhz
arch.hasGlobalInt32Atomics: 1
arch.hasGlobalFloatAtomicExch: 1
arch.hasSharedInt32Atomics: 1
arch.hasSharedFloatAtomicExch: 1
arch.hasFloatAtomicAdd: 1
arch.hasGlobalInt64Atomics: 1
arch.hasSharedInt64Atomics: 1
arch.hasDoubles: 1
arch.hasWarpVote: 1
arch.hasWarpBallot: 1
arch.hasWarpShuffle: 1
arch.hasFunnelShift: 0
arch.hasThreadFenceSystem: 1
arch.hasSyncThreadsExt: 0
arch.hasSurfaceFuncs: 0
arch.has3dGrid: 1
arch.hasDynamicParallelism: 0
gcnArchName: gfx90a:sramecc+:xnack-
peers: device#1 device#2 device#3 device#4 device#5 device#6 device#7
non-peers: device#0
memInfo.total: 63.98 GB
memInfo.free: 63.91 GB (100%)
--------------------------------------------------------------------------------
device# 1
Name: AMD Instinct MI250X/MI250
pciBusID: 44
pciDeviceID: 0
pciDomainID: 0
multiProcessorCount: 104
maxThreadsPerMultiProcessor: 2048
isMultiGpuBoard: 0
clockRate: 1700 Mhz
memoryClockRate: 1600 Mhz
memoryBusWidth: 4096
totalGlobalMem: 63.98 GB
totalConstMem: 2147483647
sharedMemPerBlock: 64.00 KB
canMapHostMemory: 1
regsPerBlock: 65536
warpSize: 64
l2CacheSize: 8388608
computeMode: 0
maxThreadsPerBlock: 1024
maxThreadsDim.x: 1024
maxThreadsDim.y: 1024
maxThreadsDim.z: 1024
maxGridSize.x: 2147483647
maxGridSize.y: 2147483647
maxGridSize.z: 2147483647
major: 9
minor: 0
concurrentKernels: 1
cooperativeLaunch: 1
cooperativeMultiDeviceLaunch: 1
isIntegrated: 0
maxTexture1D: 16384
maxTexture2D.width: 16384
maxTexture2D.height: 16384
maxTexture3D.width: 16384
maxTexture3D.height: 16384
maxTexture3D.depth: 8192
isLargeBar: 1
asicRevision: 1
maxSharedMemoryPerMultiProcessor: 64.00 KB
clockInstructionRate: 1000.00 Mhz
arch.hasGlobalInt32Atomics: 1
arch.hasGlobalFloatAtomicExch: 1
arch.hasSharedInt32Atomics: 1
arch.hasSharedFloatAtomicExch: 1
arch.hasFloatAtomicAdd: 1
arch.hasGlobalInt64Atomics: 1
arch.hasSharedInt64Atomics: 1
arch.hasDoubles: 1
arch.hasWarpVote: 1
arch.hasWarpBallot: 1
arch.hasWarpShuffle: 1
arch.hasFunnelShift: 0
arch.hasThreadFenceSystem: 1
arch.hasSyncThreadsExt: 0
arch.hasSurfaceFuncs: 0
arch.has3dGrid: 1
arch.hasDynamicParallelism: 0
gcnArchName: gfx90a:sramecc+:xnack-
peers: device#0 device#2 device#3 device#4 device#5 device#6 device#7
non-peers: device#1
memInfo.total: 63.98 GB
memInfo.free: 63.91 GB (100%)
--------------------------------------------------------------------------------
device# 2
Name: AMD Instinct MI250X/MI250
pciBusID: 47
pciDeviceID: 0
pciDomainID: 0
multiProcessorCount: 104
maxThreadsPerMultiProcessor: 2048
isMultiGpuBoard: 0
clockRate: 1700 Mhz
memoryClockRate: 1600 Mhz
memoryBusWidth: 4096
totalGlobalMem: 63.98 GB
totalConstMem: 2147483647
sharedMemPerBlock: 64.00 KB
canMapHostMemory: 1
regsPerBlock: 65536
warpSize: 64
l2CacheSize: 8388608
computeMode: 0
maxThreadsPerBlock: 1024
maxThreadsDim.x: 1024
maxThreadsDim.y: 1024
maxThreadsDim.z: 1024
maxGridSize.x: 2147483647
maxGridSize.y: 2147483647
maxGridSize.z: 2147483647
major: 9
minor: 0
concurrentKernels: 1
cooperativeLaunch: 1
cooperativeMultiDeviceLaunch: 1
isIntegrated: 0
maxTexture1D: 16384
maxTexture2D.width: 16384
maxTexture2D.height: 16384
maxTexture3D.width: 16384
maxTexture3D.height: 16384
maxTexture3D.depth: 8192
isLargeBar: 1
asicRevision: 1
maxSharedMemoryPerMultiProcessor: 64.00 KB
clockInstructionRate: 1000.00 Mhz
arch.hasGlobalInt32Atomics: 1
arch.hasGlobalFloatAtomicExch: 1
arch.hasSharedInt32Atomics: 1
arch.hasSharedFloatAtomicExch: 1
arch.hasFloatAtomicAdd: 1
arch.hasGlobalInt64Atomics: 1
arch.hasSharedInt64Atomics: 1
arch.hasDoubles: 1
arch.hasWarpVote: 1
arch.hasWarpBallot: 1
arch.hasWarpShuffle: 1
arch.hasFunnelShift: 0
arch.hasThreadFenceSystem: 1
arch.hasSyncThreadsExt: 0
arch.hasSurfaceFuncs: 0
arch.has3dGrid: 1
arch.hasDynamicParallelism: 0
gcnArchName: gfx90a:sramecc+:xnack-
peers: device#0 device#1 device#3 device#4 device#5 device#6 device#7
non-peers: device#2
memInfo.total: 63.98 GB
memInfo.free: 63.91 GB (100%)
--------------------------------------------------------------------------------
device# 3
Name: AMD Instinct MI250X/MI250
pciBusID: 50
pciDeviceID: 0
pciDomainID: 0
multiProcessorCount: 104
maxThreadsPerMultiProcessor: 2048
isMultiGpuBoard: 0
clockRate: 1700 Mhz
memoryClockRate: 1600 Mhz
memoryBusWidth: 4096
totalGlobalMem: 63.98 GB
totalConstMem: 2147483647
sharedMemPerBlock: 64.00 KB
canMapHostMemory: 1
regsPerBlock: 65536
warpSize: 64
l2CacheSize: 8388608
computeMode: 0
maxThreadsPerBlock: 1024
maxThreadsDim.x: 1024
maxThreadsDim.y: 1024
maxThreadsDim.z: 1024
maxGridSize.x: 2147483647
maxGridSize.y: 2147483647
maxGridSize.z: 2147483647
major: 9
minor: 0
concurrentKernels: 1
cooperativeLaunch: 1
cooperativeMultiDeviceLaunch: 1
isIntegrated: 0
maxTexture1D: 16384
maxTexture2D.width: 16384
maxTexture2D.height: 16384
maxTexture3D.width: 16384
maxTexture3D.height: 16384
maxTexture3D.depth: 8192
isLargeBar: 1
asicRevision: 1
maxSharedMemoryPerMultiProcessor: 64.00 KB
clockInstructionRate: 1000.00 Mhz
arch.hasGlobalInt32Atomics: 1
arch.hasGlobalFloatAtomicExch: 1
arch.hasSharedInt32Atomics: 1
arch.hasSharedFloatAtomicExch: 1
arch.hasFloatAtomicAdd: 1
arch.hasGlobalInt64Atomics: 1
arch.hasSharedInt64Atomics: 1
arch.hasDoubles: 1
arch.hasWarpVote: 1
arch.hasWarpBallot: 1
arch.hasWarpShuffle: 1
arch.hasFunnelShift: 0
arch.hasThreadFenceSystem: 1
arch.hasSyncThreadsExt: 0
arch.hasSurfaceFuncs: 0
arch.has3dGrid: 1
arch.hasDynamicParallelism: 0
gcnArchName: gfx90a:sramecc+:xnack-
peers: device#0 device#1 device#2 device#4 device#5 device#6 device#7
non-peers: device#3
memInfo.total: 63.98 GB
memInfo.free: 63.91 GB (100%)
--------------------------------------------------------------------------------
device# 4
Name: AMD Instinct MI250X/MI250
pciBusID: 173
pciDeviceID: 0
pciDomainID: 0
multiProcessorCount: 104
maxThreadsPerMultiProcessor: 2048
isMultiGpuBoard: 0
clockRate: 1700 Mhz
memoryClockRate: 1600 Mhz
memoryBusWidth: 4096
totalGlobalMem: 63.98 GB
totalConstMem: 2147483647
sharedMemPerBlock: 64.00 KB
canMapHostMemory: 1
regsPerBlock: 65536
warpSize: 64
l2CacheSize: 8388608
computeMode: 0
maxThreadsPerBlock: 1024
maxThreadsDim.x: 1024
maxThreadsDim.y: 1024
maxThreadsDim.z: 1024
maxGridSize.x: 2147483647
maxGridSize.y: 2147483647
maxGridSize.z: 2147483647
major: 9
minor: 0
concurrentKernels: 1
cooperativeLaunch: 1
cooperativeMultiDeviceLaunch: 1
isIntegrated: 0
maxTexture1D: 16384
maxTexture2D.width: 16384
maxTexture2D.height: 16384
maxTexture3D.width: 16384
maxTexture3D.height: 16384
maxTexture3D.depth: 8192
isLargeBar: 1
asicRevision: 1
maxSharedMemoryPerMultiProcessor: 64.00 KB
clockInstructionRate: 1000.00 Mhz
arch.hasGlobalInt32Atomics: 1
arch.hasGlobalFloatAtomicExch: 1
arch.hasSharedInt32Atomics: 1
arch.hasSharedFloatAtomicExch: 1
arch.hasFloatAtomicAdd: 1
arch.hasGlobalInt64Atomics: 1
arch.hasSharedInt64Atomics: 1
arch.hasDoubles: 1
arch.hasWarpVote: 1
arch.hasWarpBallot: 1
arch.hasWarpShuffle: 1
arch.hasFunnelShift: 0
arch.hasThreadFenceSystem: 1
arch.hasSyncThreadsExt: 0
arch.hasSurfaceFuncs: 0
arch.has3dGrid: 1
arch.hasDynamicParallelism: 0
gcnArchName: gfx90a:sramecc+:xnack-
peers: device#0 device#1 device#2 device#3 device#5 device#6 device#7
non-peers: device#4
memInfo.total: 63.98 GB
memInfo.free: 63.91 GB (100%)
--------------------------------------------------------------------------------
device# 5
Name: AMD Instinct MI250X/MI250
pciBusID: 176
pciDeviceID: 0
pciDomainID: 0
multiProcessorCount: 104
maxThreadsPerMultiProcessor: 2048
isMultiGpuBoard: 0
clockRate: 1700 Mhz
memoryClockRate: 1600 Mhz
memoryBusWidth: 4096
totalGlobalMem: 63.98 GB
totalConstMem: 2147483647
sharedMemPerBlock: 64.00 KB
canMapHostMemory: 1
regsPerBlock: 65536
warpSize: 64
l2CacheSize: 8388608
computeMode: 0
maxThreadsPerBlock: 1024
maxThreadsDim.x: 1024
maxThreadsDim.y: 1024
maxThreadsDim.z: 1024
maxGridSize.x: 2147483647
maxGridSize.y: 2147483647
maxGridSize.z: 2147483647
major: 9
minor: 0
concurrentKernels: 1
cooperativeLaunch: 1
cooperativeMultiDeviceLaunch: 1
isIntegrated: 0
maxTexture1D: 16384
maxTexture2D.width: 16384
maxTexture2D.height: 16384
maxTexture3D.width: 16384
maxTexture3D.height: 16384
maxTexture3D.depth: 8192
isLargeBar: 1
asicRevision: 1
maxSharedMemoryPerMultiProcessor: 64.00 KB
clockInstructionRate: 1000.00 Mhz
arch.hasGlobalInt32Atomics: 1
arch.hasGlobalFloatAtomicExch: 1
arch.hasSharedInt32Atomics: 1
arch.hasSharedFloatAtomicExch: 1
arch.hasFloatAtomicAdd: 1
arch.hasGlobalInt64Atomics: 1
arch.hasSharedInt64Atomics: 1
arch.hasDoubles: 1
arch.hasWarpVote: 1
arch.hasWarpBallot: 1
arch.hasWarpShuffle: 1
arch.hasFunnelShift: 0
arch.hasThreadFenceSystem: 1
arch.hasSyncThreadsExt: 0
arch.hasSurfaceFuncs: 0
arch.has3dGrid: 1
arch.hasDynamicParallelism: 0
gcnArchName: gfx90a:sramecc+:xnack-
peers: device#0 device#1 device#2 device#3 device#4 device#6 device#7
non-peers: device#5
memInfo.total: 63.98 GB
memInfo.free: 63.91 GB (100%)
--------------------------------------------------------------------------------
device# 6
Name: AMD Instinct MI250X/MI250
pciBusID: 179
pciDeviceID: 0
pciDomainID: 0
multiProcessorCount: 104
maxThreadsPerMultiProcessor: 2048
isMultiGpuBoard: 0
clockRate: 1700 Mhz
memoryClockRate: 1600 Mhz
memoryBusWidth: 4096
totalGlobalMem: 63.98 GB
totalConstMem: 2147483647
sharedMemPerBlock: 64.00 KB
canMapHostMemory: 1
regsPerBlock: 65536
warpSize: 64
l2CacheSize: 8388608
computeMode: 0
maxThreadsPerBlock: 1024
maxThreadsDim.x: 1024
maxThreadsDim.y: 1024
maxThreadsDim.z: 1024
maxGridSize.x: 2147483647
maxGridSize.y: 2147483647
maxGridSize.z: 2147483647
major: 9
minor: 0
concurrentKernels: 1
cooperativeLaunch: 1
cooperativeMultiDeviceLaunch: 1
isIntegrated: 0
maxTexture1D: 16384
maxTexture2D.width: 16384
maxTexture2D.height: 16384
maxTexture3D.width: 16384
maxTexture3D.height: 16384
maxTexture3D.depth: 8192
isLargeBar: 1
asicRevision: 1
maxSharedMemoryPerMultiProcessor: 64.00 KB
clockInstructionRate: 1000.00 Mhz
arch.hasGlobalInt32Atomics: 1
arch.hasGlobalFloatAtomicExch: 1
arch.hasSharedInt32Atomics: 1
arch.hasSharedFloatAtomicExch: 1
arch.hasFloatAtomicAdd: 1
arch.hasGlobalInt64Atomics: 1
arch.hasSharedInt64Atomics: 1
arch.hasDoubles: 1
arch.hasWarpVote: 1
arch.hasWarpBallot: 1
arch.hasWarpShuffle: 1
arch.hasFunnelShift: 0
arch.hasThreadFenceSystem: 1
arch.hasSyncThreadsExt: 0
arch.hasSurfaceFuncs: 0
arch.has3dGrid: 1
arch.hasDynamicParallelism: 0
gcnArchName: gfx90a:sramecc+:xnack-
peers: device#0 device#1 device#2 device#3 device#4 device#5 device#7
non-peers: device#6
memInfo.total: 63.98 GB
memInfo.free: 63.91 GB (100%)
--------------------------------------------------------------------------------
device# 7
Name: AMD Instinct MI250X/MI250
pciBusID: 182
pciDeviceID: 0
pciDomainID: 0
multiProcessorCount: 104
maxThreadsPerMultiProcessor: 2048
isMultiGpuBoard: 0
clockRate: 1700 Mhz
memoryClockRate: 1600 Mhz
memoryBusWidth: 4096
totalGlobalMem: 63.98 GB
totalConstMem: 2147483647
sharedMemPerBlock: 64.00 KB
canMapHostMemory: 1
regsPerBlock: 65536
warpSize: 64
l2CacheSize: 8388608
computeMode: 0
maxThreadsPerBlock: 1024
maxThreadsDim.x: 1024
maxThreadsDim.y: 1024
maxThreadsDim.z: 1024
maxGridSize.x: 2147483647
maxGridSize.y: 2147483647
maxGridSize.z: 2147483647
major: 9
minor: 0
concurrentKernels: 1
cooperativeLaunch: 1
cooperativeMultiDeviceLaunch: 1
isIntegrated: 0
maxTexture1D: 16384
maxTexture2D.width: 16384
maxTexture2D.height: 16384
maxTexture3D.width: 16384
maxTexture3D.height: 16384
maxTexture3D.depth: 8192
isLargeBar: 1
asicRevision: 1
maxSharedMemoryPerMultiProcessor: 64.00 KB
clockInstructionRate: 1000.00 Mhz
arch.hasGlobalInt32Atomics: 1
arch.hasGlobalFloatAtomicExch: 1
arch.hasSharedInt32Atomics: 1
arch.hasSharedFloatAtomicExch: 1
arch.hasFloatAtomicAdd: 1
arch.hasGlobalInt64Atomics: 1
arch.hasSharedInt64Atomics: 1
arch.hasDoubles: 1
arch.hasWarpVote: 1
arch.hasWarpBallot: 1
arch.hasWarpShuffle: 1
arch.hasFunnelShift: 0
arch.hasThreadFenceSystem: 1
arch.hasSyncThreadsExt: 0
arch.hasSurfaceFuncs: 0
arch.has3dGrid: 1
arch.hasDynamicParallelism: 0
gcnArchName: gfx90a:sramecc+:xnack-
peers: device#0 device#1 device#2 device#3 device#4 device#5 device#6
non-peers: device#7
memInfo.total: 63.98 GB
memInfo.free: 63.91 GB (100%)
Can you share the output of the code with the environment variable AMD_LOG_LEVEL=7?
So basically: AMD_LOG_LEVEL=7 ./a.out
Here's the output with the environment variable set. Whether the code above works seems to depend on the compiler:
$ amdclang -O0 -fopenmp --offload-arch=gfx90a reproducer.c -I/opt/rocm/include/hip -D__HIP_PLATFORM_AMD__ -lamdhip64
$ AMD_LOG_LEVEL=7 ./a.out
Segmentation fault
$ hipcc -O0 -fopenmp --offload-arch=gfx90a reproducer.c -I/opt/rocm/include/hip -D__HIP_PLATFORM_AMD__
:3:rocdevice.cpp :434 : 4089833736985 us: 27791: [tid:0x14d3f4ccfa80] Initializing HSA stack.
:3:comgrctx.cpp :33 : 4089833796108 us: 27791: [tid:0x14d3f4ccfa80] Loading COMGR library.
:3:rocdevice.cpp :200 : 4089833796142 us: 27791: [tid:0x14d3f4ccfa80] Numa selects cpu agent[2]=0x2215500(fine=0x2215720,coarse=0x220e640) for gpu agent=0x2273870
:3:rocdevice.cpp :1634: 4089833796462 us: 27791: [tid:0x14d3f4ccfa80] HMM support: 1, xnack: 0, direct host access: 0
:4:rocdevice.cpp :2012: 4089833796675 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf3400000, size 0x101000
:4:rocdevice.cpp :2012: 4089833796957 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf3200000, size 0x101000
:3:rocdevice.cpp :200 : 4089833797048 us: 27791: [tid:0x14d3f4ccfa80] Numa selects cpu agent[2]=0x2215500(fine=0x2215720,coarse=0x220e640) for gpu agent=0x2278160
:3:rocdevice.cpp :1634: 4089833797140 us: 27791: [tid:0x14d3f4ccfa80] HMM support: 1, xnack: 0, direct host access: 0
:4:rocdevice.cpp :2012: 4089833797316 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf3000000, size 0x101000
:4:rocdevice.cpp :2012: 4089833797552 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf2e00000, size 0x101000
:3:rocdevice.cpp :200 : 4089833797633 us: 27791: [tid:0x14d3f4ccfa80] Numa selects cpu agent[2]=0x2215500(fine=0x2215720,coarse=0x220e640) for gpu agent=0x227c5a0
:3:rocdevice.cpp :1634: 4089833797719 us: 27791: [tid:0x14d3f4ccfa80] HMM support: 1, xnack: 0, direct host access: 0
:4:rocdevice.cpp :2012: 4089833797896 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf2c00000, size 0x101000
:4:rocdevice.cpp :2012: 4089833798237 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf2a00000, size 0x101000
:3:rocdevice.cpp :200 : 4089833798315 us: 27791: [tid:0x14d3f4ccfa80] Numa selects cpu agent[2]=0x2215500(fine=0x2215720,coarse=0x220e640) for gpu agent=0x22809e0
:3:rocdevice.cpp :1634: 4089833798400 us: 27791: [tid:0x14d3f4ccfa80] HMM support: 1, xnack: 0, direct host access: 0
:4:rocdevice.cpp :2012: 4089833798584 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf2800000, size 0x101000
:4:rocdevice.cpp :2012: 4089833798993 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf2600000, size 0x101000
:3:rocdevice.cpp :200 : 4089833799188 us: 27791: [tid:0x14d3f4ccfa80] Numa selects cpu agent[6]=0x2267890(fine=0x2267a80,coarse=0x2268200) for gpu agent=0x2284e20
:3:rocdevice.cpp :1634: 4089833799268 us: 27791: [tid:0x14d3f4ccfa80] HMM support: 1, xnack: 0, direct host access: 0
:4:rocdevice.cpp :2012: 4089833799766 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf2400000, size 0x101000
:4:rocdevice.cpp :2012: 4089833800301 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf2200000, size 0x101000
:3:rocdevice.cpp :200 : 4089833800378 us: 27791: [tid:0x14d3f4ccfa80] Numa selects cpu agent[6]=0x2267890(fine=0x2267a80,coarse=0x2268200) for gpu agent=0x2289230
:3:rocdevice.cpp :1634: 4089833800455 us: 27791: [tid:0x14d3f4ccfa80] HMM support: 1, xnack: 0, direct host access: 0
:4:rocdevice.cpp :2012: 4089833800938 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf2000000, size 0x101000
:4:rocdevice.cpp :2012: 4089833801480 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf1e00000, size 0x101000
:3:rocdevice.cpp :200 : 4089833801556 us: 27791: [tid:0x14d3f4ccfa80] Numa selects cpu agent[6]=0x2267890(fine=0x2267a80,coarse=0x2268200) for gpu agent=0x228d670
:3:rocdevice.cpp :1634: 4089833801634 us: 27791: [tid:0x14d3f4ccfa80] HMM support: 1, xnack: 0, direct host access: 0
:4:rocdevice.cpp :2012: 4089833802120 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf1c00000, size 0x101000
:4:rocdevice.cpp :2012: 4089833802654 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf1a00000, size 0x101000
:3:rocdevice.cpp :200 : 4089833802731 us: 27791: [tid:0x14d3f4ccfa80] Numa selects cpu agent[6]=0x2267890(fine=0x2267a80,coarse=0x2268200) for gpu agent=0x2291a80
:3:rocdevice.cpp :1634: 4089833802995 us: 27791: [tid:0x14d3f4ccfa80] HMM support: 1, xnack: 0, direct host access: 0
:4:rocdevice.cpp :2012: 4089833803028 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14d3f4723000, size 0x1c0
:4:rocdevice.cpp :2012: 4089833803518 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf1800000, size 0x101000
:4:rocdevice.cpp :2012: 4089833804045 us: 27791: [tid:0x14d3f4ccfa80] Allocate hsa host memory 0x14cbf1600000, size 0x101000
:4:runtime.cpp :83 : 4089833804192 us: 27791: [tid:0x14d3f4ccfa80] init
:3:hip_context.cpp :48 : 4089833804195 us: 27791: [tid:0x14d3f4ccfa80] Direct Dispatch: 1
:3:hip_device_runtime.cpp :548 : 4089833804221 us: 27791: [tid:0x14d3f4ccfa80] hipGetDeviceCount: Returned hipSuccess :
Number of devices = 8
Please note that I had to add artificial OpenMP regions to the main function to get the OMPT interface to initialize with hipcc. With amdclang, the interface gets initialized even with an empty main function.
Seems like the difference is the flag --hip-link, which hipcc adds during compilation:
$ amdclang -fopenmp --offload-arch=gfx90a -I/opt/rocm/include/hip -D__HIP_PLATFORM_AMD__ reproducer.c -lamdhip64 --hip-link
$ ./a.out
Number of devices = 0
This solves the issue, but crashing with a segmentation fault without the flag --hip-link still seems wrong.
Sounds like a compiler behavior issue tied to a flag. Tagging @yxsamliu, who is way more proficient in compilers than me.
After testing a bit more, I found out why --hip-link solves the issue: it also removes -lomptarget (libomptarget.so), the library for OpenMP target offloading, from the link. Adding -lomptarget back causes the same issue to show up again.
Without -lomptarget, OpenMP target regions just run on the host instead.
When you use --hip-link, the compiler assumes it is a HIP program and uses the HIP toolchain; it then does not enable OpenMP target offloading and therefore won't link with -lomptarget. This is expected behaviour.
When you use -fopenmp --offload-arch with a C++ program, OpenMP target offloading is enabled. You are calling hipGetDeviceCount from a C++ program. I suspect the error is due to the HIP runtime not being initialized properly in such a program.
What if you disable the OpenMP stuff and just call hipGetDeviceCount in a C++ program without -fopenmp --offload-arch?
A very small test program that just calls hipGetDeviceCount from main works fine with amdclang and OpenMP target offloading (tested on a system with no AMD GPU):
$ cat reproducer_non_ompt.c
#include <stdio.h>
#include <hip_runtime_api.h>

#define PRINT_RSMI_ERR(RET) { \
    if (RET != RSMI_STATUS_SUCCESS) { \
        printf("[ERROR] RSMI call returned %d at line %d\n", (RET), __LINE__); \
        const char* error_string; \
        rsmi_status_string( (RET), &error_string ); \
        printf("[ERROR MESSAGE] %s\n", error_string); \
    } \
}

int main( void )
{
    printf("Enter main\n");
    int devices;
    hipGetDeviceCount(&devices);
    printf("Number of devices = %d\n", devices);

    #pragma omp target
    {}
    #pragma omp parallel
    {}
}
$ amdclang -fopenmp --offload-arch=gfx90a -L/opt/rocm/lib -lamdhip64 reproducer_non_ompt.c -I/opt/rocm/include/hip -D__HIP_PLATFORM_AMD__
$ ./a.out
Enter main
Number of devices = 0
The previous reproducer also works when not using OpenMP target offloading, but fails when it is enabled.
$ cat reproducer.c
#include <stdio.h>
#include <omp-tools.h>
#include <hip_runtime_api.h>

#define PRINT_RSMI_ERR(RET) { \
    if (RET != RSMI_STATUS_SUCCESS) { \
        printf("[ERROR] RSMI call returned %d at line %d\n", (RET), __LINE__); \
        const char* error_string; \
        rsmi_status_string( (RET), &error_string ); \
        printf("[ERROR MESSAGE] %s\n", error_string); \
    } \
}

static int
initialize_tool( ompt_function_lookup_t lookup,
                 int initialDeviceNum,
                 ompt_data_t* toolData )
{
    return 1; /* non-zero indicates success */
}

static void
finalize_tool( ompt_data_t* toolData )
{}

#ifdef __cplusplus
extern "C"
{
#endif

ompt_start_tool_result_t*
ompt_start_tool( unsigned int omp_version, /* == _OPENMP */
                 const char* runtime_version )
{
    static ompt_start_tool_result_t tool = { &initialize_tool,
                                             &finalize_tool,
                                             ompt_data_none };
    int devices;
    hipGetDeviceCount(&devices);
    printf("Number of devices = %d\n", devices);
    return &tool;
}

#ifdef __cplusplus
}
#endif

int main( void )
{
    printf("Enter main\n");

    #pragma omp target
    {}
    #pragma omp parallel
    {}
}
$ amdclang -fopenmp --offload-arch=gfx90a -L/opt/rocm/lib -lamdhip64 reproducer.c -I/opt/rocm/include/hip -D__HIP_PLATFORM_AMD__
$ ./a.out
Segmentation fault (core dumped)
$ amdclang -fopenmp -L/opt/rocm/lib -lamdhip64 reproducer.c -I/opt/rocm/include/hip -D__HIP_PLATFORM_AMD__
Enter main
Number of devices = 0
I have an idea why this happens. We can use the environment variable LD_DEBUG=libs to check what libraries are initialized. Looking closely, there's one important difference:
$ # Without OpenMP target offloading
$ amdclang -fopenmp -L/opt/rocm/lib -lamdhip64 reproducer.c -I/opt/rocm/include/hip -D__HIP_PLATFORM_AMD__
$ LD_DEBUG=libs ./a.out
[...]
599199: calling init: /lib64/ld-linux-x86-64.so.2
599199:
599199:
599199: calling init: /lib/x86_64-linux-gnu/libc.so.6
599199:
599199:
599199: calling init: /opt/amdgpu/lib/x86_64-linux-gnu/libdrm.so.2
599199:
599199:
599199: calling init: /opt/amdgpu/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1
599199:
599199:
599199: calling init: /lib/x86_64-linux-gnu/libz.so.1
599199:
599199:
599199: calling init: /lib/x86_64-linux-gnu/libelf.so.1
599199:
599199:
599199: calling init: /lib/x86_64-linux-gnu/libtinfo.so.6
599199:
599199:
599199: calling init: /lib/x86_64-linux-gnu/libgcc_s.so.1
599199:
599199:
599199: calling init: /lib/x86_64-linux-gnu/libm.so.6
599199:
599199:
599199: calling init: /lib/x86_64-linux-gnu/libstdc++.so.6
599199:
599199:
599199: calling init: /lib/x86_64-linux-gnu/libnuma.so.1
599199:
599199:
599199: calling init: /opt/rocm/lib/libhsa-runtime64.so.1
599199:
599199:
599199: calling init: /opt/rocm/lib/libamd_comgr.so.2
599199:
599199:
599199: calling init: /opt/rocm/llvm/lib/libomp.so
599199:
599199:
599199: calling init: /opt/rocm/lib/libamdhip64.so.5
599199:
599199:
599199: initialize program: ./a.out
599199:
599199:
599199: transferring control: ./a.out
599199:
Enter main
Number of devices = 0
$ # With OpenMP target offloading
$ amdclang -fopenmp --offload-arch=gfx90a -L/opt/rocm/lib -lamdhip64 reproducer.c -I/opt/rocm/include/hip -D__HIP_PLATFORM_AMD__
$ LD_DEBUG=libs ./a.out
[...]
599027: calling init: /lib64/ld-linux-x86-64.so.2
599027:
599027:
599027: calling init: /lib/x86_64-linux-gnu/libc.so.6
599027:
599027:
599027: calling init: /opt/amdgpu/lib/x86_64-linux-gnu/libdrm.so.2
599027:
599027:
599027: calling init: /opt/amdgpu/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1
599027:
599027:
599027: calling init: /lib/x86_64-linux-gnu/libz.so.1
599027:
599027:
599027: calling init: /lib/x86_64-linux-gnu/libelf.so.1
599027:
599027:
599027: calling init: /lib/x86_64-linux-gnu/libtinfo.so.6
599027:
599027:
599027: calling init: /lib/x86_64-linux-gnu/libnuma.so.1
599027:
599027:
599027: calling init: /lib/x86_64-linux-gnu/libm.so.6
599027:
599027:
599027: calling init: /lib/x86_64-linux-gnu/libgcc_s.so.1
599027:
599027:
599027: calling init: /lib/x86_64-linux-gnu/libstdc++.so.6
599027:
599027:
599027: calling init: /opt/rocm/lib/libhsa-runtime64.so.1
599027:
599027:
599027: calling init: /opt/rocm/lib/libamd_comgr.so.2
599027:
599027:
599027: calling init: /opt/rocm/llvm/lib/libomptarget.so.16git
599027:
599027:
599027: calling init: /opt/rocm/llvm/lib/libomp.so
599027:
Segmentation fault (core dumped)
In the case without OpenMP target offloading, libomptarget.so.16git is not loaded, as I would expect. By the start of main, libamdhip64.so.5 has been initialized, and calling hipGetDeviceCount succeeds. With OpenMP target offloading enabled, OMPT initialization happens before the program itself is initialized. At this point libamdhip64.so.5 has not been initialized yet, which, I assume, causes the function call to end in a segmentation fault. The full output of both runs with LD_DEBUG is attached.
I don't know whether that's something that can be fixed on the HIP side or whether work in the LLVM toolchain is required. However, I would at least expect an error message from the call rather than a segmentation fault. We could (probably) work with that.
@Thyre Has this been resolved? If so, please close ticket. Thanks!
The issue probably is not resolved yet with ROCm, as it was just recently resolved in LLVM 19git. I'll recheck tomorrow.
In general, I would consider this as an LLVM issue and not related to your libraries.
@Thyre Thanks for the update! If you can confirm tomorrow that it's resolved with LLVM, let's go ahead and close the ticket. Thanks!
I can confirm that the issue is fixed with LLVM 19git and will therefore eventually also land in ROCm.
As the limitation seems to come from trying to call HIP functions during _dl_start_user, it should probably be documented somewhere, if not done already.
CUDA for example includes this paragraph in their documentation:
The CUDA interfaces use global state that is initialized during host program initiation and destroyed during host program termination. The CUDA runtime and driver cannot detect if this state is invalid, so using any of these interfaces (implicitly or explicitly) during program initiation (or termination after main) will result in undefined behavior.
I'm closing the issue.