[Issue]: ROCm 6.3.x: segfault in ihipMallocManaged when no devices are available
Problem Description
Any HIP program that declares a `__managed__` variable crashes at startup with a segmentation fault when no GPU device is accessible. The crash happens before `main` runs and produces the following backtrace:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff69448a8 in ihipMallocManaged(void**, unsigned long, unsigned int) () from /opt/rocm-6.0.0/lib/libamdhip64.so.6
#0 0x00007ffff69448a8 in ihipMallocManaged(void**, unsigned long, unsigned int) () from /opt/rocm-6.0.0/lib/libamdhip64.so.6
#1 0x00007ffff6a140ec in __hipRegisterManagedVar () from /opt/rocm-6.0.0/lib/libamdhip64.so.6
#2 0x0000000000209cc1 in __hip_module_ctor ()
#3 0x00007ffff60296fb in __libc_start_main_impl () from /lib64/libc.so.6
#4 0x0000000000209bc5 in _start ()
Operating System
Rocky Linux 9.5 (Blue Onyx)
CPU
AMD EPYC 7413 24-Core Processor
GPU
AMD Instinct MI210
ROCm Version
ROCm 6.3.0
ROCm Component
HIP
Steps to Reproduce
Complete reproducer:
#!/bin/sh
cat <<EOF > reproducer_hip.cpp
__managed__ int managed_var;
int main()
{
    return 0;
}
EOF
hipcc -g --offload-arch=gfx90a reproducer_hip.cpp
gdb.minimal -batch -ex "run" -ex "bt" ./a.out 2>&1 | grep -v '^No stack\.$'
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
ROCk module version 6.10.5 is loaded
Unable to open /dev/kfd read-write: Permission denied
bwibking is not member of "video" group, the default DRM access group. Users must be a member of the "video" group or another DRM access group in order for ROCm applications to run successfully.
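The rocminfo output above indicates that the user cannot open /dev/kfd read-write, which is why no device is visible to the runtime. As a quick way to confirm the same condition outside of rocminfo, a minimal POSIX sketch could look like the following. The function name `can_open_read_write` is hypothetical and not part of HIP or ROCm; it only checks file access and does not query the GPU.

```cpp
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: returns true if `path` can be opened read-write
// by the current user, false otherwise (missing file, permission denied, ...).
bool can_open_read_write(const char* path) {
    int fd = open(path, O_RDWR);
    if (fd < 0) {
        return false;  // errno holds the reason, e.g. EACCES or ENOENT
    }
    close(fd);
    return true;
}
```

On a machine in the state described here, `can_open_read_write("/dev/kfd")` would be expected to return false until the user is added to the "video" group or another DRM access group.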
Additional Information
No response
Hi @BenWibking. An internal ticket has been created to investigate your issue. Thanks!
@ppanchad-amd Is there any update on this?
Hi @BenWibking. There is a PR (https://github.com/ROCm/clr/pull/122) that should fix this issue. Thanks!
FYI: the old PR became stale. The new PR is: https://github.com/ROCm/clr/pull/160
This is addressed by internal PR https://github.com/AMD-ROCm-Internal/clr/pull/597. Since it is in the internal repo, you may not be able to access the link. I will update the status here once the change is merged.
@BenWibking The issue has been resolved by the commit "Defer allocation of managed variable", which is now merged into the staging branch and will be included in a future release. Thanks again for the report.
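For readers wondering what "defer allocation" means in practice: the backtrace shows the crash inside `__hipRegisterManagedVar`, called from `__hip_module_ctor` before `main`. The actual clr change is not shown in this thread, so the following is only a plain C++ sketch of the general pattern, under the assumption that registration records the variable and allocation happens later. All names (`register_managed_var`, `allocate_pending_managed_vars`, `g_fake_device_count`) are hypothetical, and `std::calloc` stands in for a managed-memory allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for the number of usable GPUs.
// A real runtime would query the driver instead.
static int g_fake_device_count = 0;

// One registered managed variable: where to store the pointer,
// how many bytes it needs, and whether it has been allocated yet.
struct ManagedVar {
    void** slot;
    std::size_t size;
    bool allocated;
};

// Function-local static avoids static-initialization-order problems.
static std::vector<ManagedVar>& registry() {
    static std::vector<ManagedVar> vars;
    return vars;
}

// Runs at static-initialization time. It only records the request and
// never touches a device, so it is safe when no device is present.
void register_managed_var(void** slot, std::size_t size) {
    assert(slot != nullptr);
    *slot = nullptr;
    registry().push_back({slot, size, false});
}

// Runs lazily, on first real use. Reports failure through the return
// value when no device is available instead of dereferencing anything.
bool allocate_pending_managed_vars() {
    if (g_fake_device_count <= 0) {
        return false;
    }
    for (ManagedVar& v : registry()) {
        if (v.allocated) {
            continue;
        }
        *v.slot = std::calloc(1, v.size);  // placeholder for a managed allocation
        if (*v.slot == nullptr) {
            return false;
        }
        v.allocated = true;
    }
    return true;
}
```

With this shape, a program like the reproducer (which never uses the device) would start and exit normally on a machine without accessible GPUs, and the missing device would surface as an error only at the first call that needs it.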