[Issue]: ROCm 6.3.x: segfault in ihipMallocManaged when no devices are available

Open BenWibking opened this issue 1 year ago • 4 comments

Problem Description

Any HIP program that uses managed variables crashes immediately with a segmentation fault and shows the following backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff69448a8 in ihipMallocManaged(void**, unsigned long, unsigned int) () from /opt/rocm-6.0.0/lib/libamdhip64.so.6
#0  0x00007ffff69448a8 in ihipMallocManaged(void**, unsigned long, unsigned int) () from /opt/rocm-6.0.0/lib/libamdhip64.so.6
#1  0x00007ffff6a140ec in __hipRegisterManagedVar () from /opt/rocm-6.0.0/lib/libamdhip64.so.6
#2  0x0000000000209cc1 in __hip_module_ctor ()
#3  0x00007ffff60296fb in __libc_start_main_impl () from /lib64/libc.so.6
#4  0x0000000000209bc5 in _start ()

Operating System

Rocky Linux 9.5 (Blue Onyx)

CPU

AMD EPYC 7413 24-Core Processor

GPU

AMD Instinct MI210

ROCm Version

ROCm 6.3.0

ROCm Component

HIP

Steps to Reproduce

Complete reproducer:

#!/bin/sh

cat <<EOF > reproducer_hip.cpp
__managed__ int managed_var;
int main()
{
  return 0;
}
EOF

hipcc -g --offload-arch=gfx90a reproducer_hip.cpp
gdb.minimal -batch -ex "run" -ex "bt" ./a.out 2>&1 | grep -v ^"No stack."$

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.10.5 is loaded
Unable to open /dev/kfd read-write: Permission denied
bwibking is not member of "video" group, the default DRM access group. Users must be a member of the "video" group or another DRM access group in order for ROCm applications to run successfully.
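
The rocminfo output above suggests that the process cannot open /dev/kfd read-write, so the HIP runtime sees no devices. The following is a minimal sketch of how one might check for that condition before running a HIP binary. The helper names check_kfd_access and in_group are hypothetical and are not part of ROCm; only the /dev/kfd path and the "video" group name are taken from the output above.

```shell
#!/bin/sh
# Hypothetical helper: report whether the current user can open a device
# node read-write. Defaults to /dev/kfd, the path named by rocminfo.
check_kfd_access() {
  dev="${1:-/dev/kfd}"
  if [ ! -e "$dev" ]; then
    echo "missing"
  elif [ -r "$dev" ] && [ -w "$dev" ]; then
    echo "ok"
  else
    echo "denied"
  fi
}

# Hypothetical helper: succeed if the group name given as $1 appears in
# the list of groups printed by `id -nG` for the current user.
in_group() {
  id -nG | tr ' ' '\n' | grep -qxF "$1"
}

echo "/dev/kfd access: $(check_kfd_access)"
if in_group video; then
  echo "user is in the video group"
else
  echo "user is not in the video group"
fi
```

A result of "denied" or "missing" would match the situation in this report, where the crash occurs with no usable devices. As the rocminfo message notes, a system may use a DRM access group other than "video", so the group check is only a hint.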

Additional Information

No response

BenWibking, Dec 27 '24 03:12

Hi @BenWibking. Internal ticket has been created to investigate your issue. Thanks!

ppanchad-amd, Dec 30 '24 17:12

@ppanchad-amd Is there any update on this?

BenWibking, Mar 12 '25 20:03

Hi @BenWibking. There is a PR (https://github.com/ROCm/clr/pull/122) that should fix this issue. Thanks!

ppanchad-amd, May 05 '25 16:05

FYI Old PR became stale. The new PR is: https://github.com/ROCm/clr/pull/160

benrichard-amd, May 05 '25 19:05

This is addressed by internal PR https://github.com/AMD-ROCm-Internal/clr/pull/597. Since it is in the internal repo, you may not be able to access the link. I will update the status here once the change is merged.

iassiour, Jun 27 '25 08:06

@BenWibking The issue has been resolved by the commit "Defer allocation of managed variable", which is now merged into the staging branch and will be included in a future release. Thanks again for the report.

iassiour, Aug 07 '25 10:08