llvm-project
[OpenMP][AMDGPU] hit assertion when exiting
Checked at commit b0c4cd35df89479ec152c1f79e18d0264dd276cc with the reproducer code.
$ clang++ -fopenmp --offload-arch=gfx906 target_taskwait.cpp && LIBOMPTARGET_DEBUG=1 OMP_TARGET_OFFLOAD=mandatory ./a.out
'+atomic-fadd-insts' is not a recognized feature for this target (ignoring feature)
'+atomic-fadd-insts' is not a recognized feature for this target (ignoring feature)
outside a = 0 addr 0x7ffd43530878
....
Target AMDGPU RTL --> Finalizing the AMDGPU DeviceInfo.
a.out: tpp.c:82: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
Aborted (core dumped)
@llvm/issue-subscribers-openmp
CI still green, oddly. Wonder if that's running with asserts disabled https://lab.llvm.org/buildbot/#/builders/193
It is also possible that the tests under CI don't cover the reproducer case.
@JonChesterfield does the reproducer fail on your machine?
@JonChesterfield @jhuber6 Any insights?
~~The dlopen libhsa build has this issue. If the plugin is built against libhsa directly, there is no problem.~~ In the scenarios where my test passed, strace showed libomptarget.rtl.amdgpu.so being pulled from rocm/5.2.0. Once I fixed LD_LIBRARY_PATH, the issue is reproducible on any machine.
> CI still green, oddly. Wonder if that's running with asserts disabled https://lab.llvm.org/buildbot/#/builders/193

@JonChesterfield @ronlieb could you check if the desired libomptarget.rtl.amdgpu.so gets picked during test-openmp?
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff693e7f1 in __GI_abort () at abort.c:79
#2 0x00007ffff692e3fa in __assert_fail_base (fmt=0x7ffff6ab56c0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x7ffff6d03830 "INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust",
file=file@entry=0x7ffff6d03775 "../nptl/pthread_mutex_lock.c", line=line@entry=433,
function=function@entry=0x7ffff6d038e0 <__PRETTY_FUNCTION__.8935> "__pthread_mutex_lock_full") at assert.c:92
#3 0x00007ffff692e472 in __GI___assert_fail (
assertion=assertion@entry=0x7ffff6d03830 "INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust",
file=file@entry=0x7ffff6d03775 "../nptl/pthread_mutex_lock.c", line=line@entry=433,
function=function@entry=0x7ffff6d038e0 <__PRETTY_FUNCTION__.8935> "__pthread_mutex_lock_full") at assert.c:101
#4 0x00007ffff6cf8fa3 in __pthread_mutex_lock_full (mutex=0x5555557ae7c0) at ../nptl/pthread_mutex_lock.c:433
#5 0x00007fffed5b2759 in ?? () from /scratch2/packages/rocm/rocm-5.2.0/lib/libhsa-runtime64.so.1
#6 0x00007fffed5ccedc in ?? () from /scratch2/packages/rocm/rocm-5.2.0/lib/libhsa-runtime64.so.1
#7 0x00007fffed5ccf49 in ?? () from /scratch2/packages/rocm/rocm-5.2.0/lib/libhsa-runtime64.so.1
#8 0x00007fffed5d9942 in ?? () from /scratch2/packages/rocm/rocm-5.2.0/lib/libhsa-runtime64.so.1
#9 0x00007fffeda1e475 in RTLDeviceInfoTy::~RTLDeviceInfoTy() ()
from /scratch2/packages/llvm/master-nightly/lib/libomptarget.rtl.amdgpu.so
#10 0x00007fffeda1ee75 in __tgt_rtl_deinit_plugin () from /scratch2/packages/llvm/master-nightly/lib/libomptarget.rtl.amdgpu.so
#11 0x00007ffff712e9d4 in deinit() () from /scratch2/packages/llvm/master-nightly/lib/libomptarget.so.16git
#12 0x00007ffff7de3d13 in _dl_fini () at dl-fini.c:138
#13 0x00007ffff6941031 in __run_exit_handlers (status=0, listp=0x7ffff6ce9718 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#14 0x00007ffff694112a in __GI_exit (status=<optimized out>) at exit.c:139
#15 0x00007ffff691fc8e in __libc_start_main (main=0x555555554af0 <main>, argc=1, argv=0x7fffffff85c8, init=<optimized out>,
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffff85b8) at ../csu/libc-start.c:344
#16 0x0000555555554a0a in _start ()
Do you know if this is a recent problem? We changed the lifetime of when this destructor is called primarily so we can still call the user's destructors on the GPU. It's possible that it's outliving something it has a reference to. The last change was done in this patch.
Was this fixed by https://github.com/llvm/llvm-project/commit/292cb114b0a35da5e35eb856c29deff577c54210?