[OpenMP][AMDGPU] hit assertion when exiting

Open ye-luo opened this issue 2 years ago • 3 comments

Checked at b0c4cd35df89479ec152c1f79e18d0264dd276cc with the reproducer code.

$ clang++ -fopenmp --offload-arch=gfx906  target_taskwait.cpp && LIBOMPTARGET_DEBUG=1 OMP_TARGET_OFFLOAD=mandatory ./a.out
'+atomic-fadd-insts' is not a recognized feature for this target (ignoring feature)
'+atomic-fadd-insts' is not a recognized feature for this target (ignoring feature)
outside a = 0 addr 0x7ffd43530878
....
Target AMDGPU RTL --> Finalizing the AMDGPU DeviceInfo.
a.out: tpp.c:82: __pthread_tpp_change_priority: Assertion `new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)' failed.
Aborted (core dumped)

ye-luo avatar Aug 12 '22 20:08 ye-luo

@llvm/issue-subscribers-openmp

llvmbot avatar Aug 12 '22 21:08 llvmbot

CI still green, oddly. Wonder if that's running with asserts disabled https://lab.llvm.org/buildbot/#/builders/193

JonChesterfield avatar Aug 15 '22 20:08 JonChesterfield

It is also possible that the tests under CI don't cover the reproducer case.

ye-luo avatar Aug 15 '22 20:08 ye-luo

@JonChesterfield does the reproducer fail on your machine?

ye-luo avatar Aug 17 '22 15:08 ye-luo

@JonChesterfield @jhuber6 Any insights?

ye-luo avatar Aug 23 '22 16:08 ye-luo

~~The dlopen libhsa build has this issue. If the plugin is built against libhsa directly, there is no problem.~~ In the scenarios where my test passes, strace shows libomptarget.rtl.amdgpu.so being pulled from rocm/5.2.0. Once I fix LD_LIBRARY_PATH, the issue is reproducible on any machine.
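
For anyone who wants to confirm from inside the process which copy of the plugin was actually loaded, here is a minimal sketch, assuming Linux with glibc. It uses `dl_iterate_phdr` to list loaded shared objects whose path contains a given substring. The helper name `loadedLibrariesMatching` and the `Query` struct are made up for illustration and are not part of libomptarget. The plugin is loaded at run time, so the call only reports it once offloading has been initialized.

```cpp
#include <link.h>   // dl_iterate_phdr, struct dl_phdr_info (Linux/glibc)
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper state: the substring to look for and the matches found.
struct Query {
  std::string needle;
  std::vector<std::string> matches;
};

// Callback invoked once per loaded object; records paths containing the needle.
static int collectMatches(struct dl_phdr_info *info, size_t /*size*/, void *data) {
  auto *query = static_cast<Query *>(data);
  std::string path = info->dlpi_name ? info->dlpi_name : "";
  if (!path.empty() && path.find(query->needle) != std::string::npos)
    query->matches.push_back(path);
  return 0; // returning 0 keeps the iteration going
}

// Returns the paths of all currently loaded shared objects containing `needle`.
std::vector<std::string> loadedLibrariesMatching(const std::string &needle) {
  Query query{needle, {}};
  dl_iterate_phdr(collectMatches, &query);
  return query.matches;
}
```

Calling `loadedLibrariesMatching("libomptarget.rtl.amdgpu")` after the first target region and printing the result should show whether the path points at the LLVM install or at the ROCm one.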

ye-luo avatar Aug 24 '22 22:08 ye-luo

> CI still green, oddly. Wonder if that's running with asserts disabled https://lab.llvm.org/buildbot/#/builders/193

@JonChesterfield @ronlieb could you check whether the desired libomptarget.rtl.amdgpu.so gets picked up during test-openmp?

ye-luo avatar Aug 27 '22 23:08 ye-luo

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff693e7f1 in __GI_abort () at abort.c:79
#2  0x00007ffff692e3fa in __assert_fail_base (fmt=0x7ffff6ab56c0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7ffff6d03830 "INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust", 
    file=file@entry=0x7ffff6d03775 "../nptl/pthread_mutex_lock.c", line=line@entry=433, 
    function=function@entry=0x7ffff6d038e0 <__PRETTY_FUNCTION__.8935> "__pthread_mutex_lock_full") at assert.c:92
#3  0x00007ffff692e472 in __GI___assert_fail (
    assertion=assertion@entry=0x7ffff6d03830 "INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust", 
    file=file@entry=0x7ffff6d03775 "../nptl/pthread_mutex_lock.c", line=line@entry=433, 
    function=function@entry=0x7ffff6d038e0 <__PRETTY_FUNCTION__.8935> "__pthread_mutex_lock_full") at assert.c:101
#4  0x00007ffff6cf8fa3 in __pthread_mutex_lock_full (mutex=0x5555557ae7c0) at ../nptl/pthread_mutex_lock.c:433
#5  0x00007fffed5b2759 in ?? () from /scratch2/packages/rocm/rocm-5.2.0/lib/libhsa-runtime64.so.1
#6  0x00007fffed5ccedc in ?? () from /scratch2/packages/rocm/rocm-5.2.0/lib/libhsa-runtime64.so.1
#7  0x00007fffed5ccf49 in ?? () from /scratch2/packages/rocm/rocm-5.2.0/lib/libhsa-runtime64.so.1
#8  0x00007fffed5d9942 in ?? () from /scratch2/packages/rocm/rocm-5.2.0/lib/libhsa-runtime64.so.1
#9  0x00007fffeda1e475 in RTLDeviceInfoTy::~RTLDeviceInfoTy() ()
   from /scratch2/packages/llvm/master-nightly/lib/libomptarget.rtl.amdgpu.so
#10 0x00007fffeda1ee75 in __tgt_rtl_deinit_plugin () from /scratch2/packages/llvm/master-nightly/lib/libomptarget.rtl.amdgpu.so
#11 0x00007ffff712e9d4 in deinit() () from /scratch2/packages/llvm/master-nightly/lib/libomptarget.so.16git
#12 0x00007ffff7de3d13 in _dl_fini () at dl-fini.c:138
#13 0x00007ffff6941031 in __run_exit_handlers (status=0, listp=0x7ffff6ce9718 <__exit_funcs>, 
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#14 0x00007ffff694112a in __GI_exit (status=<optimized out>) at exit.c:139
#15 0x00007ffff691fc8e in __libc_start_main (main=0x555555554af0 <main>, argc=1, argv=0x7fffffff85c8, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffff85b8) at ../csu/libc-start.c:344
#16 0x0000555555554a0a in _start ()

ye-luo avatar Aug 28 '22 00:08 ye-luo

Do you know if this is a recent problem? We changed the lifetime of when this destructor is called primarily so we can still call the user's destructors on the GPU. It's possible that it's outliving something it has a reference to. The last change was done in this patch.
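
To make the lifetime concern concrete, here is a small sketch of the general hazard: teardown code that runs late and touches something that may already be gone. This is not the libomptarget code and not the referenced patch. `Runtime` and `DeviceInfo` are hypothetical stand-ins, and holding a `std::weak_ptr` is just one assumed way to detect that the dependency has already been destroyed.

```cpp
#include <memory>

// Hypothetical stand-in for a lower-level runtime the plugin depends on.
struct Runtime {
  int shutdownCalls = 0;
  void shutdown() { ++shutdownCalls; }
};

// Hypothetical stand-in for plugin state that is finalized at process exit.
class DeviceInfo {
public:
  explicit DeviceInfo(const std::shared_ptr<Runtime> &runtime)
      : runtime_(runtime) {}

  // Returns true if the runtime was still alive and was shut down,
  // false if it had already been destroyed and teardown was skipped.
  bool finalize() {
    if (auto runtime = runtime_.lock()) {
      runtime->shutdown();
      return true;
    }
    return false;
  }

private:
  // Non-owning reference: it does not extend the runtime's lifetime,
  // but it can tell whether the runtime still exists.
  std::weak_ptr<Runtime> runtime_;
};
```

With a raw pointer in place of the `weak_ptr`, the second case would be a use of a destroyed object, which is the kind of situation that can surface as an assertion deep inside a lower-level library at exit.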

jhuber6 avatar Aug 28 '22 00:08 jhuber6

Was this fixed by https://github.com/llvm/llvm-project/commit/292cb114b0a35da5e35eb856c29deff577c54210?

jhuber6 avatar Sep 17 '22 12:09 jhuber6

> Was this fixed by 292cb11?

Yes

ye-luo avatar Sep 17 '22 16:09 ye-luo