
device plugin deadlock on an atmi_malloc failure

Open ye-luo opened this issue 4 years ago • 7 comments

On my APU laptop, when the graphics memory is set low, a memory allocation failure causes a deadlock in the device plugin.

[/home/estewart/git/aomp11/amd-llvm-project/openmp/libomptarget/plugins/hsa/impl/data.cpp:99] atmi_malloc failed: HSA_STATUS_ERROR_INVALID_ALLOCATION

backtrace

__lll_lock_wait (futex=futex@entry=0xe446d8, private=0) at lowlevellock.c:52
52  lowlevellock.c: No such file or directory.
(gdb) bt
#0  __lll_lock_wait (futex=futex@entry=0xe446d8, private=0) at lowlevellock.c:52
#1  0x00007f5d95fd20a3 in __GI___pthread_mutex_lock (mutex=0xe446d8) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f5d96008caf in __gthread_mutex_lock (__mutex=0xe446d8) at /usr/include/x86_64-linux-gnu/c++/7/bits/gthr-default.h:748
#3  0x00007f5d9600a74a in std::mutex::lock (this=0xe446d8) at /usr/include/c++/7/bits/std_mutex.h:103
#4  0x00007f5d96016ae2 in RTLsTy::UnregisterLib (this=0xe38540, desc=0x4913e0 <omp_offloading.descriptor>) at /usr/lib/aomp_11.8-0/lib-debug/src/openmp/libomptarget/src/rtl.cpp:442
#5  0x00007f5d9601260a in __tgt_unregister_lib (desc=0x4913e0 <omp_offloading.descriptor>) at /usr/lib/aomp_11.8-0/lib-debug/src/openmp/libomptarget/src/interface.cpp:86
#6  0x00007f5d975d4f5b in ?? () from /lib64/ld-linux-x86-64.so.2
#7  0x00007f5d95e1da27 in __run_exit_handlers (status=1, listp=0x7f5d95fbf718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#8  0x00007f5d95e1dbe0 in __GI_exit (status=<optimized out>) at exit.c:139
#9  0x00007f5d5558ab3b in core::Runtime::Malloc (ptr=0x7ffd557ca6c0, size=2383275008, place=...) at /usr/lib/aomp_11.8-0/lib-debug/src/openmp/libomptarget/plugins/hsa/impl/data.cpp:99
#10 0x00007f5d55586819 in atmi_malloc (ptr=0x7ffd557ca6c0, size=2383275008, place=...) at /usr/lib/aomp_11.8-0/lib-debug/src/openmp/libomptarget/plugins/hsa/impl/atmi.cpp:50
#11 0x00007f5d555d43f2 in __tgt_rtl_load_binary_locked (device_id=0, image=0x4913c0 <omp_offloading.device_images>) at /usr/lib/aomp_11.8-0/lib-debug/src/openmp/libomptarget/plugins/hsa/src/rtl.cpp:1018
#12 0x00007f5d555d3dfb in __tgt_rtl_load_binary (device_id=0, image=0x4913c0 <omp_offloading.device_images>) at /usr/lib/aomp_11.8-0/lib-debug/src/openmp/libomptarget/plugins/hsa/src/rtl.cpp:935
#13 0x00007f5d9600fa14 in DeviceTy::load_binary (this=0xe44600, Img=0x4913c0 <omp_offloading.device_images>) at /usr/lib/aomp_11.8-0/lib-debug/src/openmp/libomptarget/src/device.cpp:340
#14 0x00007f5d96023238 in InitLibrary (Device=...) at /usr/lib/aomp_11.8-0/lib-debug/src/openmp/libomptarget/src/omptarget.cpp:96
#15 0x00007f5d96023a14 in CheckDeviceAndCtors (device_id=0) at /usr/lib/aomp_11.8-0/lib-debug/src/openmp/libomptarget/src/omptarget.cpp:205
#16 0x00007f5d96012707 in __tgt_target_data_begin (device_id=0, arg_num=1, args_base=0x7ffd557ca9d8, args=0x7ffd557ca9d0, arg_sizes=0x7ffd557ca9c8, arg_types=0x45cd10)
    at /usr/lib/aomp_11.8-0/lib-debug/src/openmp/libomptarget/src/interface.cpp:105
#17 0x0000000000413cdf in qmcplusplus::Vector<multi_UBspline_3d_d*, qmcplusplus::OMPallocator<multi_UBspline_3d_d*, std::allocator<multi_UBspline_3d_d*> > >::resize(unsigned long, multi_UBspline_3d_d*)
    ()
#18 0x0000000000414211 in qmcplusplus::einspline_spo_omp<double>::set(int, int, int, int, int, bool) ()
#19 0x000000000040680f in main ()

It seems that the device initialization failure caused a deadlock on a mutex.

ye-luo avatar Oct 05 '20 14:10 ye-luo

I remember looking into this. If a function in the plugin calls exit(), that calls UnregisterLib. However, UnregisterLib takes the same mutex lock as InitLibrary. That is, the plugin cannot safely exit, at least on some code paths.

Unfortunately that constraint was not anticipated, and the library does a lot of 'on error, exit', all of which need to be methodically stripped out to produce a robust implementation. This will be iterative.

The host runtime probably shouldn't deadlock if a plugin exits. I'm not sure how invasive fixing that would be.
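The 'report the error instead of exiting' direction can be sketched as follows. This is only a minimal model, not the libomptarget code: `LoadStatus`, `runtime_mutex`, `plugin_load_binary`, `init_library` and `unregister_library` are hypothetical names, and the memory check is simulated with two plain numbers. The point is that the plugin returns a status, so the host releases its lock before any unregister step runs.

```cpp
#include <cstddef>
#include <cstdio>
#include <mutex>

// Hypothetical status codes; the real plugin interface differs.
enum class LoadStatus { Success, OutOfMemory };

// Stand-in for the mutex shared by library init and unregister.
static std::mutex runtime_mutex;

// Hypothetical plugin entry point. On failure it reports to the caller
// instead of calling exit() while the host still holds runtime_mutex.
LoadStatus plugin_load_binary(std::size_t bytes_needed,
                              std::size_t bytes_available) {
  if (bytes_needed > bytes_available) {
    std::fprintf(stderr, "plugin: allocation of %zu bytes failed\n",
                 bytes_needed);
    return LoadStatus::OutOfMemory;
  }
  return LoadStatus::Success;
}

// Host side. The lock_guard releases the mutex on every return path,
// including the failure path.
bool init_library(std::size_t bytes_needed, std::size_t bytes_available) {
  std::lock_guard<std::mutex> guard(runtime_mutex);
  return plugin_load_binary(bytes_needed, bytes_available) ==
         LoadStatus::Success;
}

// Stand-in for the unregister step, which takes the same mutex.
bool unregister_library() {
  std::lock_guard<std::mutex> guard(runtime_mutex);
  return true;
}
```

With this shape, a failed `init_library` call returns `false` and a later `unregister_library` call can still take the lock. If the plugin called `exit()` instead, the exit handler would try to lock the mutex that the same thread already holds.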

JonChesterfield avatar Oct 15 '20 14:10 JonChesterfield

This is both a bug and an enhancement. The enhancement is to check resources during initialization and warn if there is a problem. The other enhancement is to test on an APU. The bug is the deadlock. We are moving away from atmi_malloc, so maybe that will resolve the bug. I am going to keep this ticket on my plate and decide later what to do.

Any hints on easily recreating this would be appreciated.

gregrodgers avatar Oct 26 '20 13:10 gregrodgers

The deadlock repros if anything calls exit() during load_binary. The fix probably involves never calling exit or abort (or throwing).

The out of memory can probably be triggered reliably on a gfx906ish by increasing MAX_SM until the structure no longer fits, or by allocating some factor over the size of the structure in the runtime.

It's not totally clear how to handle out of memory. Sometimes allocating from system instead is better than failing. We may want different behaviour for APU vs GPU.
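One possible shape for the 'allocate from system instead' option is sketched below. Everything here is an assumption for illustration: `device_alloc` is a hypothetical allocator simulated with a small byte budget and `std::malloc`, and the `allow_host_fallback` flag stands in for whatever policy would distinguish an APU from a discrete GPU.

```cpp
#include <cstddef>
#include <cstdlib>

enum class MemKind { None, Device, Host };

struct Allocation {
  void* ptr;
  MemKind kind;
};

// Simulated device memory budget (hypothetical; a real plugin would
// ask the device runtime instead).
static std::size_t device_bytes_free = 1024;

// Hypothetical device allocator: fails when the budget is exceeded.
void* device_alloc(std::size_t bytes) {
  if (bytes == 0 || bytes > device_bytes_free) return nullptr;
  void* p = std::malloc(bytes);
  if (p) device_bytes_free -= bytes;
  return p;
}

// Try device memory first. On failure, optionally fall back to host
// memory rather than failing, and never exit the process.
Allocation allocate(std::size_t bytes, bool allow_host_fallback) {
  if (bytes == 0) return {nullptr, MemKind::None};
  if (void* p = device_alloc(bytes)) return {p, MemKind::Device};
  if (allow_host_fallback) {
    if (void* p = std::malloc(bytes)) return {p, MemKind::Host};
  }
  return {nullptr, MemKind::None};
}

// Return the memory and, for device allocations, restore the budget.
void release(Allocation a, std::size_t bytes) {
  if (a.kind == MemKind::Device) device_bytes_free += bytes;
  std::free(a.ptr);
}
```

The caller can inspect `kind` to decide whether a host-backed allocation is acceptable, which keeps the policy decision out of the allocator itself.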

JonChesterfield avatar Oct 28 '20 01:10 JonChesterfield

Do we have a smoke test to recreate this? Is this still deadlocking with AOMP 13.0-2?

gregrodgers avatar Apr 20 '21 12:04 gregrodgers

I set 512MB of GPU memory in my laptop BIOS and I can still reproduce the issue with ROCm 4.1.0. The call stack depth seems to have been reduced.

(gdb) bt
#0  __lll_lock_wait (futex=futex@entry=0x183d898, private=0) at lowlevellock.c:52
#1  0x00007f0bd30d30a3 in __GI___pthread_mutex_lock (mutex=0x183d898) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f0bd3113d1a in RTLsTy::UnregisterLib(__tgt_bin_desc*) () from /opt/rocm/llvm/lib/libomptarget.so
#3  0x00007f0bd4661f5b in ?? () from /lib64/ld-linux-x86-64.so.2
#4  0x00007f0bd2f1ea27 in __run_exit_handlers (status=1, listp=0x7f0bd30c0718 <__exit_funcs>, 
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#5  0x00007f0bd2f1ebe0 in __GI_exit (status=<optimized out>) at exit.c:139
#6  0x00007f0b92739e2d in core::Runtime::Malloc(void**, unsigned long, atmi_mem_place_s) ()
   from /opt/rocm/llvm/lib/libomptarget.rtl.amdgpu.so
#7  0x00007f0b927526a7 in __tgt_rtl_load_binary () from /opt/rocm/llvm/lib/libomptarget.rtl.amdgpu.so
#8  0x00007f0bd310b477 in DeviceTy::load_binary(void*) () from /opt/rocm/llvm/lib/libomptarget.so
#9  0x00007f0bd31187b0 in CheckDeviceAndCtors(long) () from /opt/rocm/llvm/lib/libomptarget.so
#10 0x00007f0bd310f512 in __tgt_target_data_begin_mapper () from /opt/rocm/llvm/lib/libomptarget.so
#11 0x0000000000413bb4 in qmcplusplus::Vector<multi_UBspline_3d_d*, qmcplusplus::OMPallocator<multi_UBspline_3d_d*, qmcplusplus::Mallocator<multi_UBspline_3d_d*, 32ul> > >::resize(unsigned long, multi_UBspline_3d_d*) ()
#12 0x00000000004140f1 in qmcplusplus::einspline_spo_omp<double>::set(int, int, int, int, int, bool) ()
#13 0x00000000004066e5 in main ()

ye-luo avatar Apr 21 '21 01:04 ye-luo

I'm told of a related hazard. If the amdgpu plugin is called and finds no amdgpu, it may call exit. If it does, a system containing multiple plugins that tries amdgpu first can give up before it tries the later ones. That bug is latent outside of aomp, but it is prudent to fix it before it becomes active elsewhere.
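A rough model of the intended behaviour, assuming each plugin can report how many devices it found: a plugin that finds nothing returns zero and the host moves on to the next one. The `Plugin` struct and `usable_plugins` function are hypothetical and only illustrate the control flow.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical plugin descriptor. number_of_devices returns 0 when the
// plugin finds no matching hardware, instead of exiting the process.
struct Plugin {
  std::string name;
  std::function<int()> number_of_devices;
};

// Probe every candidate in order and keep those that found a device.
// A plugin reporting zero devices does not stop the search.
std::vector<std::string> usable_plugins(const std::vector<Plugin>& candidates) {
  std::vector<std::string> usable;
  for (const Plugin& p : candidates) {
    if (p.number_of_devices() > 0) usable.push_back(p.name);
  }
  return usable;
}
```

If the first plugin called exit() inside `number_of_devices`, the loop would never reach the later candidates, which is the hazard described above.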

JonChesterfield avatar Apr 26 '21 14:04 JonChesterfield

This, and a variety of other unhappy-path bugs, is expected to be fixed by https://reviews.llvm.org/D102346

JonChesterfield avatar May 12 '21 18:05 JonChesterfield

From JonC, who can no longer access the aomp GitHub: believed fixed by D102346.

ronlieb avatar Mar 30 '23 13:03 ronlieb