
Intercept mode deadlocks with multiple threads driving separate GPUs

Status: Open • gcongiu opened this issue 1 year ago • 4 comments

Running the intercept_multi_thread_monitoring test in PAPI (located in papi/src/components/rocm/tests/intercept_multi_thread_monitoring) deadlocks. The test runs a matrix-matrix multiplication kernel using multiple threads, each of which drives and monitors its own dedicated GPU. The backtrace from rocgdb follows:

#0  0x00007ffff599e54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007ffff5999eb6 in _L_lock_941 () from /lib64/libpthread.so.0
#2  0x000000010440c741 in ?? ()
#3  0x00007ffeeb9999d1 in rocprofiler::util::HsaRsrcFactory::GetKernelNameRef(unsigned long) () from /opt/rocm-5.5.0/lib/librocprofiler64.so
#4  0x00007ffeeb982f33 in rocprofiler::InterceptQueue::QueryKernelName(unsigned long, amd_kernel_code_s const*) ()
   from /opt/rocm-5.5.0/lib/librocprofiler64.so
#5  0x00007ffeeb9898e5 in rocprofiler::InterceptQueue::OnSubmitCB(void const*, unsigned long, unsigned long, void*, void (*)(void const*, unsigned long)) () from /opt/rocm-5.5.0/lib/librocprofiler64.so
#6  0x00007fffece1aafa in rocr::core::InterceptQueue::StoreRelaxed(long) () from /opt/rocm-5.5.0/lib/libhsa-runtime64.so.1
#7  0x00007fffece0d9a8 in rocr::HSA::hsa_signal_store_screlease(hsa_signal_s, long) () from /opt/rocm-5.5.0/lib/libhsa-runtime64.so.1
#8  0x00007ffff66850c4 in bool roc::VirtualGPU::dispatchGenericAqlPacket<hsa_kernel_dispatch_packet_s>(hsa_kernel_dispatch_packet_s*, unsigned short, unsigned short, bool, unsigned long) () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#9  0x00007ffff66828d7 in roc::VirtualGPU::submitKernelInternal(amd::NDRangeContainer const&, amd::Kernel const&, unsigned char const*, void*, unsigned int, amd::NDRangeKernelCommand*, hsa_kernel_dispatch_packet_s*) () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#10 0x00007ffff6683638 in roc::VirtualGPU::submitKernel(amd::NDRangeKernelCommand&) () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#11 0x00007ffff6654d1a in amd::Command::enqueue() () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#12 0x00007ffff657b723 in ihipModuleLaunchKernel(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, unsigned long, unsigned int) () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#13 0x00007ffff65a27a7 in ihipLaunchKernel(void const*, dim3, dim3, void**, unsigned long, ihipStream_t*, ihipEvent_t*, ihipEvent_t*, int) ()
   from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#14 0x00007ffff657b5a2 in hipLaunchKernel_common () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#15 0x00007ffff6589e12 in hipLaunchKernel () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#16 0x0000000000473597 in __device_stub__matmul(float*, float*, float*, int) ()
#17 0x0000000000473aa5 in hip_do_matmul_work (handle=0xa88530, stream=0xa1f610) at matmul.cpp:111
#18 0x00000000004730fc in .omp_outlined._debug__(int &, const char *(&)[4], int &, hipError_t &) const (.global_tid.=0x7fffffffc670,
    .bound_tid.=0x7fffffffc668, papi_errno=@0x7fffffffca6c: 0, events=..., pass_with_warning=@0x7fffffffca68: 0,
    hip_errno=@0x7fffffffca64: hipSuccess) at multi_thread_monitoring.cpp:103
#19 0x00000000004734cd in .omp_outlined.(void) const (.global_tid.=0x7fffffffc670, .bound_tid.=0x7fffffffc668, papi_errno=@0x7fffffffca6c: 0,
    events=..., pass_with_warning=@0x7fffffffca68: 0, hip_errno=@0x7fffffffca64: hipSuccess) at multi_thread_monitoring.cpp:63
#20 0x00007ffff7fcbf43 in __kmp_invoke_microtask () from /opt/rocm-5.5.0/llvm/bin/../lib/libomp.so
#21 0x00007ffff7f5177f in __kmp_invoke_task_func () from /opt/rocm-5.5.0/llvm/bin/../lib/libomp.so
#22 0x00007ffff7f4b85b in __kmp_fork_call () from /opt/rocm-5.5.0/llvm/bin/../lib/libomp.so
#23 0x00007ffff7f3cb05 in __kmpc_fork_call () from /opt/rocm-5.5.0/llvm/bin/../lib/libomp.so
#24 0x0000000000472da2 in multi_thread (argc=1, argv=0x7fffffffcb88) at multi_thread_monitoring.cpp:63
#25 0x00000000003eaf80 in main (argc=1, argv=0x7fffffffcb88) at intercept_multi_thread_monitoring.cpp:13

The problem was reproduced with ROCm 5.5.0 RC5. The test was run on two MI210 GPUs.

gcongiu, May 17 '23

I verified this with the ROCm 5.5.0 stable release, and the problem is present there as well.

gcongiu, May 18 '23

@ammarwa, any update on this?

gcongiu, Jun 09 '23

Added a reproducer: issue-113.tar.gz

gcongiu, Jun 13 '23

@gcongiu Apologies for the lack of response. Can you please check if your issue still exists with the latest ROCm 6.2? If so, we will further investigate the issue. Thanks!

ppanchad-amd, Aug 28 '24

Hi @gcongiu, I have not been able to reproduce this on ROCm 6.2 using 2 MI210s and the reproducer code you provided. Can you try upgrading to ROCm 6.2 and check if the issue persists?

sohaibnd, Oct 01 '24