core::Signal::WaitAny never gets the signal
I am trying to run OpenCL applications on the APU that comes with the Ryzen 3400G (i.e., gfx902+xnack). Sometimes they work and return correct results, but sometimes they block forever. I traced one with gdb and found that the main thread is blocked in clCreateCommandQueue(), waiting for events:
#1 do_futex_wait (sem=sem@entry=0x627148, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007ffff6f6c8d4 in __new_sem_wait_slow (sem=0x627148, abstime=0x0) at sem_waitcommon.c:181
#3 0x00007ffff6f6c97a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
#4 0x00007ffff688df60 in amd::Semaphore::wait() () from /opt/rocm/lib/../opencl/lib/libamdocl64.so
#5 0x00007ffff688dd4a in amd::Monitor::wait() () from /opt/rocm/lib/../opencl/lib/libamdocl64.so
#6 0x00007ffff68a3bb4 in amd::HostQueue::HostQueue(amd::Context&, amd::Device&, unsigned long, unsigned int, amd::CommandQueue::Priority) ()
from /opt/rocm/lib/../opencl/lib/libamdocl64.so
#7 0x00007ffff687fa56 in clCreateCommandQueueWithProperties () from /opt/rocm/lib/../opencl/lib/libamdocl64.so
#8 0x00007ffff687fd28 in clCreateCommandQueue () from /opt/rocm/lib/../opencl/lib/libamdocl64.so
#9 0x0000000000401193 in main ()
It is waiting for an update from another thread, which is thread 2 in gdb. The trace of thread 2 looks like this:
#0 0x00007ffff7276f47 in ioctl () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff6330654 in kmtIoctl () from /usr/local/lib/libhsakmt.so.1
#2 0x00007ffff632a84f in hsaKmtWaitOnMultipleEvents () from /usr/local/lib/libhsakmt.so.1
#3 0x00007ffff65aa2a3 in core::Signal::WaitAny(unsigned int, hsa_signal_s const*, hsa_signal_condition_t const*, long const*, unsigned long, hsa_wait_state_t, long*) ()
from /opt/rocm/lib/../opencl/lib/../../lib/libhsa-runtime64.so.1
#4 0x00007ffff6591516 in AMD::hsa_amd_signal_wait_any(unsigned int, hsa_signal_s*, hsa_signal_condition_t*, long*, unsigned long, hsa_wait_state_t, long*) ()
from /opt/rocm/lib/../opencl/lib/../../lib/libhsa-runtime64.so.1
#5 0x00007ffff65a256a in core::Runtime::AsyncEventsLoop(void*) () from /opt/rocm/lib/../opencl/lib/../../lib/libhsa-runtime64.so.1
#6 0x00007ffff6561217 in os::ThreadTrampoline(void*) () from /opt/rocm/lib/../opencl/lib/../../lib/libhsa-runtime64.so.1
#7 0x00007ffff6f646ba in start_thread (arg=0x7ffff5adf700) at pthread_create.c:333
#8 0x00007ffff728141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
If I run with HSA_ENABLE_INTERRUPT=0, I get the following instead:
#0 0x00007ffff66e6232 in core::Signal::WaitAny(unsigned int, hsa_signal_s const*, hsa_signal_condition_t const*, long const*, unsigned long, hsa_wait_state_t, long*) ()
from /opt/rocm/lib/libhsa-runtime64.so.1
#1 0x00007ffff66cd516 in AMD::hsa_amd_signal_wait_any(unsigned int, hsa_signal_s*, hsa_signal_condition_t*, long*, unsigned long, hsa_wait_state_t, long*) ()
from /opt/rocm/lib/libhsa-runtime64.so.1
#2 0x00007ffff66de56a in core::Runtime::AsyncEventsLoop(void*) () from /opt/rocm/lib/libhsa-runtime64.so.1
#3 0x00007ffff669d217 in os::ThreadTrampoline(void*) () from /opt/rocm/lib/libhsa-runtime64.so.1
#4 0x00007ffff79ab6ba in start_thread (arg=0x7ffeef514700) at pthread_create.c:333
#5 0x00007ffff6c5841d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Adding some printf statements shows that the program is stuck in an infinite loop because it never receives the signal.
Originally I thought it was a problem in ROCT, so I posted an issue here: https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/issues/56. Following the suggestion there, I ran with HSA_ENABLE_SDMA=0, but it did not change anything.
I am quite sure that it is not directly related to OpenCL. I also tried running HIP applications (compiled with an unofficial hipcc that I hacked together), and a similar problem occurs. The thread launching kernels gets blocked here:
#0 0x00007ffff6c3b827 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007ffff66b2c65 in amd::AqlQueue::ExecutePM4(unsigned int*, unsigned long) () from /opt/rocm/lib/libhsa-runtime64.so.1
#2 0x00007ffff66a87e8 in amd::GpuAgent::InvalidateCodeCaches() () from /opt/rocm/lib/libhsa-runtime64.so.1
#3 0x00007ffff66b50be in amd::LoaderContext::SegmentAlloc(amdgpu_hsa_elf_segment_t, hsa_agent_s, unsigned long, unsigned long, bool) ()
from /opt/rocm/lib/libhsa-runtime64.so.1
#4 0x00007ffff66eed15 in amd::hsa::loader::ExecutableImpl::LoadSegmentsV2(hsa_agent_s, amd::hsa::code::AmdHsaCode const*) () from /opt/rocm/lib/libhsa-runtime64.so.1
#5 0x00007ffff66ef40e in amd::hsa::loader::ExecutableImpl::LoadSegments(hsa_agent_s, amd::hsa::code::AmdHsaCode const*, unsigned int) ()
from /opt/rocm/lib/libhsa-runtime64.so.1
#6 0x00007ffff66f24f3 in amd::hsa::loader::ExecutableImpl::LoadCodeObject(hsa_agent_s, hsa_code_object_s, unsigned long, char const*, hsa_loaded_code_object_s*) ()
from /opt/rocm/lib/libhsa-runtime64.so.1
#7 0x00007ffff66c73e7 in HSA::hsa_executable_load_agent_code_object(hsa_executable_s, hsa_agent_s, hsa_code_object_reader_s, char const*, hsa_loaded_code_object_s*) ()
from /opt/rocm/lib/libhsa-runtime64.so.1
#8 0x00007ffff742f845 in roc::LightningProgram::setKernels(amd::option::Options*, void*, unsigned long) () from /opt/rocm-3.5.1/hip/lib/libamdhip64.so.3
#9 0x00007ffff73d8d45 in device::Program::linkImplLC(amd::option::Options*) () from /opt/rocm-3.5.1/hip/lib/libamdhip64.so.3
#10 0x00007ffff73d9975 in device::Program::build(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, amd::option::Options*) () from /opt/rocm-3.5.1/hip/lib/libamdhip64.so.3
#11 0x00007ffff73ec5fc in amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*, bool) ()
from /opt/rocm-3.5.1/hip/lib/libamdhip64.so.3
#12 0x00007ffff7377e15 in PlatformState::getFunc(void const*, int) () from /opt/rocm-3.5.1/hip/lib/libamdhip64.so.3
#13 0x00007ffff7380653 in hipLaunchKernel () from /opt/rocm-3.5.1/hip/lib/libamdhip64.so.3
#14 0x0000000000400f8e in main ()
Thread 2 in gdb shows the same trace as OpenCL.
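For anyone trying to reproduce this, the traces of all threads can be captured from a hung process in one step instead of an interactive gdb session. Below is a minimal Python sketch, assuming gdb is on the PATH and you are allowed to attach to the process (this may need root or relaxed ptrace settings). The helper names `gdb_backtrace_cmd` and `dump_backtraces` are made up for illustration.

```python
import subprocess


def gdb_backtrace_cmd(pid):
    """Build a gdb command that prints the backtrace of every thread of `pid`."""
    return [
        "gdb", "-p", str(pid),
        "-batch",                      # run the given commands, then exit
        "-ex", "thread apply all bt",  # one backtrace per thread
    ]


def dump_backtraces(pid, timeout_s=60.0):
    """Attach to a (possibly hung) process and return gdb's output as text."""
    result = subprocess.run(
        gdb_backtrace_cmd(pid),
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.stdout
```

Calling `dump_backtraces(pid)` with the PID of the blocked OpenCL or HIP process should give output comparable to the traces above, with the main thread and the AsyncEventsLoop thread listed together.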
When I run HIP applications, they sometimes get past this point but block at hipDeviceSynchronize() instead. Thread 2 shows exactly the same trace. However, if the first hipDeviceSynchronize() call returns, all subsequent hipDeviceSynchronize() calls return as well.
To conclude,
- Thread 2 is waiting for a signal but never receives it.
- The main thread gets blocked only because it does not get an update from thread 2.
- The same problem can happen with OpenCL or HIP, when creating a command queue, launching a kernel, or synchronizing the device.
- The problem is random. Sometimes the program (both OpenCL and HIP) finishes and returns correct results.
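Since the hang is intermittent, it may help to measure how often it happens under each setting. The following is a rough Python sketch, not something I have run on the affected machine: it launches a command repeatedly with a timeout and treats a timed-out run as a hang. The function name `count_hangs`, the 30-second timeout, and the path `./opencl_test` are assumptions for illustration. The environment variables are the ones mentioned above.

```python
import os
import subprocess


def count_hangs(cmd, runs=20, timeout_s=30.0, env_overrides=None):
    """Run `cmd` `runs` times and return (finished, hung, failed) counts.

    A run counts as hung when it exceeds `timeout_s` and has to be killed.
    """
    env = dict(os.environ)
    env.update(env_overrides or {})
    finished = hung = failed = 0
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd, env=env, timeout=timeout_s,
                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            hung += 1  # subprocess.run kills the child before re-raising
            continue
        if result.returncode == 0:
            finished += 1
        else:
            failed += 1
    return finished, hung, failed


# Example (./opencl_test is a hypothetical path to the reproducer):
#   for overrides in ({}, {"HSA_ENABLE_INTERRUPT": "0"}, {"HSA_ENABLE_SDMA": "0"}):
#       print(overrides, count_hangs(["./opencl_test"], env_overrides=overrides))
```

Comparing the hung counts across the three configurations would show whether either variable changes the frequency of the hang, rather than relying on a handful of manual runs.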
Any help in fixing the bug is appreciated. Thank you.
@Kelvin-Ng Apologies for the lack of response. Do you still need assistance with this ticket? If not, please close the ticket. Thanks!
@Kelvin-Ng Closing ticket. Please feel free to re-open ticket if you still need assistance. Thanks!