
Implement timeout in case of unresponsive GPU

Open v01dXYZ opened this issue 2 years ago • 5 comments

Some faulty kernels can cause the GPU to stop behaving as the runtime expects, for example by failing to respond within a reasonable amount of time.

Currently, the runtime will loop until the GPU responds:

https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/c5f95f9b33af2aa1dd1e6ba76b18cd2e291f3c7d/src/core/runtime/amd_aql_queue.cpp#L930
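To make the request concrete, here is a minimal sketch of a bounded wait. It is not ROCr code: `WaitWithTimeout` and the `done` flag are hypothetical names standing in for whatever completion condition the runtime polls, and the timeout value is left to the caller.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical helper, NOT part of ROCr: poll a completion flag until it is
// set or until `timeout` has elapsed. Returns true if the flag was observed
// set, false if the deadline passed first.
inline bool WaitWithTimeout(const std::atomic<bool>& done,
                            std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!done.load(std::memory_order_acquire)) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // Give up and let the caller decide what to do.
    }
    std::this_thread::yield();  // Let other threads run between polls.
  }
  return true;
}
```

A caller would treat a `false` return as "the device did not answer in time" and report an error instead of spinning forever.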

v01dXYZ avatar Apr 28 '22 03:04 v01dXYZ

This is related to https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/96 (infinite loop in ExecutePM4).

v01dXYZ avatar May 15 '22 18:05 v01dXYZ

What you are proposing is like making pthread_mutex_lock time out after some arbitrary time to work around a deadlock in an application. Kernels can run for an arbitrary length of time. They can even run "forever" while waiting for signals from the host or another GPU. It is not for ROCr to decide whether a signal is not coming because the program is broken or because it needs more time. If the timeout is due to a GPU hang, it will be detected by the kernel mode driver and the GPU will be reset, which usually terminates the application. If the timeout is due to a deadlock in the application, that needs to be debugged and addressed in the application, not in ROCr.

Finally, if the timeout is due to a deadlock in the OpenCL or HIP language runtime code, it needs to be addressed there.

fxkamd avatar May 16 '22 14:05 fxkamd

Thank you for your answer, it is very much appreciated.

My suggestion seems a little bit silly when taking into account your informed point of view.

The context of this infinite loop behaviour is the following: it seems that an s_call_b64 instruction using an unavailable register pair makes the GPU unresponsive.

To highlight the backtrace:

#1 amd::AqlQueue::ExecutePM4(...)
#2 amd::GpuAgent::InvalidateCodeCaches()
#3 amd::LoaderContext::SegmentAlloc(...)
#4 amd::hsa::loader::ExecutableImpl::LoadSegmentsV2(...)
#5 amd::hsa::loader::ExecutableImpl::LoadSegments(...)
#6 amd::hsa::loader::ExecutableImpl::LoadCodeObject(...)
#7 HSA::hsa_executable_load_agent_code_object(...)
 ------------ ^^^ ROCr ^^^^ ------------ 
#8  roc::LightningProgram::setKernels(...)
#9  device::Program::linkImplLC(...)
#10 device::Program::build(...)
#11 amd::Program::build(...)
 ------------ ^^^ ROCclr ^^^ ------------ 
#12 PlatformState::getFunc(...)
#13 hipLaunchKernel(...)
 ------------ ^^^ hipamd ^^^ ------------ 

Other remarks:

  • ExecutePM4 can only be reached by calling LoadCodeObject
  • LoadCodeObject is only called in hsa_executable_load_program_code_object (and in another, deprecated function)
  • SegmentAlloc is used to allocate memory to contain the program code. So this blocking behaviour occurs before the kernel launch command is enqueued.

I am not familiar with the data structures of ROCr, but this does not seem to be related to a user-visible lock or signal. It is internal to ROCr, which is waiting for the cache invalidation PM4 packet to be processed. That never happens, so the user application hangs.

What can we do to keep the user from being left perplexed? Would it be a good idea to print a message every now and then so the user can detect the situation?
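As a rough illustration of that idea, the sketch below keeps waiting indefinitely but invokes a callback each time another interval passes without completion. It is not ROCr code: `WaitWithStallReports`, the `done` flag, and the `on_stall` callback are all hypothetical names, and the reporting interval is an arbitrary choice left to the caller.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, NOT part of ROCr: wait until `done` is set, without
// ever giving up, but call `on_stall` with the elapsed time each time another
// `interval` passes without completion. Returns how many times `on_stall`
// was called.
inline int WaitWithStallReports(
    const std::atomic<bool>& done, std::chrono::milliseconds interval,
    const std::function<void(std::chrono::milliseconds)>& on_stall) {
  using Clock = std::chrono::steady_clock;
  const auto start = Clock::now();
  auto next_report = start + interval;
  int reports = 0;
  while (!done.load(std::memory_order_acquire)) {
    const auto now = Clock::now();
    if (now >= next_report) {
      // Tell the caller how long the wait has lasted so far.
      on_stall(std::chrono::duration_cast<std::chrono::milliseconds>(now - start));
      ++reports;
      next_report = now + interval;
    }
    std::this_thread::yield();  // Let other threads run between polls.
  }
  return reports;
}
```

The callback could, for instance, write a warning to stderr saying how long the runtime has been waiting. This keeps the semantics of an unbounded wait and only adds visibility.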

v01dXYZ avatar May 16 '22 22:05 v01dXYZ

Probably ROCr isn't handling the HW_EXCEPTION events KFD sends to notify it of the GPU reset. That's assuming that GPU reset is enabled when KFD detects a hang. You should see some evidence of that in the kernel log (dmesg).

fxkamd avatar May 17 '22 14:05 fxkamd