ROCR-Runtime
Implement timeout in case of unresponsive GPU
Some faulty kernels can cause the GPU to behave in ways the runtime does not expect (for example, failing to respond in a reasonable amount of time).
Currently, the runtime will loop until the GPU responds:
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/c5f95f9b33af2aa1dd1e6ba76b18cd2e291f3c7d/src/core/runtime/amd_aql_queue.cpp#L930
This is related to https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/96 (infinite loop in ExecutePM4).
What you are proposing is like making pthread_mutex_lock time out after some arbitrary time to work around a deadlock in an application. Kernels can run for an arbitrary length of time. They can even run "forever" while waiting for signals from the host or another GPU. It is not for ROCr to decide whether a signal is not coming because the program is broken or because it needs more time. If the timeout is due to a GPU hang, it will be detected by the kernel mode driver and the GPU will be reset, which usually terminates the application. If the timeout is due to a deadlock in the application, that needs to be debugged and addressed in the application, not in ROCr.
Finally, if the timeout is due to a deadlock in the OpenCL or HIP language runtime code, it needs to be addressed there.
Thank you for your answer, it is very much appreciated.
My suggestion seems a little bit silly when taking into account your informed point of view.
The context of this infinite-loop behaviour is the following: it seems that s_call_b64 using an unavailable register pair causes the GPU to become unresponsive.
Just to emphasize the backtrace:
#1 amd::AqlQueue::ExecutePM4(...)
#2 amd::GpuAgent::InvalidateCodeCaches()
#3 amd::LoaderContext::SegmentAlloc(...)
#4 amd::hsa::loader::ExecutableImpl::LoadSegmentsV2(...)
#5 amd::hsa::loader::ExecutableImpl::LoadSegments(...)
#6 amd::hsa::loader::ExecutableImpl::LoadCodeObject(...)
#7 HSA::hsa_executable_load_agent_code_object(...)
------------ ^^^ ROCr ^^^^ ------------
#8 roc::LightningProgram::setKernels(...)
#9 device::Program::linkImplLC(...)
#10 device::Program::build(...)
#11 amd::Program::build(...)
------------ ^^^ ROCclr ^^^ ------------
#12 PlatformState::getFunc(...)
#13 hipLaunchKernel(...)
------------ ^^^ hipamd ^^^ ------------
Other remarks:
- ExecutePM4 can only be reached by calling LoadCodeObject
- LoadCodeObject is only called in hsa_executable_load_program_code_object (and in another, deprecated function)
- SegmentAlloc is used to allocate memory to contain the program code. So this blocking behaviour occurs before the kernel launch command is enqueued.
I am not familiar with the data structures of ROCr, but this does not seem to be related to a lock or signal that is visible to the user. It is ROCr itself, internally, that waits for the cache-invalidation PM4 packet to be processed. That never happens, so the user application hangs.
What can we do to keep the user from being left perplexed? Would it be a good idea to print a message every now and then so the user can detect the situation?
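To make the "print a message every now and then" idea concrete, here is a minimal sketch. It is not ROCr code: WaitWithWarnings, the completion flag, and the warn callback are hypothetical names chosen for illustration, and the real wait in ExecutePM4 may be structured differently. The point is only that the wait never gives up, so a legitimately long operation is unaffected, while the caller still gets periodic feedback.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of ROCr: polls `done` until it becomes true
// and calls `warn` each time another `interval` has elapsed. There is no
// timeout; the function only returns once `done` is set.
// Returns the number of warnings that were emitted.
inline int WaitWithWarnings(
    const std::atomic<bool>& done,
    std::chrono::milliseconds interval,
    const std::function<void(std::chrono::milliseconds)>& warn) {
  assert(interval.count() > 0);  // a zero interval would warn on every poll

  using Clock = std::chrono::steady_clock;
  const auto start = Clock::now();
  auto next_warning = start + interval;
  int warnings = 0;

  while (!done.load(std::memory_order_acquire)) {
    const auto now = Clock::now();
    if (now >= next_warning) {
      // Report how long we have been waiting so far.
      warn(std::chrono::duration_cast<std::chrono::milliseconds>(now - start));
      ++warnings;
      next_warning = now + interval;
    }
    // Back off briefly instead of spinning at full speed.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return warnings;
}
```

A caller could pass a callback that writes something like "still waiting for the GPU after N ms" to stderr. Whether such a message belongs in ROCr or in the language runtime above it is a separate question.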
Probably ROCr isn't handling the HW_EXCEPTION events KFD sends to notify it of the GPU reset. That's assuming that GPU reset is enabled when KFD detects a hang. You should see some evidence of that in the kernel log (dmesg).
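As a rough aid for checking the kernel log, the sketch below filters captured log text for lines that contain every given keyword, ignoring case. FilterLog is a hypothetical helper, and the choice of keywords (for example "amdgpu" together with "reset") is an assumption: the exact wording of the driver's messages varies between kernel versions, so treat any match as a hint to read the surrounding log, not as proof of a reset.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Returns a lower-cased copy so that matching is case-insensitive.
inline std::string ToLower(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Hypothetical helper: returns the lines of `log_text` that contain ALL of
// the given keywords. The keywords are supplied by the caller; nothing here
// knows the actual message format of the driver.
inline std::vector<std::string> FilterLog(
    const std::string& log_text, const std::vector<std::string>& keywords) {
  std::vector<std::string> hits;
  std::istringstream log(log_text);
  std::string line;
  while (std::getline(log, line)) {
    const std::string lowered = ToLower(line);
    const bool all_present = std::all_of(
        keywords.begin(), keywords.end(), [&](const std::string& k) {
          return lowered.find(ToLower(k)) != std::string::npos;
        });
    if (all_present) hits.push_back(line);
  }
  return hits;
}
```

The input would be the saved output of dmesg. If no line matches, it may simply mean that GPU reset is not enabled or that the messages are worded differently on that kernel.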