AMDGPU.jl
AMDGPU.jl copied to clipboard
Ensure killqueue!() updated global QUEUES list
Previously, a call to killqueue()
would not clean up the global QUEUES
list. On my device, a new call to Queue()
would reuse the same queue pointer, and this would cause Queue()
to throw an error since it detected duplicate queue pointers in the QUEUES
list. This was causing tests related to queues to fail.
This PR ensures killqueue()
will clean up the QUEUES
list.
It also adds documentation and cleanup to kill_queues()
.
Please see latest commits. I've centralised all ROCQueue
cleanup in kill_queue!()
, and the finalizer for ROCQueue now simply calls out to kill_queue()
. This ensures cleanup happens correctly either when ROCQueue is GC'ed or the queue encounters an error.
However, I'm concerned about your comment regarding the queue pointer being nulled during this function, and I don't know how to handle that without knowing more.
Ok, here's all the mechanisms in play that I think we need to consider:
- Kernel launch needs to query and access the queue, but can yield indefinitely due to the queue becoming full, so it's possible that the queue has entered an error (not yet dead) state by the time we yield back. We therefore need a fast check for queue validity before querying the queue (which we have), and one just before we write our packet to the queue (which we do not yet have). Also, we implicitly preserve the
ROCQueue
object during launch, so we shouldn't ever see the finalizer concurrently run with kernel launches (and more importantly, the queue pointer should not be re-used during that time). - The queue error monitor (with HSA calling
queue_error_handler
and Julia subsequently returning from https://github.com/JuliaGPU/AMDGPU.jl/blob/287712e571cbac0c58db3debe702cff50ed95f6e/src/queue.jl#L55) can be triggered at any time, and does not indicate that the queue is destroyed; that only happens onceHSA.queue_destroy
is called. N.B. We cannot take locks inqueue_error_handler
, as it runs on another thread. - The queue error monitor queries
QUEUES
as a means to prevent the task from inadvertently preserving theROCQueue
object beyond its useful lifetime. Therefore, the lifetime ofQUEUES
entries must be explicitly managed to correlate with the lifetime of the associatedROCQueue
object; the most obvious way to do this is to only allow the finalizer to delete the slot for its queue.
Also, very importantly:
-
HSA.queue_destroy
means that the queue is truly dead and its pointer can be re-used, so we better not have that pointer be accessed anywhere by the time that call happens.
One situation that causes this issue and is present in the tests as well is the following:
- We are launching kernel and it also creates default queue which is added both to
QUEUES
&DEFAULT_QUEUES
. - While waiting on it, this kernel timeouts, sending
SignalTimeoutException
. We kill the respective queue, which removes it fromDEFAULT_QUEUES
, but notQUEUES
. - Then another test creates several queues in a row, failing on the first, because
HSA.create_queue
created queue with the same pointer as inQUEUES
. We fail. - Only sometime after the finalizer is called for the queue created at 1. step, removing it from
QUEUES
.
So we kill queue, but it is not removed from QUEUES
until finalizer is called.
Is there a situation where we might need hard killed queue to remain in QUEUES
?
If not, then it is probably safe to remove it from QUEUES
as we kill it.
Here's example from the logs (hard kill indicating that HSA.queue_destroy
was called)
[ Info: [queue] Getting default queue
[ Info: [queue] creation
[ Info: [queue] Getting default queue
[ Info: [queue] killing queue, force=false, active=true
[ Info: [queue] hard kill
[ Info: [queue] creation
Priorities: Error During Test at /home/pxl-th/.julia/dev/AMDGPU/test/hsa/queue.jl:2
Got exception outside of a @test
AssertionError: !(haskey(QUEUES, queue.queue))
Stacktrace:
[1] (::AMDGPU.Runtime.var"#19#23")()
@ AMDGPU.Runtime ~/.julia/dev/AMDGPU/src/queue.jl:51
[2] lock(f::AMDGPU.Runtime.var"#19#23", l::ReentrantLock)
@ Base ./lock.jl:229
[3] ROCQueue(device::ROCDevice; priority::Symbol)
@ AMDGPU.Runtime ~/.julia/dev/AMDGPU/src/queue.jl:50
[4] macro expansion
@ ~/.julia/dev/AMDGPU/test/hsa/queue.jl:6 [inlined]
[5] macro expansion
@ ~/bin/julia-latest/share/julia/stdlib/v1.9/Test/src/Test.jl:1493 [inlined]
[6] macro expansion
@ ~/.julia/dev/AMDGPU/test/hsa/queue.jl:3 [inlined]
[7] macro expansion
@ ~/bin/julia-latest/share/julia/stdlib/v1.9/Test/src/Test.jl:1493 [inlined]
[8] top-level scope
@ ~/.julia/dev/AMDGPU/test/hsa/queue.jl:2
[9] include(fname::String)
@ Base.MainInclude ./client.jl:478
[10] macro expansion
@ ~/.julia/dev/AMDGPU/test/runtests.jl:61 [inlined]
[11] macro expansion
@ ~/bin/julia-latest/share/julia/stdlib/v1.9/Test/src/Test.jl:1493 [inlined]
[12] macro expansion
@ ~/.julia/dev/AMDGPU/test/runtests.jl:58 [inlined]
[13] macro expansion
@ ~/bin/julia-latest/share/julia/stdlib/v1.9/Test/src/Test.jl:1493 [inlined]
[14] top-level scope
@ ~/.julia/dev/AMDGPU/test/runtests.jl:54
[15] include(fname::String)
@ Base.MainInclude ./client.jl:478
[16] top-level scope
@ none:6
[17] eval
@ ./boot.jl:370 [inlined]
[18] exec_options(opts::Base.JLOptions)
@ Base ./client.jl:280
[19] _start()
@ Base ./client.jl:522
Test Summary: | Pass Error Total Time
AMDGPU | 2 1 3 19.5s
HSA | 2 1 3 19.5s
HSA Status Error | 1 1 0.0s
HSA Async Queue Error | 1 1 18.4s
Queues | 1 1 0.7s
Priorities | 1 1 0.7s
Device Functions | None 0.0s
ERROR: LoadError: Some tests did not pass: 2 passed, 0 failed, 1 errored, 0 broken.
in expression starting at /home/pxl-th/.julia/dev/AMDGPU/test/runtests.jl:49
[ Info: [queue] Finalizer
Closing as this seems to be fixed in #306; thanks for looking into this @torrance !