AMDGPU.jl icon indicating copy to clipboard operation
AMDGPU.jl copied to clipboard

Ensure killqueue!() updated global QUEUES list

Open torrance opened this issue 2 years ago • 3 comments

Previously, a call to killqueue() would not clean up the global QUEUES list. On my device, a new call to Queue() would reuse the same queue pointer, and this would cause Queue() to throw an error since it detected duplicate queue pointers in the QUEUES list. This was causing tests related to queues to fail.

This PR ensures killqueue() will clean up the QUEUES list.

It also adds documentation and cleanup to kill_queues().

torrance avatar Aug 17 '22 03:08 torrance

Please see latest commits. I've centralised all ROCQueue cleanup in kill_queue!(), and the finalizer for ROCQueue now simply calls out to kill_queue(). This ensures cleanup happens correctly either when ROCQueue is GC'ed or the queue encounters an error.

However, I'm concerned about your comment regarding the queue pointer being nulled during this function, and I don't know how to handle that without knowing more.

torrance avatar Aug 19 '22 03:08 torrance

Ok, here's all the mechanisms in play that I think we need to consider:

  • Kernel launch needs to query and access the queue, but can yield indefinitely due to the queue becoming full, so it's possible that the queue has entered an error (not yet dead) state by the time we yield back. We therefore need a fast check for queue validity before querying the queue (which we have), and one just before we write our packet to the queue (which we do not yet have). Also, we implicitly preserve the ROCQueue object during launch, so we shouldn't ever see the finalizer concurrently run with kernel launches (and more importantly, the queue pointer should not be re-used during that time).
  • The queue error monitor (with HSA calling queue_error_handler and Julia subsequently returning from https://github.com/JuliaGPU/AMDGPU.jl/blob/287712e571cbac0c58db3debe702cff50ed95f6e/src/queue.jl#L55) can be triggered at any time, and does not indicate that the queue is destroyed; that only happens once HSA.queue_destroy is called. N.B. We cannot take locks in queue_error_handler, as it runs on another thread.
  • The queue error monitor queries QUEUES as a means to prevent the task from inadvertently preserving the ROCQueue object beyond its useful lifetime. Therefore, the lifetime of QUEUES entries must be explicitly managed to correlate with the lifetime of the associated ROCQueue object; the most obvious way to do this is to only allow the finalizer to delete the slot for its queue.

jpsamaroo avatar Sep 12 '22 17:09 jpsamaroo

Also, very importantly:

  • HSA.queue_destroy means that the queue is truly dead and its pointer can be re-used, so we better not have that pointer be accessed anywhere by the time that call happens.

jpsamaroo avatar Sep 12 '22 17:09 jpsamaroo

One situation that causes this issue and is present in the tests as well is the following:

  1. We are launching kernel and it also creates default queue which is added both to QUEUES & DEFAULT_QUEUES.
  2. While waiting on it, this kernel timeouts, sending SignalTimeoutException. We kill the respective queue, which removes it from DEFAULT_QUEUES, but not QUEUES.
  3. Then another test creates several queues in a row, failing on the first, because HSA.create_queue created queue with the same pointer as in QUEUES. We fail.
  4. Only sometime after the finalizer is called for the queue created at 1. step, removing it from QUEUES.

So we kill queue, but it is not removed from QUEUES until finalizer is called. Is there a situation where we might need hard killed queue to remain in QUEUES? If not, then it is probably safe to remove it from QUEUES as we kill it.

Here's example from the logs (hard kill indicating that HSA.queue_destroy was called)

[ Info: [queue] Getting default queue
[ Info: [queue] creation
[ Info: [queue] Getting default queue
[ Info: [queue] killing queue, force=false, active=true
[ Info: [queue] hard kill
[ Info: [queue] creation
Priorities: Error During Test at /home/pxl-th/.julia/dev/AMDGPU/test/hsa/queue.jl:2
  Got exception outside of a @test
  AssertionError: !(haskey(QUEUES, queue.queue))
  Stacktrace:
    [1] (::AMDGPU.Runtime.var"#19#23")()
      @ AMDGPU.Runtime ~/.julia/dev/AMDGPU/src/queue.jl:51
    [2] lock(f::AMDGPU.Runtime.var"#19#23", l::ReentrantLock)
      @ Base ./lock.jl:229
    [3] ROCQueue(device::ROCDevice; priority::Symbol)
      @ AMDGPU.Runtime ~/.julia/dev/AMDGPU/src/queue.jl:50
    [4] macro expansion
      @ ~/.julia/dev/AMDGPU/test/hsa/queue.jl:6 [inlined]
    [5] macro expansion
      @ ~/bin/julia-latest/share/julia/stdlib/v1.9/Test/src/Test.jl:1493 [inlined]
    [6] macro expansion
      @ ~/.julia/dev/AMDGPU/test/hsa/queue.jl:3 [inlined]
    [7] macro expansion
      @ ~/bin/julia-latest/share/julia/stdlib/v1.9/Test/src/Test.jl:1493 [inlined]
    [8] top-level scope
      @ ~/.julia/dev/AMDGPU/test/hsa/queue.jl:2
    [9] include(fname::String)
      @ Base.MainInclude ./client.jl:478
   [10] macro expansion
      @ ~/.julia/dev/AMDGPU/test/runtests.jl:61 [inlined]
   [11] macro expansion
      @ ~/bin/julia-latest/share/julia/stdlib/v1.9/Test/src/Test.jl:1493 [inlined]
   [12] macro expansion
      @ ~/.julia/dev/AMDGPU/test/runtests.jl:58 [inlined]
   [13] macro expansion
      @ ~/bin/julia-latest/share/julia/stdlib/v1.9/Test/src/Test.jl:1493 [inlined]
   [14] top-level scope
      @ ~/.julia/dev/AMDGPU/test/runtests.jl:54
   [15] include(fname::String)
      @ Base.MainInclude ./client.jl:478
   [16] top-level scope
      @ none:6
   [17] eval
      @ ./boot.jl:370 [inlined]
   [18] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:280
   [19] _start()
      @ Base ./client.jl:522
Test Summary:             | Pass  Error  Total   Time
AMDGPU                    |    2      1      3  19.5s
  HSA                     |    2      1      3  19.5s
    HSA Status Error      |    1             1   0.0s
    HSA Async Queue Error |    1             1  18.4s
    Queues                |           1      1   0.7s
      Priorities          |           1      1   0.7s
  Device Functions        |               None   0.0s
ERROR: LoadError: Some tests did not pass: 2 passed, 0 failed, 1 errored, 0 broken.
in expression starting at /home/pxl-th/.julia/dev/AMDGPU/test/runtests.jl:49
[ Info: [queue] Finalizer

pxl-th avatar Sep 22 '22 21:09 pxl-th

Closing as this seems to be fixed in #306; thanks for looking into this @torrance !

jpsamaroo avatar Nov 08 '22 15:11 jpsamaroo