ROCm-Device-Libs Request: Extended hsa signal support

Hi, as stated here, HSA signals are supported at a minimum via "ockl". Are there plans to extend the signal support? In particular, I would be interested in the "hsa_signal_wait" operations.

May 14 '19 07:05 koerberm

Could you provide any information on this @skeelyamd?

May 28 '19 07:05 koerberm

Adding @b-sumner.

Signal wait from a kernel is a little tricky. There is no way to release shader resources while waiting, and you have to be careful of forward progress and concurrency rules. In general, waiting on signal stores that depend on the same GPU can be problematic (i.e., deadlock). Waiting on stores from a different GPU or the CPU won't deadlock, but you have to be certain that you don't implicitly wait on an action which must run on the waiting GPU.

May 28 '19 21:05 skeelyamd

I agree with @skeelyamd; there are a lot of situations that won't work as expected for quite a variety of reasons. What are some use cases you have in mind? How many work-items in a wavefront are likely to be waiting at the same time, and on how many signals? How will you ensure that any data the signal is announcing is actually properly visible to the waiting threads?

May 28 '19 22:05 b-sumner

I did forget to mention the memory model issues. Signals are always in system fine grain memory so can be system scope acq/rel from within a kernel. Coarse grain memory can't do system scope acq/rel from a kernel (any system scope acq/rel fence you write in a kernel is weakened to agent scope for coarse grain allocations). The only data you can acq/rel within a kernel is data that has been placed in a fine grain allocation.

This isn't to say that we shouldn't add signal wait APIs. Just that using them correctly requires being rather careful with a lot of low level details that most language layers abstract away.

May 28 '19 22:05 skeelyamd
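To make the acquire/release point above concrete, here is a minimal host-side sketch in C11. It is an analogy, not device code: `mailbox_t`, `mailbox_publish`, and `mailbox_try_consume` are hypothetical names, and an `_Atomic int64_t` stands in for an HSA signal in fine-grained memory. The sketch only shows the ordering pattern: the writer stores the data and then release-stores the signal; the reader acquire-loads the signal before touching the data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mailbox: a payload plus a flag that announces it.
 * The atomic flag plays the role of the HSA signal value. */
typedef struct {
    int64_t payload;         /* plain data announced by the signal */
    _Atomic int64_t signal;  /* 0 = empty, 1 = payload is ready    */
} mailbox_t;

/* Producer side: write the data first, then publish with a release store
 * so the payload write is ordered before the signal becomes visible. */
static void mailbox_publish(mailbox_t *m, int64_t payload) {
    m->payload = payload;
    atomic_store_explicit(&m->signal, 1, memory_order_release);
}

/* Consumer side: acquire-load the signal; only read the payload once the
 * signal says it is ready. Returns false if nothing has been published. */
static bool mailbox_try_consume(mailbox_t *m, int64_t *out) {
    if (atomic_load_explicit(&m->signal, memory_order_acquire) != 1)
        return false;
    *out = m->payload;
    return true;
}
```

On a real system the payload would also have to live in a fine-grained allocation for the system-scope ordering to hold from within a kernel, as described above.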

Thanks for your quick replies @skeelyamd, @b-sumner. I am currently working on a prototype for a stream processing engine (like Apache Flink). The goal is to place certain operators on the CPU and others on the GPU. In particular, I want to use iGPUs (e.g., Raven) as accelerators for certain operations. Data items are passed between those operators via queues, and I want to use signals for announcing new data. The advantage of signals would be that I do not have to explicitly schedule a kernel when new data is available; instead, I launch the kernel once and let it wait for new data via a signal. Since I focus on iGPUs, the memory regions accessed will always be fine grained. Further, I could ensure that the GPU always waits for stores from the CPU and vice versa.

However, in an earlier version, I wrote the GPU operations directly in HSAIL (which provides a signal wait operation) and used the (closed source) finalizer to compile the code. This worked for Kaveri APUs, but since the finalizer was removed from the ROCm stack I tried to port my implementations to OpenCL using HSA extensions and built-ins. From this perspective, I thought it would be relatively easy to provide such an operation in the device libs.

May 29 '19 07:05 koerberm

If I understand correctly, after finishing the current batch of work, your kernel would

work-group-barrier
if ID == 0
    wait-for-signal
work-group-barrier

Is this correct?

For the time being, would you be OK with implementing your wait-for-signal as

while true
    value = __ockl_signal_load(sig, memorder);
    if value indicates ready
        break
    __builtin_amdgcn_s_sleep(1)

May 29 '19 19:05 b-sumner

Hi @b-sumner

this is fine by me for now. The results I achieve with this method look promising. However, it would be good to know if you plan to extend the signal support in the future. Are there any plans to bring this feature to the library?

May 31 '19 12:05 koerberm

The priority for this is low currently and there are plenty of other things to do. The implementation itself appears relatively straightforward, but a thorough set of tests that give high confidence in proper operation in most circumstances is required.

May 31 '19 14:05 b-sumner

Follow-up question on this topic: how does one efficiently wait on a host-written signal from the device? Currently, a sleep-check loop causes such high GPU utilization that subsequent hsa_executable_freeze calls hang (at least on my gfx902 APU), even when sleeping at the longest sleep duration. Is it possible to listen for a host-initiated interrupt on the device? Or is there some other way to implement a more efficient wait loop?

Mar 16 '22 20:03 jpsamaroo

Are you sure it is hanging because of high load? Does it even see the signal the host is posting?

Mar 16 '22 21:03 b-sumner

@b-sumner yes, and it isn't just hsa_executable_freeze; the progress of kernels submitted (on the same queue) after the high-load kernel has started can basically come to a halt. I've confirmed that the high-load task is reading the signals correctly (reordering the high-load kernel to execute after the signals have already been written works fine), and that the writing of signals works properly (they're just kernel completion signals anyway).

Mar 18 '22 14:03 jpsamaroo

There continue to be no plans to wait for HSA signals from device code. Given the age of this issue and the availability of workarounds, I will close this.

Mar 28 '23 01:03 b-sumner
