ROCm-OpenCL-Runtime
[RFC] atomic operation support on 32-bit floating point
Ideally, I would like a native atomic_add() for 32-bit floats; I'm assuming the hardware can resolve conflicting updates itself.
As far as I know, AMD GPUs do not support atomic floating point math operations. We support atomic_add() on integer types, but currently do not natively support it on FP types.
That being said, you can emulate such support by using an atomic_cmpxchg loop. See, for instance, this implementation of atomic_fp_add() in clSPARSE. The general idea is to load the value from memory, do the addition locally, and then attempt a cmpxchg() that stores the result only if memory still holds the value you loaded. If another work-item changed the value in the meantime, the cmpxchg() fails and you try again.
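To make the retry loop described above concrete, here is a minimal sketch of the same pattern written as host-side C11 rather than OpenCL C, so it can be compiled and run anywhere. It is not the clSPARSE code: the names atomic_fp_add_sketch, float_to_bits, and bits_to_float are made up for illustration. In an OpenCL 1.x kernel the compare-exchange step would presumably be atomic_cmpxchg() on the value's unsigned int bit pattern instead of atomic_compare_exchange_weak().

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: reinterpret a float's bits as a 32-bit integer and
 * back. memcpy avoids undefined behavior from pointer type punning. */
static uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Hypothetical sketch of an emulated floating-point atomic add.
 * The float is stored as its 32-bit pattern in *addr so that the integer
 * compare-exchange can be used. Returns the value held before the add. */
static float atomic_fp_add_sketch(_Atomic uint32_t *addr, float operand)
{
    /* Step 1: load the current contents. */
    uint32_t expected = atomic_load(addr);
    uint32_t desired;

    do {
        /* Step 2: do the addition locally on the value we observed. */
        desired = float_to_bits(bits_to_float(expected) + operand);
        /* Step 3: store the sum only if *addr still holds `expected`.
         * On failure, `expected` is refreshed with the current contents
         * and the loop recomputes the sum from that newer value. */
    } while (!atomic_compare_exchange_weak(addr, &expected, desired));

    return bits_to_float(expected);
}
```

Comparing bit patterns rather than float values is deliberate: a float comparison would never succeed if the stored value were NaN, whereas the integer compare does. A caller would initialize the cell with float_to_bits(initial_value) and read the result back with bits_to_float(atomic_load(&cell)).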
Note that this can hurt performance pretty badly when there is a lot of contention on a variable. The critical path (global load, FP add, global atomic cmpxchg) is very long, so contended variables can end up spin-looping for quite a while. However, we've found that it performs well enough in the lightly-contended cases we've seen in clSPARSE.
Thanks for the feedback. I'm familiar with the cmpxchg-based implementation as an option, but as you note, it performs poorly under contention -- and in many algorithms the contention gets worse as you strong-scale. Additionally, even in latency-bound kernels with only moderate contention, not being able to issue a single instruction and have the hardware resolve conflicts without stalling the instruction stream is an issue.
Is cl_khr_int64_base_atomics supported on current hardware?
Yes, 64-bit integer atomics (both cl_khr_int64_base_atomics and cl_khr_int64_extended_atomics) are supported on every GPU that ROCm supports, as far as I know. We found some platforms from other vendors that did not support 64-bit atomics in OpenCL, which is why the clSPARSE code I linked includes a lot of ifdef code to work around the case where 64-bit integer atomics are not supported. This was because we were striving to write code that worked for everyone, not just AMD hardware.
Note that, like any 64-bit operation, 64-bit atomics may be slower than 32-bit, by various amounts, depending on the hardware.
Rest assured, your request for hardware support has been noted (by me, anyway). I have a few software cases (e.g. in my clSPARSE code) that would benefit from this support as well. Hardware design decisions, of course, need to trade off added design complexity, verification overhead, hardware area, and power with the possibility of making some meaningful amount of code faster by a useful amount.
I cannot speak for whether or not future AMD hardware will have this support, but rest assured, we're not ignoring you. :)
In addition, if you have more information about the application(s) you believe would benefit from this (and any rough estimates about how much you believe they would benefit), that can't hurt to include.
Thanks a lot for the feedback!
One last thing I was wondering about, which maybe you can help me with or point me to information on: I assume atomic ops are handled in hardware (at or close to the L2, like other vendors)? For atomics on supported data types, what is the throughput for collision-free and colliding updates? What about the latency?