Access to cross-lane operations with OpenCL extensions
Hi,
Intel has a very useful extension, cl_intel_subgroups, which lets the work items of a subgroup (a wavefront) shuffle values among themselves, perform reductions, etc.
According to https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ recent AMD hardware can do the same, and even better.
I know this functionality is available via HSA or inline assembly, but AMD does not support any OpenCL extension for it. Assembly is not a good solution for an OpenCL developer, as it might need to be updated for new cards or for bug workarounds. Please make it an extension!
Features I'd like to have: shuffle and fine-grained reduction operations. For example, a reduction among work items 0, 8, 16, etc., and among 1, 9, 17, etc. (you get the idea), or a reduction among 0-7, 8-15, etc. This type of fine-grained reduction would be very useful. Going through LDS is possible, but a reduction then needs several LDS reads, and using the cross-lane operations would be much faster.
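To make the two reduction patterns above concrete, here is a minimal host-side sketch in plain C that simulates them with a butterfly (XOR) exchange. It is an illustration only, not an existing AMD or OpenCL API: the names `butterfly_step`, `reduce_contiguous8` and `reduce_strided8` are hypothetical, and the 64-lane wavefront width is an assumption. On real hardware, the `lanes[i ^ mask]` read is the step that would map to a cross-lane operation (for instance `intel_sub_group_shuffle_xor` under cl_intel_subgroups) instead of a round trip through LDS.

```c
#define WAVE_SIZE 64 /* assumed wavefront width; not stated in this thread */

/* One simulated cross-lane step: every lane adds the value held by lane (i ^ mask).
 * The temporary copy models all lanes reading before any lane writes. */
void butterfly_step(float lanes[WAVE_SIZE], int mask)
{
    float tmp[WAVE_SIZE];
    for (int i = 0; i < WAVE_SIZE; ++i)
        tmp[i] = lanes[i] + lanes[i ^ mask];
    for (int i = 0; i < WAVE_SIZE; ++i)
        lanes[i] = tmp[i];
}

/* Sum within contiguous groups of 8 lanes: 0-7, 8-15, ...
 * Masks 1, 2, 4 only mix lanes that share the same upper lane-index bits. */
void reduce_contiguous8(float lanes[WAVE_SIZE])
{
    for (int mask = 1; mask < 8; mask <<= 1)
        butterfly_step(lanes, mask);
}

/* Sum across lanes with stride 8: {0, 8, 16, ...}, {1, 9, 17, ...}, ...
 * Masks 8, 16, 32 only mix lanes that share the same low 3 lane-index bits. */
void reduce_strided8(float lanes[WAVE_SIZE])
{
    for (int mask = 8; mask < WAVE_SIZE; mask <<= 1)
        butterfly_step(lanes, mask);
}
```

After either function returns, every lane holds the sum of its own group. If lane i is seeded with the value i, `reduce_contiguous8` leaves 28 in lanes 0-7 and 92 in lanes 8-15, while `reduce_strided8` leaves 224 in lanes 0, 8, 16, ... and 232 in lanes 1, 9, 17, ... Each reduction takes three exchange steps here, which is the work one would hope to do without touching LDS.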
There is an AMD extension for that; if I recall correctly, it was in the 2.7 branch of the AMD SDK.
If there used to be such an extension, it doesn't seem to be there anymore (and I was unable to find any info on it).
Thanks @axeldavy for reaching out. I will check with OpenCL team and get back to you asap. Thank you.
Is this still an issue? If not, can we please close it?
To the best of my knowledge, this is still an issue. Yours.
Thanks for the reply!
@ROCmSupport Have we got a response from the OpenCL team? If so, what was their response? Also, please advise on next steps. Thanks!
@axeldavy, I have reached out to the internal team for feedback. Extending OpenCL is on their TODO list, but at a low priority. We are keeping this ticket open for now and will revisit it in 2024 Q2. Thanks.
@axeldavy, unfortunately, extending OpenCL is still a low priority. We will keep this ticket open and will revisit the priority in 2025. Thanks.
Damn. Things like this, almost a decade later, are what motivate people to go to other vendors. Cool algorithms would take advantage of this, such as the single-kernel scan for GPUs without forward progress guarantees, from a paper released a few weeks ago.