ROCm Access to cross-lane operations with OpenCL extensions

Hi,

Intel has a very useful extension: cl_intel_subgroups

It lets work items inside a subgroup (a wavefront) shuffle values, perform reduce operations, etc.

According to https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ recent AMD hardware can do the same, and even better.

I know this functionality is available via HSA or inline assembly, but there is no OpenCL extension supported by AMD for that. Assembly is not a good solution for an OpenCL developer, as the assembly might need to be updated for new cards or for bug workarounds. Please make it an extension!

Features I'd like to have: shuffle and fine-grained reduction operations. For example, a reduction among work items 0, 8, 16, etc. and 1, 9, 17, etc. (you get the idea), or a reduction among 0-7, 8-15, etc. This type of fine-grained reduction would be very useful. Going through LDS is possible, but a reduction operation then needs several LDS reads, and using the cross-lane operations would be much faster.
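To make the two reduction patterns concrete, here is a minimal host-side sketch in plain C that emulates them with a butterfly of XOR shuffles. It is only an illustration under assumptions: the wavefront is modelled as an array of 64 lanes, and `shuffle_xor` and `reduce_add_bits` are hypothetical helper names, not part of any AMD or OpenCL API. `shuffle_xor` plays the role that a cross-lane shuffle such as `intel_sub_group_shuffle_xor` from cl_intel_subgroups would play in a kernel.

```c
#include <assert.h>

#define WAVE_SIZE 64u /* assumed wavefront width */

/* Hypothetical stand-in for a cross-lane XOR shuffle:
   every lane reads the value held by lane (lane ^ mask). */
static void shuffle_xor(const int *in, int *out, unsigned mask)
{
    for (unsigned lane = 0; lane < WAVE_SIZE; ++lane)
        out[lane] = in[lane ^ mask];
}

/* Butterfly add-reduction over the lane-index bits selected by `bits`.
   bits = 0x07: each lane ends with the sum of its group 0-7, 8-15, ...
   bits = 0x38: each lane ends with the sum of lanes 0, 8, 16, ...
                (or 1, 9, 17, ... and so on, depending on the lane). */
static void reduce_add_bits(int *vals, unsigned bits)
{
    int tmp[WAVE_SIZE];

    /* Only lane-index bits below WAVE_SIZE are meaningful. */
    assert((bits & ~(WAVE_SIZE - 1u)) == 0u);

    for (unsigned mask = 1u; mask < WAVE_SIZE; mask <<= 1) {
        if (!(bits & mask))
            continue;                 /* this index bit is not reduced over */
        shuffle_xor(vals, tmp, mask); /* one cross-lane exchange per bit */
        for (unsigned lane = 0; lane < WAVE_SIZE; ++lane)
            vals[lane] += tmp[lane];
    }
}
```

Each pattern needs three exchange steps for 64 lanes, one per selected index bit. In a real kernel each step would be a single cross-lane operation rather than a round trip through LDS.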

Jul 12 '18 00:07 axeldavy

There is an AMD extension for that, if I recall correctly; it was in the 2.7 branch of the AMD SDK.

Jul 22 '18 15:07 ghost

If there used to be such an extension, it doesn't seem to be there anymore (and I was unable to find any info on it).

Jul 22 '18 16:07 axeldavy

Thanks @axeldavy for reaching out. I will check with the OpenCL team and get back to you ASAP. Thank you.

Jan 07 '21 06:01 ROCmSupport

Is this still an issue? If not, can we please close it?

Dec 08 '23 17:12 tasso

To the best of my knowledge, this is still an issue. Yours.

Dec 08 '23 19:12 axeldavy

Thanks for the reply!

@ROCmSupport Have we got a response from the OpenCL team? If so, what was their response? Also, please advise on next steps. Thanks!

Dec 08 '23 19:12 tasso

@axeldavy, I have reached out to the internal team for feedback. Extending OpenCL is on their TODO list but at a low priority. We are currently keeping this ticket open and will revisit it in 2024 Q2. Thanks.

Jan 02 '24 16:01 nartmada

@axeldavy, unfortunately extending OpenCL is still a low priority. We will keep this ticket open and will revisit the priority in 2025. Thanks.

Oct 16 '24 21:10 nartmada

Damn. Things like this, almost a decade later, are what motivate people to go to other vendors. Cool algorithms, such as the single-kernel scan for GPUs that offer no forward progress guarantee, from a paper released a few weeks ago, would take advantage of this.

Aug 20 '25 05:08 mambastudio