llvm-project icon indicating copy to clipboard operation
llvm-project copied to clipboard

[Feature]: Compiler option for expanding s_waitcnt instructions or don’t merge them in the first place

Open bbiiggppiigg opened this issue 1 year ago • 6 comments

Suggestion Description

From the kernels we have studied, it is not uncommon to see the use of a single waitcnt instruction to wait for multiple load instructions to finish loading their values.

Once the PC-sampling feature is enabled on these kernels, we expect to see a non-negligible amount of pc-samples reported at the waitcnt instructions.

In order to figure out which load instruction might be a/the bottleneck, it would be nice to have a compiler option that expands a single waitcnt instruction for value N into a series of waitcnt instructions with decreasing value from N+k, N+k-1, … N, when the waitcnt instruction is waiting for k loads to complete.

Please note that we are NOT looking for the existing compiler option -amdgpu-waitcnt-forcezero that adds an s_waitcnt(0) after every instruction, as we still want to hide the memory load latency with compute instructions as much as possible.

Operating System

No response

GPU

MI200 / MI250 / MI300

ROCm Component

No response

bbiiggppiigg avatar Feb 23 '24 19:02 bbiiggppiigg