
[Codegen][GPU] Adding scheduling barrier between compute and write stage in prefetcher

Open · jerryyin opened this pull request 6 months ago · 7 comments

Motivation: @nirvedhmeshram and I observed a fair amount of misplacement of s_waitcnt instructions by the amdgpu backend. The backend tends to place the s_waitcnt too early, sometimes in between MFMA instructions. This causes unnecessary stalls between compute and load, which should instead overlap so that the load latency can be hidden.

This PR adds a scheduling barrier between the compute and write stages so that the backend compiler can no longer freely place the s_waitcnt earlier, in the compute stage. The earliest it is allowed to place the wait is the start of the write stage.
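To make the placement concrete, here is a minimal hand-written sketch of one pipelined loop iteration (not IR taken from this PR; shapes and types are simplified, a plain vector add stands in for the MFMA chain, and the `amdgpu.sched_barrier` form of the op is assumed):

```mlir
// Schematic only: %global / %shared and all shapes are placeholders.
func.func @prefetched_loop(%global: memref<128xf32>,
                           %shared: memref<8xf32, #gpu.address_space<workgroup>>) {
  %c0 = arith.constant 0 : index
  %c8 = arith.constant 8 : index
  %c128 = arith.constant 128 : index
  %pad = arith.constant 0.0 : f32
  %init = arith.constant dense<0.0> : vector<8xf32>
  %sum = scf.for %i = %c0 to %c128 step %c8
      iter_args(%acc = %init) -> (vector<8xf32>) {
    // Read stage: prefetch the next tile from global memory into registers.
    %next = vector.transfer_read %global[%i], %pad
        : memref<128xf32>, vector<8xf32>

    // Compute stage: consume the tile already resident in shared memory
    // (the add is a stand-in for the chain of MFMA ops in the real kernel).
    %cur = vector.transfer_read %shared[%c0], %pad
        : memref<8xf32, #gpu.address_space<workgroup>>, vector<8xf32>
    %acc_next = arith.addf %acc, %cur : vector<8xf32>

    // Scheduling barrier added by this PR: the backend can no longer hoist
    // the s_waitcnt for %next above this point, into the compute stage.
    amdgpu.sched_barrier allow = <none>

    // Write stage: store the prefetched tile to shared memory for the next
    // iteration.
    vector.transfer_write %next, %shared[%c0]
        : vector<8xf32>, memref<8xf32, #gpu.address_space<workgroup>>
    gpu.barrier
    scf.yield %acc_next : vector<8xf32>
  }
  return
}
```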

I've verified through thread trace that the scheduling barrier does what is intended and that the s_waitcnt placement is no longer interleaved with MFMA instructions.

Looking at the CI results, this seems to provide a small overall lift across the model-level tests. Comparison target (tip of main): https://github.com/iree-org/iree/actions/runs/15767966101/job/44447951812

| Benchmark | Before | After |
| --- | --- | --- |
| llama 8b_f16_decode | 10.020865534565278 ms | 9.531815848439127 ms |
| llama 8b_f16_prefill | 39.90881894197729 ms | 39.257620445763074 ms |
| sdxl clip | 8.061422783367593 ms | 7.982292237945579 ms |
| sdxl e2e | 286.39431609772146 ms | 285.5649597477168 ms |
| sdxl punet_int8_fp16 | 40.25666919153404 ms | 39.64519846356578 ms |
| sdxl punet_int8_fp8 | 42.10774295360727 ms | 40.13837954467711 ms |
| sdxl unet_fp16 | 68.40701173059642 ms | 68.04983000271024 ms |
| sdxl vae | 68.57368536293507 ms | 67.36758409999312 ms |

jerryyin · Jun 20 '25

@kuhar Good point. I think the scheduling barrier has existed for long enough that all the ROCm-side targets we care about (CDNA and RDNA included) support it.

Regarding sched_barrier placement, I think it will be hard to do this reliably once the IR has been lowered to the ROCDL dialect level. We would have to re-parse the entire IR to recognize which section belongs to which stage, effectively duplicating this pass at a much later stage.

I'm not sure I fully understand the part of the suggestion related to the generic barrier op. Could you elaborate a bit on that? The most intuitive thought I have right now is to add a condition check on the TargetAttr and only add the scheduling barrier when it is an amdgpu target. Does that sound good enough to you?

jerryyin · Jun 20 '25

I think the scheduling barrier has existed for long enough that all the ROCm-side targets we care about (CDNA and RDNA included) support it.

This is different from the issue I raised. IIUC, you haven't verified that the same barrier placement is beneficial on non-CDNA targets. The motivation is to guide the backend towards the waitcnt scheduling you like, but gfx10+ uses a more fine-grained waitcnt mechanism. See https://gpuopen.com/download/RDNA_Architecture_public.pdf slides 37-51.

I'm not sure I fully understand the part of the suggestion related to the generic barrier op. Could you elaborate a bit on that?

Barrier placement already uses a target-independent dialect op for this: gpu.barrier. This allows us to emit it in generic passes like the prefetching pass. One alternative is to add a similar op in IREE/MLIR that later lowers to the amdgpu scheduling barrier op.
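For concreteness, a tiny sketch of that alternative (the generic op name below is made up purely for illustration; no such op exists today), where the generic hint lowers to the existing AMD scheduling barrier on ROCm targets and simply folds away elsewhere:

```mlir
// Hypothetical target-independent scheduling hint emitted by the prefetcher.
my_dialect.sched_barrier

// After lowering on AMD targets: the existing AMDGPU scheduling barrier,
// which stops the backend scheduler from reordering across this point.
amdgpu.sched_barrier allow = <none>

// On non-AMD targets the generic hint op would simply be erased (a no-op).
```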

kuhar · Jun 20 '25

gfx10+ uses a more fine-grained waitcnt

Good to know and thanks for the slides! Although I doubt how much difference that will make, because the bottleneck of conv/gemm is usually in the buffer_loads, so I'm not sure separating the load and store counts is as useful. At this location in particular, stores haven't been issued yet; they will only be issued in the write stage.

One alternative is to add a similar op in IREE/mlir that later lowers to the rocdl scheduling barrier op.

IIRC, one distinction between sched_barrier and gpu.barrier is that sched_barrier only exists in the amdgpu backend. So I'm really hesitant about adding an op that applies to only one particular target and is a no-op on the others.

Having gone through this useful discussion, I lean towards applying the original proposal of a straightforward conditional check, with maybe a TODO for us to come back and generalize or delete it as needed. Let me know if you have any objections to this. Thanks!

jerryyin · Jun 20 '25

Having gone through this useful discussion, I lean towards applying the original proposal of a straightforward conditional check, with maybe a TODO for us to come back and generalize or delete it as needed. Let me know if you have any objections to this. Thanks!

My remaining objection is that this seems like a layering violation. If you look at the other target-agnostic passes under llvmgpu, none of the existing ones emits target-specific intrinsics. We do have one pass, LLVMGPUVectorLowering.cpp, that includes amdgpu patterns, and that seems to me like a bug we missed: https://github.com/iree-org/iree/blob/994a716bb428bff484e519e09c8fcd2ff5a44693/compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPUVectorLowering.cpp#L130

kuhar · Jun 20 '25

I can see why you think that is the case. BTW, that pass is also from me. I can open a new ticket to address that and we can continue to discuss it separately, so I don't end up being the guy who disrupts the IREE coding standards :-p.

For the context of this PR, what would you suggest as a way to implement this without a layering violation? Do you agree with my concerns about building a target-independent top-level op that is actually unique to the amdgpu target? We can host a separate meeting if that turns out to be necessary.

jerryyin · Jun 20 '25

I can open a new ticket to address that and we can continue to discuss it separately, so I don't end up being the guy who disrupts the IREE coding standards :-p.

SG.

what would you suggest as a way to implement this without a layering violation? Do you agree with my concerns about building a target-independent top-level op that is actually unique to the amdgpu target?

I left a couple of suggestions at the end of this comment: https://github.com/iree-org/iree/pull/21151#pullrequestreview-2946664801 . IMO, if having a generic op that may end up being a nop feels weird, emitting a target-specific op in this pass is 10x worse -- we are adding very low-level hints relatively high up in the compilation pipeline. To me, having a target-independent op is the lesser evil: it fixes layering violations at the level of MLIR dialects / the build system, and it lets you delay deciding what to do with the op until target information is available and usable. Alternatively, make this a rocm pass if you don't care about nvidia anyway.

kuhar · Jun 20 '25

Ok, understood. I think either way we'd end up with a decent amount of plumbing to get it right. Thanks for the input.

For a), creating and emitting a more generic op: the benefit is that with it we can place a scheduling barrier almost anywhere at the high level without layering violations. This sounds like useful infrastructure, since we are starting to find more cases where a scheduling barrier is useful. It'd be weird for non-amdgpu targets, but I couldn't care less about them.

For b), I don't want non-AMD targets to suddenly lose the ability to prefetch just because I'm adding a fairly trivial scheduling barrier. So this would become a parent base prefetcher and a child AMD prefetcher, and somewhere in the lowering pipeline I'd still need to branch and make sure the right prefetcher is used. It is undesirable in that this prefetcher will eventually go away, so this sounds like a fair amount of throw-away work. Also, I don't think the top-level pipeline makes such a clear cut between the AMD and NVIDIA backends, so this might very likely become a significant amount of refactoring work and could impact the fremont timelines I'm working with.

I'm leaning towards a) based on the reasoning above, if I have to choose one of them. Before doing the work, I'd propose bringing this to a codegen sync meeting so I can get more buy-in from the team.

jerryyin · Jun 20 '25

Note: this PR has been rebased after the #21190 merge and is now ready for a second round of review.

I double-checked the model perf; it turns out the majority of the numbers are within the error margin. I mostly see an improvement in the punet_int8_fp8 model (41.194085639846676 ms -> 39.53284363572797 ms). This is against the tip of main before this PR.

jerryyin · Jun 27 '25