[GPU][Codegen] Expand iteration space based on new `expand_dims` attribute
This patch introduces iteration space expansion for reductions in the VectorDistribute path.
Specifically, we:
- Add a new attribute, `expand_dims`, for reductions.
- Introduce a new pass, `GPUExpandDimensions`, which uses `expand_dims` to expand the iteration space of relevant dimensions.
- Refactor common functionality shared between `GPUExpandDimensions` and `BlockDynamicDimensions` into reusable utilities.
- Refactor encoding helpers from `EncodingAttrs.cpp` into reusable utilities.
This change also enables chain FMA in matvec codegen as we iterate along the K reduction dimension.
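To illustrate the idea (a hypothetical NumPy sketch, not the pass itself; `K` and `TILE` are made-up values): expanding a reduction dimension `K` into `(K / TILE, TILE)` splits one long reduction into an outer reduction over tiles and an inner reduction within each tile, and the inner tile-wide reduction is the shape that a chain of FMAs can consume.

```python
# Hypothetical sketch of iteration-space expansion for a reduction
# (illustration only; K and TILE are assumed values, not IREE code).
import numpy as np

K, TILE = 256, 64
a = np.arange(K, dtype=np.float64)
b = np.ones(K, dtype=np.float64)

# Original iteration space: a single reduction over K.
ref = float(np.dot(a, b))

# Expanded iteration space: K is reshaped into (K // TILE, TILE) and
# reduced in two steps; the inner TILE-wide reduction is the part that
# would map onto chained FMAs in generated matvec code.
prod = (a * b).reshape(K // TILE, TILE)
expanded = float(prod.sum(axis=1).sum(axis=0))

print(ref, expanded)  # both 32640.0 for this integer-valued data
```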
Performance Summary
IREE benchmark module
- Only expansion: ~4% improvement
- Expansion + chain FMA: ~11% improvement
rocprof
- Only expansion: ~13% worse
- Expansion + chain FMA: ~9% better
Register usage
- 10% reduction (60 → 54 registers for matvec dispatches)
Instruction latency (post-reduction loop epilogue)
- 3.5% improvement (340 → 328 total mean latency)
Notes
- As a follow-up, we can explore applying iteration space expansion to the reduction in attention.
- Right now, we only expand one dimension into two, although the implementation supports expansion to N dimensions.
- Please note this PR changes the reduction order, so expect some minor changes in the numerics.
- On its own, this does not improve performance and can cause regressions; it needs chain FMA (https://github.com/iree-org/iree/pull/21855).
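A small standalone float32 demo (assumed data, not from the PR) of why reordering a reduction perturbs the numerics: float addition is not associative, so the sequential and the blocked (expanded) orders agree only up to rounding.

```python
# Standalone float32 demo (not IREE code): the same data reduced in two
# different association orders differs in the low bits.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

# One long sequential accumulation chain.
seq = np.float32(0.0)
for v in x:
    seq += v

# Blocked (expanded) order: reduce 64-wide tiles first, then the tiles.
blocked = x.reshape(-1, 64).sum(axis=1).sum()

# Both are valid reductions of the same values; only the rounding from
# the association order differs.
print(float(seq), float(blocked), abs(float(seq) - float(blocked)))
```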
Traces for matvec dispatches are attached for all variations (original, only expansion, and expansion + chain FMA).
115_expansion_and_chain.tar.gz 115_nothing.tar.gz 115_only_expansion.tar.gz
Fixes: #22153
I’ve included all changes in this PR for now to show everything together. Update: the NFC refactor bits have been factored out into a separate PR for easier review.
@efric I added a `ci-extra` trailer to run `test_torch`; can you check?