[GPU] Attention GPU Codegen
This issue tracks the tasks needed to properly support the attention op in GPU codegen targeting tensor cores. An initial proof-of-concept PR that works end-to-end on GPUs with MFMA: https://github.com/iree-org/iree/pull/17212
Short Term Enablement:
- [ ] LinalgFoldUnitExtentDimsPass ignores DPS-style code and moves the `outs` operands of a linalg operation to `ins`. Preserving the DPS form through this pass is a short-term solution; see [1] for more information and the sketch after this list.
- [ ] Make GPUVectorAlloc controllable. In Flash Attention 2 there are two contractions, but for AMDGPU only 2 of the 4 possible operands of those contractions should be promoted.
- [ ] Implement PartialReductionInterface for linalg_ext.attention, and turn TileAndDecomposeAttentionPass into just a DecomposeAttentionPass
- [ ] Split GPUVectorDistribution into 3 different passes: SetAnchors, ResolveConflicts, Distribute
- [ ] Implement conflict resolution with a trip to shared memory for WMMA operations
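To make the first item concrete, here is a rough sketch of the problematic rewrite. The function names, tensor shapes, and the `addf` payload are made up for illustration; the exact output of LinalgFoldUnitExtentDimsPass may differ, but the effect on the DPS chain is the same.

```mlir
// Before: the accumulator flows through `outs`, so the result stays tied to %acc.
func.func @before(%update: tensor<32x64xf32>, %acc: tensor<32x64xf32>)
    -> tensor<32x64xf32> {
  %0 = linalg.generic {
      indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                       affine_map<(d0, d1) -> (d0, d1)>],
      iterator_types = ["parallel", "parallel"]}
      ins(%update : tensor<32x64xf32>)
      outs(%acc : tensor<32x64xf32>) {
    ^bb0(%u: f32, %a: f32):
      %s = arith.addf %a, %u : f32
      linalg.yield %s : f32
  } -> tensor<32x64xf32>
  return %0 : tensor<32x64xf32>
}

// After: the old accumulator is only read through `ins`, and a fresh
// tensor.empty takes its place in `outs`. The DPS chain that the attention
// decomposition relies on is gone.
func.func @after(%update: tensor<32x64xf32>, %acc: tensor<32x64xf32>)
    -> tensor<32x64xf32> {
  %empty = tensor.empty() : tensor<32x64xf32>
  %0 = linalg.generic {
      indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                       affine_map<(d0, d1) -> (d0, d1)>,
                       affine_map<(d0, d1) -> (d0, d1)>],
      iterator_types = ["parallel", "parallel"]}
      ins(%update, %acc : tensor<32x64xf32>, tensor<32x64xf32>)
      outs(%empty : tensor<32x64xf32>) {
    ^bb0(%u: f32, %a: f32, %out: f32):
      %s = arith.addf %a, %u : f32
      linalg.yield %s : f32
  } -> tensor<32x64xf32>
  return %0 : tensor<32x64xf32>
}
```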
Long Term Enablement:
Longer term, we would want to do subgroup distribution and shared memory allocation at the linalg level with something like https://github.com/iree-org/iree/issues/17148, and move the SetAnchors pass to before vectorization.
[1]: The way we currently do vectorization is local: it wraps each linalg operation's operands/results in transfer_read/transfer_write and vectorizes the operation. This works well until we have loops; chaining these transfer_read/transfer_write ops through loops doesn't always work. To make it work today, the attention decomposition keeps the linalg operations in DPS style and hopes that this form isn't removed before we vectorize. Ideally, vectorization would do an analysis determining which tensors should become vectors, which would avoid these spurious allocations.
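For reference, a minimal sketch of the form we would like a loop-carried accumulator to end up in after vectorization: read once before the loop, carried as a vector iter_arg, and written back once after the loop. The function name, shapes, and the simple add payload standing in for the real attention update are illustrative, not the actual pipeline output; if the DPS link from the loop iter_arg to the linalg op's `outs` is broken before vectorization, the final write targets a fresh tensor instead and an extra allocation gets materialized.

```mlir
func.func @vectorized_acc(%init: tensor<128xf32>, %updates: tensor<8x128xf32>)
    -> tensor<128xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c8 = arith.constant 8 : index
  %pad = arith.constant 0.0 : f32
  // Read the accumulator once before the loop and carry it as a vector.
  %v_init = vector.transfer_read %init[%c0], %pad
      : tensor<128xf32>, vector<128xf32>
  %v_res = scf.for %iv = %c0 to %c8 step %c1 iter_args(%v_acc = %v_init)
      -> (vector<128xf32>) {
    %v_u = vector.transfer_read %updates[%iv, %c0], %pad
        : tensor<8x128xf32>, vector<128xf32>
    // Stand-in for the per-iteration attention update.
    %v_next = arith.addf %v_acc, %v_u : vector<128xf32>
    scf.yield %v_next : vector<128xf32>
  }
  // Write the accumulator back once after the loop; no per-iteration round
  // trip through a tensor and no extra allocation is needed.
  %res = vector.transfer_write %v_res, %init[%c0]
      : vector<128xf32>, tensor<128xf32>
  return %res : tensor<128xf32>
}
```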