# [GPU][Codegen] LLVMGPU Matvec Codegen Tracker

This issue tracks pending work for matvec codegen on LLVMGPU using the VectorDistribute pipeline.

## Performance
TODO: Add traces with commits
Tracking dispatch: https://gist.github.com/Groverkss/60d0cc3c59a424650f28ac72198e48d2
- [x] Determine correct load bitwidth for mixed precision matvecs (currently using fp32 bitwidth instead of fp16 bitwidth)
- [x] Use better configuration logic for multi-row matvec (in the tracking dispatch, increasing workgroup tile sizes from 2 -> 4 halves the runtime on RDNA3 GPUs)
- [x] Improve heuristic selection logic (the original logic hardcodes values tuned for RDNA3, which it was written for)
- [ ] Improve reduction lowering https://github.com/iree-org/iree/issues/21483
## Deprecate Warp Reduction pipeline
- [ ] Completion tracked in sub-issue https://github.com/iree-org/iree/issues/21421
Ideas from yesterday's discussion with @Groverkss:
- The new parallel partial-reduction dimension should be outermost in the output
- Split-k across subgroups on the reduction dimension for tall-and-skinny GEMMs (see the sketch after this list)
- Expand the reduction dimension into two dimensions and partially tile only one of them to get better FMA instructions; make vector.transfer_write distribution store only from the leader lane
- Add a paged attention matvec benchmark to the matvec issue
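A minimal NumPy sketch of the split-k idea (illustrative only, not IREE code; shapes and the `a @ x` matvec are assumptions): split K into chunks, reduce each chunk independently with the new split dimension outermost in the partial output, then combine.

```python
import numpy as np

# Illustrative split-k matvec (plain NumPy, not IREE code): split K into
# `s` chunks, reduce each chunk independently, and keep the new parallel
# dimension outermost in the partial output, per the first bullet above.
M, K, s = 4, 1024, 8
a = np.random.rand(M, K).astype(np.float32)
x = np.random.rand(K).astype(np.float32)

a_split = a.reshape(M, s, K // s).transpose(1, 0, 2)  # (s, M, K/s)
x_split = x.reshape(s, K // s)                        # (s, K/s)
partial = np.einsum('smk,sk->sm', a_split, x_split)   # (s, M): split dim outermost
result = partial.sum(axis=0)                          # final cross-chunk reduction

assert np.allclose(result, a @ x, rtol=1e-3)
```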
Raw notes about lowering_config semantics for matvecs: https://gist.github.com/Groverkss/015ed5af8db6e804bdf560fc35db1d4f
```
workgroup = []
partial_reduction = []
subgroup_basis --> numSubgroups[dim]
thread_basis --> numThreads[dim]
thread --> vectorSize[dim]
```

For each dim (coming from either `workgroup` or `partial_reduction`), one distribution step covers `numSubgroups[dim] * numThreads[dim] * vectorSize[dim]` elements.
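A tiny Python sketch of that product (hypothetical helper, not IREE code), checked against the K dim of the example config further below:

```python
# Hypothetical helper (not IREE code): how many sequential outer-loop
# iterations each thread needs to cover one tile along a given dim.
def iterations_per_thread(tile_size: int, num_subgroups: int,
                          num_threads: int, vector_size: int) -> int:
    per_step = num_subgroups * num_threads * vector_size  # elements per step
    assert tile_size % per_step == 0
    return tile_size // per_step

# K dim of the config below: 8192 over 8 subgroups x 64 threads x vector
# size 8 -> 2 outer-loop iterations.
print(iterations_per_thread(8192, 8, 64, 8))  # 2
```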
```
(grid, mapping)

grid_count:  1 x 4 x 2 x 2
grid_stride: 16  4   2   1
mapping 3, 1, 2 --> M, N, K sizes: 2, 4, 2
mapping 0, 1, 2 --> M, N, K sizes: 1, 4, 2

grid_count:  1 x 2 x 2
grid_stride: 4   2   1
M, N, K: 1, 1, 8
```
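A Python sketch of the (grid, mapping) scheme as read from these notes (assumed semantics, not IREE code): delinearize a flat id over the grid counts, then the mapping selects which component drives each loop dimension.

```python
# Assumed semantics of (grid_count, mapping), not IREE code.
def delinearize(flat_id: int, grid_count: list[int]) -> list[int]:
    ids = []
    for size in reversed(grid_count):  # innermost dim has stride 1
        ids.append(flat_id % size)
        flat_id //= size
    return list(reversed(ids))

grid_count = [1, 4, 2, 2]           # strides 16, 4, 2, 1
mapping = [3, 1, 2]                 # which component drives M, N, K
ids = delinearize(13, grid_count)   # 13 = 3*4 + 0*2 + 1*1 -> [0, 3, 0, 1]
print([ids[m] for m in mapping])    # [1, 3, 0] = (M, N, K) indices
```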
------

```
grid: 1 2 2
mapping: 2 1
subgroup_basis = [[1, 2, 2], [2, 1]]
```

```mlir
%sgid = gpu.lane_id
%vsgid:3 = affine.delinearize_index %sgid into (1, 2, 2) : index, index, index
%vs0 = %vsgid#2
%vs1 = %vsgid#1
```
```
subgroup_basis: 1 1 1
thread_basis:   1 1 4
thread:         1 1 2
total:          1 1 8

tid0: 0 1
tid1: 2 3
tid2: 4 5
tid3: 6 7
elements: 0 1 2 3 4 5 6 7
```
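The same assignment in a few lines of Python (assumed contiguous distribution, not IREE code):

```python
# Assumed contiguous per-thread distribution (not IREE code): with 4
# threads (thread_basis K = 4) and vector size 2 (thread K = 2), thread t
# owns elements [2t, 2t + 1] of the 8-element tile, matching the table.
num_threads, vector_size = 4, 2
for tid in range(num_threads):
    elems = [tid * vector_size + v for v in range(vector_size)]
    print(f"tid{tid}: {elems}")
# tid0: [0, 1] ... tid3: [6, 7]
```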
A fuller example:

```mlir
{lowering_config = #iree_gpu.lowering_config<{
  workgroup = [4, 1, 0],
  partial_reduction = [0, 0, 8192],
  subgroup_basis = [[1, 1, 8], [0, 1, 2]],
  thread_basis = [[1, 1, 64], [0, 1, 2]],
  thread = [2, 0, 8]
}>}
```
| Tile level | M | N | K |
| --- | --- | --- | --- |
| per workgroup | 4 | 1 | 8192 |
| per subgroup (÷ subgroup_basis counts) | 4 | 1 | 1024 |
| per outer-loop step (thread_basis × thread) | 2 | 1 | 512 |
| per thread (thread) | 2 | 1 | 8 |
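A Python sketch that re-derives this table from the config fields (assumed semantics, not IREE code; zeros in `workgroup`/`thread` are treated as untiled/unit):

```python
# Re-derive the per-level tile sizes above (assumed semantics, not IREE code).
workgroup         = [4, 1, 0]
partial_reduction = [0, 0, 8192]
subgroup_counts   = [1, 1, 8]   # subgroup_basis counts
thread_counts     = [1, 1, 64]  # thread_basis counts
vector_sizes      = [2, 1, 8]   # `thread`, with 0 read as 1

per_workgroup = [w or r for w, r in zip(workgroup, partial_reduction)]
per_subgroup  = [t // s for t, s in zip(per_workgroup, subgroup_counts)]
per_outer     = [n * v for n, v in zip(thread_counts, vector_sizes)]

print(per_workgroup)  # [4, 1, 8192]
print(per_subgroup)   # [4, 1, 1024]
print(per_outer)      # [2, 1, 512]
print(vector_sizes)   # [2, 1, 8]  -> per thread, per step
```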
Added https://github.com/iree-org/iree/issues/21483 to the top-level checklist