
[GPU][Codegen] LLVMGPU Matvec Codegen Tracker

Open Groverkss opened this issue 5 months ago • 3 comments

This issue tracks pending work for matvec codegen on LLVMGPU using the VectorDistribute pipeline.

Performance

TODO: Add traces with commits

Tracking dispatch: https://gist.github.com/Groverkss/60d0cc3c59a424650f28ac72198e48d2

  • [x] Determine the correct load bitwidth for mixed-precision matvecs (currently using the fp32 bitwidth instead of the fp16 bitwidth)
  • [x] Use better configuration logic for multi-row matvec (in the tracking dispatch, increasing workgroup tile sizes from 2 to 4 halves the runtime on RDNA3 GPUs)
  • [x] Improve heuristic selection logic (the original logic has values hardcoded for RDNA3, the target it was written for)
  • [ ] Improve reduction lowering https://github.com/iree-org/iree/issues/21483

Deprecate Warp Reduction pipeline

  • [ ] Completion tracked in sub-issue https://github.com/iree-org/iree/issues/21421

Groverkss · Jul 08 '25

Ideas from yesterday's discussion with @Groverkss:

  1. The new parallel partial-reduction dimension should be outermost in the output.
  2. Split-k across subgroups on the reduction dimension for tall-skinny GEMMs.
  3. Expand the reduction dimension into two, and partially tile only one of them to get better FMA instructions; make vector.transfer_write distribution store only from the leader lane.
  4. Add a paged-attention matvec benchmark to the matvec issue.

kuhar · Jul 18 '25

Raw notes about lowering_config semantics for matvecs: https://gist.github.com/Groverkss/015ed5af8db6e804bdf560fc35db1d4f


Each dimension is tiled either at the workgroup level or as a partial reduction:

workgroup         = []
partial_reduction = []

For each such dimension dim, the distribution fields mean:

subgroup_basis --> numSubgroups[dim]
thread_basis   --> numThreads[dim]
thread         --> vectorSize[dim]

and one distributed step along dim covers
numSubgroups[dim] * numThreads[dim] * vectorSize[dim] elements.
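As a quick sanity check, a Python sketch of that accounting (hypothetical names like `elements_per_step`, not IREE code):

```python
def elements_per_step(num_subgroups, num_threads, vector_size):
    """Elements covered along each dimension by one distributed step:
    numSubgroups[dim] * numThreads[dim] * vectorSize[dim]."""
    return [s * t * v
            for s, t, v in zip(num_subgroups, num_threads, vector_size)]

# Example reused from the notes below: 1 subgroup, 4 threads, vectors of 2
# along the innermost dim -> one step covers [1, 1, 8].
assert elements_per_step([1, 1, 1], [1, 1, 4], [1, 1, 2]) == [1, 1, 8]
```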

A basis is a (grid, mapping) pair. The grid counts define a linearized id
space; the strides are the row-major suffix products of the counts:

grid_count:  1 x 4 x 2 x 2
grid_stride: 16  4   2   1

mapping: 3, 1, 2

M, N, K
2  4  2
1  4  2

grid_count:  1 x 2 x 2
grid_stride: 4   2   1

M, N, K
1  1  8
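The stride rows follow mechanically from the counts; a small Python sketch (assuming row-major suffix products, which matches both examples):

```python
def grid_strides(grid_counts):
    """Row-major strides: stride[i] is the product of all counts after i."""
    strides = [1] * len(grid_counts)
    for i in range(len(grid_counts) - 2, -1, -1):
        strides[i] = strides[i + 1] * grid_counts[i + 1]
    return strides

# Matches the two stride rows above.
assert grid_strides([1, 4, 2, 2]) == [16, 4, 2, 1]
assert grid_strides([1, 2, 2]) == [4, 2, 1]
```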

------

grid:    1 2 2
mapping: 2 1

subgroup_basis = [[1, 2, 2], [2, 1]]

%sgid = gpu.lane_id
%vsgid:4 = affine.delinearize_index %sgid into (1, 2, 2)

%vs0 = %vsgid#2
%vs1 = %vsgid#1

The mapping [2, 1] picks delinearized components 2 and 1 for the two
distributed dimensions.
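A Python model of the delinearize-plus-mapping step, under my reading of the snippet above (`delinearize` and `apply_mapping` are illustrative names, not IREE APIs):

```python
def delinearize(linear_id, grid_counts):
    """Split a linear id into one index per grid count (row-major order)."""
    ids = []
    for count in reversed(grid_counts):
        ids.append(linear_id % count)
        linear_id //= count
    return list(reversed(ids))

def apply_mapping(ids, mapping):
    """mapping [2, 1] picks components 2 and 1, mirroring
    %vs0 = %vsgid#2 and %vs1 = %vsgid#1 above."""
    return [ids[i] for i in mapping]

ids = delinearize(3, [1, 2, 2])           # lane 3 -> [0, 1, 1]
assert apply_mapping(ids, [2, 1]) == [1, 1]
```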


With 1 subgroup, 4 threads, and per-thread vectors of 2 along the innermost
dimension, one step covers 1 x 1 x 8 elements:

1 1 1 --> subgroup_basis
1 1 4 --> thread_basis
1 1 2 --> thread
-------
1 1 8

Each thread owns a contiguous vector of 2 of the 8 elements (0 1 2 3 4 5 6 7):

tid0: 0 1
tid1: 2 3
tid2: 4 5
tid3: 6 7
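The tid table follows from each thread owning one contiguous vector; a minimal sketch (`owned_elements` is a hypothetical helper):

```python
def owned_elements(tid, vector_size):
    """Contiguous elements owned by a thread when each thread holds
    one vector of vector_size elements."""
    start = tid * vector_size
    return list(range(start, start + vector_size))

assert owned_elements(1, 2) == [2, 3]     # tid1: 2 3

# 4 threads with vectors of 2 cover elements 0..7 with no gaps or overlap.
coverage = [e for tid in range(4) for e in owned_elements(tid, 2)]
assert coverage == list(range(8))
```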

{lowering_config = #iree_gpu.lowering_config<{
  workgroup = [4, 1, 0],
  partial_reduction = [0, 0, 8192],
  subgroup_basis = [[1, 1, 8], [0, 1, 2]],
  thread_basis = [[1, 1, 64], [0, 1, 2]],
  thread = [2, 0, 8]
}>}

per workgroup (workgroup merged with partial_reduction):

4 1 8192

per subgroup (divided by the subgroup_basis counts):

4 1 1024

per outer loop iteration (thread_basis counts * thread sizes):

2 1 512

per thread (thread):

2 1 8
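The four levels can be reproduced mechanically from the config; a Python sketch of one plausible reading (the max-merge of workgroup and partial_reduction, and treating 0 as "untiled", are my assumptions):

```python
def nz(x):
    # The config uses 0 for "dimension not tiled at this level"; treat as 1.
    return x if x != 0 else 1

workgroup         = [4, 1, 0]
partial_reduction = [0, 0, 8192]
subgroup_counts   = [1, 1, 8]    # subgroup_basis counts
thread_counts     = [1, 1, 64]   # thread_basis counts
thread            = [2, 0, 8]    # per-thread vector sizes

# Per workgroup: workgroup and partial_reduction sizes merged per dim.
per_workgroup = [max(w, p) for w, p in zip(workgroup, partial_reduction)]
assert per_workgroup == [4, 1, 8192]

# Per subgroup: the workgroup tile divided by the subgroup counts.
per_subgroup = [t // nz(s) for t, s in zip(per_workgroup, subgroup_counts)]
assert per_subgroup == [4, 1, 1024]

# Per outer-loop iteration: thread counts times per-thread vector sizes.
per_iter = [nz(t) * nz(v) for t, v in zip(thread_counts, thread)]
assert per_iter == [2, 1, 512]

# Per thread: the vector sizes themselves.
assert [nz(v) for v in thread] == [2, 1, 8]
```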

kuhar · Jul 18 '25

Added https://github.com/iree-org/iree/issues/21483 to the top-level checklist

kuhar · Jul 24 '25