Kunwar Grover
This patch adds support for lowering attention through the VectorDistribution pipeline. Currently, it has the following limitations, which will be fixed as followups:
- Only 1 subgroup is used
- There...
Depends on: https://github.com/iree-org/iree/pull/17536
This pass is similar to loop-invariant code motion, but looks for loop-invariant subsets instead. This pass is required for post-vectorization cleanups.
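To illustrate the idea, here is a hypothetical Python analogy (not the actual MLIR transformation): a loop that reads and writes the same subset of a buffer on every iteration can have that subset read hoisted before the loop and the write sunk after it.

```python
def before(buffer, n):
    # The subset buffer[0:4] is read and written on every iteration,
    # even though the accessed subset itself never changes.
    for _ in range(n):
        tile = buffer[0:4]                 # loop-invariant subset read
        tile = [x + 1 for x in tile]
        buffer[0:4] = tile                 # loop-invariant subset write
    return buffer

def after(buffer, n):
    # What subset hoisting produces: read the subset once before the
    # loop, iterate on the temporary, and write it back once after.
    tile = buffer[0:4]
    for _ in range(n):
        tile = [x + 1 for x in tile]
    buffer[0:4] = tile
    return buffer
```

Both forms compute the same result; the hoisted form avoids the redundant per-iteration subset traffic, which is the cleanup this pass targets after vectorization.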
Depends on https://github.com/iree-org/iree/pull/17626
This patch adds support for distributing attention to multiple subgroups. Some points to note:
- Due to some issues with layout analysis, we cannot yet do multiple n subgroups. This...
Clipping or clamping is defined as:
```
clip(x, min_value, max_value) = min(max(x, min_value), max_value)
```
Some backends can generate better instructions if it's known we are clamping a value. For...
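The definition above can be sketched directly in Python; the function name `clamp` is just an illustration, not an API from the patch:

```python
def clamp(x, min_value, max_value):
    # clip(x, min_value, max_value) = min(max(x, min_value), max_value):
    # first raise x to at least min_value, then cap it at max_value.
    return min(max(x, min_value), max_value)

print(clamp(5, 0, 10))   # in range, unchanged: 5
print(clamp(-3, 0, 10))  # below range, raised to the floor: 0
print(clamp(42, 0, 10))  # above range, capped at the ceiling: 10
```

Recognizing this min-of-max pattern lets a backend emit a single clamp instruction instead of separate min and max operations.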
### Request description

# Motivation

A pattern we notice in flash attention kernels is:
```
A: tensor
B: tensor
C: tensor
D : tensor = matmul(A, B, C)
E :...
```
This PR teaches attention decomposition to set attributes for attention matmuls by passing attribute dictionaries to the iree_linalg_ext.online_attention operation. This allows us to further control codegen of matmuls (generally the root...
Since https://github.com/iree-org/iree/pull/18748, tensor.pad can be fused in with tiling. This patch combines the parallel and reduction padding passes into a single pass that pads at once, and the pads are...