Alexander Efimov
This PR:
- moves the shortcut check above the allocation code, before any scratch buffer shape is computed
- raises the priority of AMD-specific conversions over common ones
This PR:
- introduces several fixes in the FMA dot implementation
- enables support for small dots with M/N/K dimensions down to 1
- introduces a dot operand optimization for dots with...
This PR introduces:
- use of common code, simplifying the pass code
- support for 3d tensors in mfma -> dot conversion (supported in common code from the item above)
- more tests for decompose-unsupported-amd-conversions...
Casts dot arguments from unsupported FP8 types to supported FP16 in order to use MFMA instructions instead of FMA. This approach is expected to give better performance and be more stable...
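As an aside on why this widening cast is cheap, here is a minimal NumPy sketch (my own illustration, not Triton or backend code) decoding fp8 e5m2 into fp16: e5m2 shares fp16's 1-bit sign / 5-bit exponent layout, so an e5m2 byte is exactly the high byte of the corresponding fp16 bit pattern and the cast is a pure bit shift.

```python
import numpy as np

def e5m2_to_fp16(x_u8: np.ndarray) -> np.ndarray:
    """Widen fp8 e5m2 bytes to fp16.

    e5m2 truncates fp16's mantissa from 10 bits to 2, keeping the same
    sign/exponent layout, so widening just appends zero mantissa bits.
    """
    return (x_u8.astype(np.uint16) << 8).view(np.float16)

# 0x3E is 1.5 in e5m2 (fp16 1.5 = 0x3E00); 0x3C is 1.0 (fp16 0x3C00)
a = e5m2_to_fp16(np.array([0x3E, 0x3C], dtype=np.uint8))
# 0x40 is 2.0 (fp16 0x4000); 0x42 is 3.0 (fp16 0x4200)
b = e5m2_to_fp16(np.array([0x40, 0x42], dtype=np.uint8))

# After the cast the dot can run on fp16 inputs with fp32 accumulation
print(np.dot(a.astype(np.float32), b.astype(np.float32)))  # 6.0
```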
This PR:
- refactors the FMA dot implementation
- supports dot3d in the FMA path
- fixes several issues in operand offset computation
- enables small dot operands

This PR is a...
The batch dimension should be the slowest-varying one; other cases are not supported by the MFMA/WMMA/MMA pipeline.
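In stride terms (a hedged NumPy sketch, not pipeline code), "slowest-varying" means the batch axis has the largest stride, i.e. it is outermost in memory; a logical `[B, M, N]` view over memory where another axis is outermost would be rejected.

```python
import numpy as np

# Contiguous row-major [B, M, N]: batch varies slowest, as expected
x = np.zeros((4, 16, 32), dtype=np.float16)
batch_is_slowest = x.strides[0] == max(x.strides)
print(batch_is_slowest)  # True

# A [B, M, N] view over (M, B, N)-contiguous memory: here M, not the
# batch axis, has the largest stride, so batch is not slowest
y = np.zeros((16, 4, 32), dtype=np.float16).transpose(1, 0, 2)
print(y.strides[0] == max(y.strides))  # False
```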
This PR extends the shared memory bypass for blocked->dot operand conversions and adds a bypass check in DecomposeUnsupportedConversions and ReduceDataDuplication. This PR is part of a PR series. The final goal is to...
This PR:
- makes the AccelerateAMDMatmul pass emit FMA for the i8xi8->i32 and fp16xfp16->fp32 cases
- extends AMD FMA dot code generation with new v_dot instructions for fp16xfp16 and int8 dtypes

This...
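The integer case can be emulated in NumPy (an illustrative sketch with made-up values, not the generated code): v_dot-style mixed precision multiplies int8 operands but accumulates in int32, so products that would overflow int8 are still summed exactly.

```python
import numpy as np

a = np.array([[100, -120, 3]], dtype=np.int8)
b = np.array([[4], [5], [60]], dtype=np.int8)

# A pure int8 matmul would wrap around (e.g. 100 * 4 = 400 > 127);
# widening to int32 before accumulating mirrors i8xi8->i32 semantics.
out = a.astype(np.int32) @ b.astype(np.int32)
print(out)  # [[-20]]  (400 - 600 + 180)
```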
This PR introduces an optimization that hoists the reduction of the dot accumulator outside the loop over the K dimension:

```
%acc = for k tiles:
    %acc3d_input = reshape %acc
    %acc3d_out = dot3d(%x,...
```
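The algebraic identity behind the hoist can be checked with a small NumPy sketch (hypothetical shapes and tile sizes of my choosing): reducing the 3d dot's split dimension once after the K-tile loop gives the same accumulator as reducing it inside every iteration, because the two sums commute.

```python
import numpy as np

rng = np.random.default_rng(0)
S, M, K, N, T = 4, 8, 64, 8, 4   # S: split dim of the 3d dot, T: K tiles
kt = K // T
x = rng.standard_normal((S, M, K)).astype(np.float32)
y = rng.standard_normal((S, K, N)).astype(np.float32)

def dot3d(t):
    """Batched matmul of one K tile: (S, M, kt) x (S, kt, N) -> (S, M, N)."""
    return np.einsum("smk,skn->smn",
                     x[:, :, t * kt:(t + 1) * kt],
                     y[:, t * kt:(t + 1) * kt, :])

# Before: reduce the split dim inside every loop iteration
acc2d = np.zeros((M, N), dtype=np.float32)
for t in range(T):
    acc2d += dot3d(t).sum(axis=0)

# After: keep a 3d accumulator and reduce once, after the loop
acc3d = np.zeros((S, M, N), dtype=np.float32)
for t in range(T):
    acc3d += dot3d(t)
acc2d_hoisted = acc3d.sum(axis=0)

print(np.allclose(acc2d, acc2d_hoisted, atol=1e-4))  # True
```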