Alexander Efimov
This PR:
- moves the shortcut check above the allocation code, before any scratch buffer shape is computed
- raises the priority of AMD-specific conversions over common ones
This PR:
- introduces several fixes in the FMA dot implementation
- enables support for small dots with M/N/K dimensions down to 1
- introduces a dot operand optimization for dots with...
This PR introduces:
- use of common code, simplifying the pass code
- support for 3d tensors in mfma -> dot conversion (supported in common code from the item above)
- more tests for decompose-unsupported-amd-conversions...
Casts dot arguments from unsupported FP8 types to supported FP16 in order to use MFMA instructions instead of FMA. This approach is expected to give better performance and be more stable...
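As an aside on why this widening cast is cheap, here is a minimal NumPy sketch (my own illustration, not Triton or backend code) decoding fp8 e5m2 into fp16: e5m2 shares fp16's 1-bit sign / 5-bit exponent layout, so an e5m2 byte is exactly the high byte of the corresponding fp16 bit pattern and the cast is a pure bit shift.

```python
import numpy as np

def e5m2_to_fp16(x_u8: np.ndarray) -> np.ndarray:
    """Widen fp8 e5m2 bytes to fp16.

    e5m2 truncates fp16's mantissa from 10 bits to 2, keeping the same
    sign/exponent layout, so widening just appends zero mantissa bits.
    """
    return (x_u8.astype(np.uint16) << 8).view(np.float16)

# 0x3E is 1.5 in e5m2 (fp16 1.5 = 0x3E00); 0x3C is 1.0 (fp16 0x3C00)
a = e5m2_to_fp16(np.array([0x3E, 0x3C], dtype=np.uint8))
# 0x40 is 2.0 (fp16 0x4000); 0x42 is 3.0 (fp16 0x4200)
b = e5m2_to_fp16(np.array([0x40, 0x42], dtype=np.uint8))

# After the cast the dot can run on fp16 inputs with fp32 accumulation
print(np.dot(a.astype(np.float32), b.astype(np.float32)))  # 6.0
```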
This PR:
- refactors the FMA dot implementation
- supports dot3d in the FMA path
- fixes several issues in operand offset computation
- enables small dot operands

This PR is a...
The batch dimension should be the slowest-varying one; other cases are not supported by the MFMA/WMMA/MMA pipeline.
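In stride terms (a hedged NumPy sketch, not pipeline code), "slowest-varying" means the batch axis has the largest stride, i.e. it is outermost in memory; a logical `[B, M, N]` view over memory where another axis is outermost would be rejected.

```python
import numpy as np

# Contiguous row-major [B, M, N]: batch varies slowest, as expected
x = np.zeros((4, 16, 32), dtype=np.float16)
batch_is_slowest = x.strides[0] == max(x.strides)
print(batch_is_slowest)  # True

# A [B, M, N] view over (M, B, N)-contiguous memory: here M, not the
# batch axis, has the largest stride, so batch is not slowest
y = np.zeros((16, 4, 32), dtype=np.float16).transpose(1, 0, 2)
print(y.strides[0] == max(y.strides))  # False
```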
This PR extends the shared memory bypass for blocked->dot operand conversions and adds a bypass check in DecomposeUnsupportedConversions and ReduceDataDuplication. This PR is part of a PR series. The final goal is to...
This PR:
- makes the AccelerateAMDMatmul pass emit FMA for the i8xi8->i32 and fp16xfp16->fp32 cases
- extends AMD FMA dot code generation with new v_dot instructions for fp16xfp16 and int8 dtypes

This...
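The integer case can be emulated in NumPy (an illustrative sketch with made-up values, not the generated code): v_dot-style mixed precision multiplies int8 operands but accumulates in int32, so products that would overflow int8 are still summed exactly.

```python
import numpy as np

a = np.array([[100, -120, 3]], dtype=np.int8)
b = np.array([[4], [5], [60]], dtype=np.int8)

# A pure int8 matmul would wrap around (e.g. 100 * 4 = 400 > 127);
# widening to int32 before accumulating mirrors i8xi8->i32 semantics.
out = a.astype(np.int32) @ b.astype(np.int32)
print(out)  # [[-20]]  (400 - 600 + 180)
```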
This PR introduces an optimization that hoists the reduction of the dot accumulator outside the loop over the K dimension:

```
%acc = for k tiles:
    %acc3d_input = reshape %acc
    %acc3d_out = dot3d(%x,...
```
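The algebraic identity behind the hoist can be checked with a small NumPy sketch (hypothetical shapes and tile sizes of my choosing): reducing the 3d dot's split dimension once after the K-tile loop gives the same accumulator as reducing it inside every iteration, because the two sums commute.

```python
import numpy as np

rng = np.random.default_rng(0)
S, M, K, N, T = 4, 8, 64, 8, 4   # S: split dim of the 3d dot, T: K tiles
kt = K // T
x = rng.standard_normal((S, M, K)).astype(np.float32)
y = rng.standard_normal((S, K, N)).astype(np.float32)

def dot3d(t):
    """Batched matmul of one K tile: (S, M, kt) x (S, kt, N) -> (S, M, N)."""
    return np.einsum("smk,skn->smn",
                     x[:, :, t * kt:(t + 1) * kt],
                     y[:, t * kt:(t + 1) * kt, :])

# Before: reduce the split dim inside every loop iteration
acc2d = np.zeros((M, N), dtype=np.float32)
for t in range(T):
    acc2d += dot3d(t).sum(axis=0)

# After: keep a 3d accumulator and reduce once, after the loop
acc3d = np.zeros((S, M, N), dtype=np.float32)
for t in range(T):
    acc3d += dot3d(t)
acc2d_hoisted = acc3d.sum(axis=0)

print(np.allclose(acc2d, acc2d_hoisted, atol=1e-4))  # True
```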