[WIP] Support small dots and optimization of dot operands
This PR:
- Fixes several issues in the FMA dot implementation
- Enables small dots with M, N, and K dimensions down to 1
- Adds a dot operand optimization for dots with small M (<= 8) and large N and K dimensions
- Generates v_dot2/v_dot4 instructions in the AMD backend

These changes are needed to reduce granularity loss on dots with a small M or N dimension, such as (1x256)x(256x64).
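
For reference, the arithmetic behind a small dot can be sketched in plain Python. This is only an illustration and not the code in this PR: `fma_dot` and `dot4_dot` are hypothetical helper names, the input values are made up, and the dot4-style variant assumes only that such an instruction multiplies four element pairs and adds the sum to an accumulator. It ignores operand packing, data types, and saturation.

```python
def fma_dot(a, b):
    """Reference (M x K) x (K x N) product using one multiply-add per K element."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc = a[i][p] * b[p][j] + acc  # one FMA-style step
            c[i][j] = acc
    return c


def dot4_dot(a, b):
    """Same product, consuming K in groups of 4 (dot4-style accumulation)."""
    m, k, n = len(a), len(b), len(b[0])
    assert k % 4 == 0, "this sketch requires K to be a multiple of 4"
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(0, k, 4):
                # Four products summed into the accumulator in one step.
                acc += sum(a[i][p + q] * b[p + q][j] for q in range(4))
            c[i][j] = acc
    return c


# Shape from the description: (1x256) x (256x64), filled with small integers.
M, K, N = 1, 256, 64
a = [[(p % 7) - 3 for p in range(K)] for _ in range(M)]
b = [[((p + j) % 5) - 2 for j in range(N)] for p in range(K)]
c_fma = fma_dot(a, b)
c_dot4 = dot4_dot(a, b)
```

With integer inputs both variants produce identical results, so a comparison like `c_fma == c_dot4` can serve as a quick sanity check when testing small shapes.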