zjing14
@bartekxk @tenpercent Could you approve the PR if you are OK with it?
@xiabo123 What GEMM case are you running?
@xiabo123 Do you mean PR #978 resolves your issue?
Could you post your steps and GEMM cases for reproduction?
@ThePerfectComputer Thanks for your interest in our Flash Attention. Our Flash Attention is implemented for MI100 and later DC GPUs. MI50, which lacks AMD matrix cores (mfma), cannot provide...
You may be interested in our Flash Attention work on Navi3x: https://github.com/ROCm/composable_kernel/discussions/1032
Could you do a performance check to make sure the new custom data type has no impact on the performance of fp8_gemm?
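For the measurement itself, a generic hipEvent-based timing loop is usually enough to compare throughput before and after the data-type change; the sketch below shows that pattern only (the kernel, grid/block sizes, and arguments are placeholders, not the actual fp8_gemm entry point or the repository's profiler).

```cpp
// Minimal hipEvent timing sketch for a before/after performance comparison.
// The kernel and its launch configuration are placeholders for the GEMM under test.
#include <hip/hip_runtime.h>

template <typename Kernel, typename... Args>
float time_kernel_ms(Kernel k, dim3 grid, dim3 block, int warmup, int iters, Args... args)
{
    // Warm-up launches so the measurement excludes one-time startup costs.
    for(int i = 0; i < warmup; ++i)
        hipLaunchKernelGGL(k, grid, block, 0, 0, args...);

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    hipEventRecord(start, 0);
    for(int i = 0; i < iters; ++i)
        hipLaunchKernelGGL(k, grid, block, 0, 0, args...);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);

    float ms = 0.f;
    hipEventElapsedTime(&ms, start, stop);
    hipEventDestroy(start);
    hipEventDestroy(stop);
    return ms / iters; // average time per launch in milliseconds
}
```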
HIP does not support sub-byte data types. Are you using int4x2?
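If int4x2 is in play, the usual workaround is to pack two 4-bit values into one byte and unpack them in the kernel; the sketch below shows that packing pattern in plain C++ (illustrative only, not CK's own int4x2 type).

```cpp
#include <cstdint>

// Pack two signed 4-bit values (each in [-8, 7]) into one byte:
// low nibble holds the first element, high nibble the second.
inline uint8_t pack_int4x2(int8_t lo, int8_t hi)
{
    return static_cast<uint8_t>((lo & 0x0F) | ((hi & 0x0F) << 4));
}

// Unpack with sign extension by shifting the nibble into the top bits first,
// then arithmetic-shifting it back down.
inline void unpack_int4x2(uint8_t packed, int8_t& lo, int8_t& hi)
{
    lo = static_cast<int8_t>(packed << 4) >> 4;
    hi = static_cast<int8_t>(packed) >> 4;
}
```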
Different mfma instructions have different register input/output layouts. You can refer to this: https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator
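As a rough illustration of what the calculator reports: for v_mfma_f32_16x16x16f16, each lane of a 64-lane wavefront supplies 4 fp16 A values and 4 fp16 B values and gets back 4 fp32 accumulator values. The HIP sketch below only shows those per-lane register counts; the indexing is illustrative, and the real lane-to-element mapping is what the calculator gives you.

```cpp
// Minimal HIP sketch of one MFMA call (assumes an MFMA-capable GPU, e.g. MI100/MI200).
// The A/B/C indexing below is illustrative, not the actual lane-to-element mapping.
#include <hip/hip_runtime.h>

typedef _Float16 fp16x4 __attribute__((ext_vector_type(4)));
typedef float    fp32x4 __attribute__((ext_vector_type(4)));

__global__ void mfma_16x16x16_fp16(const _Float16* A, const _Float16* B, float* C)
{
    const int lane = threadIdx.x; // one 64-lane wavefront

    fp16x4 a, b;
    for(int i = 0; i < 4; ++i)
    {
        a[i] = A[lane * 4 + i]; // 4 fp16 A operands held per lane
        b[i] = B[lane * 4 + i]; // 4 fp16 B operands held per lane
    }

    fp32x4 c = {0.f, 0.f, 0.f, 0.f};
    c = __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, c, 0, 0, 0); // cbsz/abid/blgp = 0

    for(int i = 0; i < 4; ++i)
        C[lane * 4 + i] = c[i]; // 4 fp32 accumulator values come back per lane
}
```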
@hengyeliu You may refer to our MHA GEMM: https://github.com/ROCm/composable_kernel/blob/84832fc42d71e446fa2ddbf88b96fc2c05b21b49/include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp#L202 You need to rearrange the MHA output before writing it out to global memory.