MHA doesn't support the mfma_f32_16x16x16f16 instruction
Hello, I tried changing the mfma_f32_32x32x8f16 instruction to mfma_f32_16x16x16f16 in grouped_multihead_attention_forward_v2.cpp, but I get wrong results. Is there anything else that needs to be modified besides the GEMM parameters?
Different mfma instructions have different register input/output layouts. You can refer to this: https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator
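For illustration only (this is not code from CK), here is a minimal HIP sketch built directly on the MFMA compiler builtins. It shows that the two instructions consume similar A/B fragments but produce accumulators of different shapes (16 floats per lane for a 32x32 output tile vs. 4 floats per lane for a 16x16 tile), which is why the output-side code has to change along with the GEMM parameters. The kernel name and the linear per-lane stores are placeholders; the real lane-to-(row, col) mappings come from the calculator linked above.

```cpp
#include <hip/hip_runtime.h>

typedef _Float16 half4     __attribute__((ext_vector_type(4)));
typedef float    float4_t  __attribute__((ext_vector_type(4)));
typedef float    float16_t __attribute__((ext_vector_type(16)));

// Placeholder kernel: compares the accumulator shapes of the two MFMA builtins.
// Assumes a single 64-lane wavefront on MFMA-capable (CDNA-class) hardware.
__global__ void mfma_accumulator_shapes(const _Float16* a, const _Float16* b,
                                        float* d32, float* d16)
{
    const int lane = threadIdx.x;

    // Each lane loads a 4-element fragment of A and B (real A/B layouts omitted;
    // they also differ per instruction and are listed by the calculator).
    half4 a_frag = *reinterpret_cast<const half4*>(a + 4 * lane);
    half4 b_frag = *reinterpret_cast<const half4*>(b + 4 * lane);

    // mfma_f32_32x32x8f16: 32x32 output tile, 16 accumulator registers per lane.
    float16_t acc32 = {};
    acc32 = __builtin_amdgcn_mfma_f32_32x32x8f16(a_frag, b_frag, acc32, 0, 0, 0);

    // mfma_f32_16x16x16f16: 16x16 output tile, 4 accumulator registers per lane.
    float4_t acc16 = {};
    acc16 = __builtin_amdgcn_mfma_f32_16x16x16f16(a_frag, b_frag, acc16, 0, 0, 0);

    // Dumped linearly per lane for illustration only; a real kernel must map each
    // register and lane to its (row, col) position, which differs per instruction.
    for (int r = 0; r < 16; ++r) d32[16 * lane + r] = acc32[r];
    for (int r = 0; r < 4;  ++r) d16[4 * lane + r]  = acc16[r];
}
```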
Hi, do you mean I should modify the internal MHA code in CK if I use a different mfma instruction? Or could you provide a simple MHA example that uses mfma_f32_16x16x16f16?
@hengyeliu You may refer to our mha gemm: https://github.com/ROCm/composable_kernel/blob/84832fc42d71e446fa2ddbf88b96fc2c05b21b49/include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp#L202
You also need to transform the output of the MHA GEMM accordingly before writing it out to global memory.
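To make that concrete, below is a hedged sketch (again, not the CK implementation) of an epilogue for mfma_f32_16x16x16f16. The row/column formulas are the layout I'd expect the matrix instruction calculator to report for this instruction (each lane owning one column and four consecutive rows of the 16x16 D tile); please verify them with the tool for your target architecture before relying on them.

```cpp
#include <hip/hip_runtime.h>

typedef _Float16 half4    __attribute__((ext_vector_type(4)));
typedef float    float4_t __attribute__((ext_vector_type(4)));

// Placeholder kernel: writes one mfma_f32_16x16x16f16 accumulator tile to a
// row-major 16x16 buffer. Assumes one 64-lane wavefront; the lane->(row, col)
// mapping below is an assumption to be checked against the calculator.
__global__ void store_d_16x16x16(const _Float16* a, const _Float16* b,
                                 float* d, int ld)
{
    const int lane = threadIdx.x;

    // A/B fragment loads are schematic; their layouts are also instruction-specific.
    half4 a_frag = *reinterpret_cast<const half4*>(a + 4 * lane);
    half4 b_frag = *reinterpret_cast<const half4*>(b + 4 * lane);

    float4_t acc = {};
    acc = __builtin_amdgcn_mfma_f32_16x16x16f16(a_frag, b_frag, acc, 0, 0, 0);

    // Assumed D layout: column = lane % 16, rows = 4 * (lane / 16) + r, r = 0..3.
    const int col      = lane % 16;
    const int row_base = 4 * (lane / 16);
    for (int r = 0; r < 4; ++r)
        d[(row_base + r) * ld + col] = acc[r];
}
```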
@hengyeliu Has your issue been resolved? If so, please close the ticket. Thanks!
@hengyeliu Closing ticket. Please feel free to leave a comment if you need further assistance and we'll re-open the ticket. Thanks!