zjing14
@bartekxk @tenpercent Could you approve the PR if you are OK with it?
@xiabo123 What GEMM case are you running?
@xiabo123 Do you mean PR #978 resolves your issue?
Could you post your steps and GEMM cases for reproduction?
@ThePerfectComputer Thanks for your interest in our Flash Attention. Our Flash Attention is implemented for MI100 and later DC GPUs. MI50, which lacks AMD matrix cores (mfma), cannot provide...
You may be interested in our Flash Attention work on Navi3x: https://github.com/ROCm/composable_kernel/discussions/1032
Could you do a performance check to make sure the new custom data type has no impact on the performance of fp8_gemm?
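For the measurement itself, a generic hipEvent-based timing loop is usually enough to compare throughput before and after the data-type change; the sketch below shows that pattern only (the kernel, grid/block sizes, and arguments are placeholders, not the actual fp8_gemm entry point or the repository's profiler).

```cpp
// Minimal hipEvent timing sketch for a before/after performance comparison.
// The kernel and its launch configuration are placeholders for the GEMM under test.
#include <hip/hip_runtime.h>

template <typename Kernel, typename... Args>
float time_kernel_ms(Kernel k, dim3 grid, dim3 block, int warmup, int iters, Args... args)
{
    // Warm-up launches so the measurement excludes one-time startup costs.
    for(int i = 0; i < warmup; ++i)
        hipLaunchKernelGGL(k, grid, block, 0, 0, args...);

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    hipEventRecord(start, 0);
    for(int i = 0; i < iters; ++i)
        hipLaunchKernelGGL(k, grid, block, 0, 0, args...);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);

    float ms = 0.f;
    hipEventElapsedTime(&ms, start, stop);
    hipEventDestroy(start);
    hipEventDestroy(stop);
    return ms / iters; // average time per launch in milliseconds
}
```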
HIP does not support sub-byte data types. Are you using int4x2?
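If int4x2 is in play, the usual workaround is to pack two 4-bit values into one byte and unpack them in the kernel; the sketch below shows that packing pattern in plain C++ (illustrative only, not CK's own int4x2 type).

```cpp
#include <cstdint>

// Pack two signed 4-bit values (each in [-8, 7]) into one byte:
// low nibble holds the first element, high nibble the second.
inline uint8_t pack_int4x2(int8_t lo, int8_t hi)
{
    return static_cast<uint8_t>((lo & 0x0F) | ((hi & 0x0F) << 4));
}

// Unpack with sign extension by shifting the nibble into the top bits first,
// then arithmetic-shifting it back down.
inline void unpack_int4x2(uint8_t packed, int8_t& lo, int8_t& hi)
{
    lo = static_cast<int8_t>(packed << 4) >> 4;
    hi = static_cast<int8_t>(packed) >> 4;
}
```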
Different mfma instructions have different register input/output layouts. You can refer to this: https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator
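As a rough illustration of what the calculator reports: for v_mfma_f32_16x16x16f16, each lane of a 64-lane wavefront supplies 4 fp16 A values and 4 fp16 B values and gets back 4 fp32 accumulator values. The HIP sketch below only shows those per-lane register counts; the indexing is illustrative, and the real lane-to-element mapping is what the calculator gives you.

```cpp
// Minimal HIP sketch of one MFMA call (assumes an MFMA-capable GPU, e.g. MI100/MI200).
// The A/B/C indexing below is illustrative, not the actual lane-to-element mapping.
#include <hip/hip_runtime.h>

typedef _Float16 fp16x4 __attribute__((ext_vector_type(4)));
typedef float    fp32x4 __attribute__((ext_vector_type(4)));

__global__ void mfma_16x16x16_fp16(const _Float16* A, const _Float16* B, float* C)
{
    const int lane = threadIdx.x; // one 64-lane wavefront

    fp16x4 a, b;
    for(int i = 0; i < 4; ++i)
    {
        a[i] = A[lane * 4 + i]; // 4 fp16 A operands held per lane
        b[i] = B[lane * 4 + i]; // 4 fp16 B operands held per lane
    }

    fp32x4 c = {0.f, 0.f, 0.f, 0.f};
    c = __builtin_amdgcn_mfma_f32_16x16x16f16(a, b, c, 0, 0, 0); // cbsz/abid/blgp = 0

    for(int i = 0; i < 4; ++i)
        C[lane * 4 + i] = c[i]; // 4 fp32 accumulator values come back per lane
}
```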
@hengyeliu You may refer to our MHA GEMM: https://github.com/ROCm/composable_kernel/blob/84832fc42d71e446fa2ddbf88b96fc2c05b21b49/include/ck/tensor_operation/gpu/warp/xdlops_gemm.hpp#L202 You need to rearrange the MHA output before writing it out to global memory.