composable_kernel
composable_kernel copied to clipboard
Kernels with LDS bank conflicts
Our kernel sees partial 2-way bank conflict for K-contiguous matrices, and full 2-way/4-way conflict for MN-contiguous matrices (dependent on tile sizes). Profiling has shown partial 2-way bank conflict for K-contiguous matrices doesn't result in observable impact on LDS issue latency. Therefore, our focus is to resolve bank conflict for MN-contiguous matrices.
The following LDS layout is chosen to avoid LDS write bank conflict for MN-contiguous matrices:
- K0_M_1/K0_N_1 (fp32)
- K0_M_2/K0_2_N (bf16/fp16)
- K0_M_4/K0_N_4 (int8)
Conflict-free kernels are already added to C-shuffle device GEMM instances defined in library/src/tensor_operation_instance/gpu/gemm/device_gemm_xdl_c_shuffle_(type)_(layout)_instance.cpp. GridwiseGemm_k0mk1_k0nk1_mn_xdl_cshuffle_v1
(the underlying implementation of C-shuffle device GEMM) supports different K1 value for each individual A/B matrix to facilitate preferred conflict-free LDS layout for each A/B tile.
Other GEMM/Conv device kernels that do not yet facilitate the latest C-shuffle gridwise GEMM implementation still observe bank conflicts.
GEMM
- Batched GEMM
- Group GEMM
- Split-K GEMM
- Non-C-shuffle ordinary GEMM
DeviceGemmXdl
Conv
- Backward data (conflict in weight)
- Backward weight (conflict in both output gradient and activation)
- Forward prop maps to K-contiguous implicit GEMM layout so there's no observable bank conflict.
@rosenrodt how can I check whether it is partial 2-way or full 2-way/4-way conflict? Thanks!
@rosenrodt Apologies for the lack of response. Can you please if this is still an issue on the latest ROCm 6.2? If not, please close the ticket. Thanks!