composable_kernel icon indicating copy to clipboard operation
composable_kernel copied to clipboard

Kernels with LDS bank conflicts

Open rosenrodt opened this issue 2 years ago • 2 comments

Our kernel sees partial 2-way bank conflict for K-contiguous matrices, and full 2-way/4-way conflict for MN-contiguous matrices (dependent on tile sizes). Profiling has shown partial 2-way bank conflict for K-contiguous matrices doesn't result in observable impact on LDS issue latency. Therefore, our focus is to resolve bank conflict for MN-contiguous matrices.

The following LDS layout is chosen to avoid LDS write bank conflict for MN-contiguous matrices:

  • K0_M_1/K0_N_1 (fp32)
  • K0_M_2/K0_2_N (bf16/fp16)
  • K0_M_4/K0_N_4 (int8)

Conflict-free kernels are already added to C-shuffle device GEMM instances defined in library/src/tensor_operation_instance/gpu/gemm/device_gemm_xdl_c_shuffle_(type)_(layout)_instance.cpp. GridwiseGemm_k0mk1_k0nk1_mn_xdl_cshuffle_v1 (the underlying implementation of C-shuffle device GEMM) supports different K1 value for each individual A/B matrix to facilitate preferred conflict-free LDS layout for each A/B tile.

Other GEMM/Conv device kernels that do not yet facilitate the latest C-shuffle gridwise GEMM implementation still observe bank conflicts.

GEMM

  • Batched GEMM
  • Group GEMM
  • Split-K GEMM
  • Non-C-shuffle ordinary GEMM DeviceGemmXdl

Conv

  • Backward data (conflict in weight)
  • Backward weight (conflict in both output gradient and activation)
  • Forward prop maps to K-contiguous implicit GEMM layout so there's no observable bank conflict.

rosenrodt avatar Apr 07 '22 00:04 rosenrodt

@rosenrodt how can I check whether it is partial 2-way or full 2-way/4-way conflict? Thanks!

weixingzhang avatar Apr 13 '22 16:04 weixingzhang

@rosenrodt Apologies for the lack of response. Can you please if this is still an issue on the latest ROCm 6.2? If not, please close the ticket. Thanks!

ppanchad-amd avatar Aug 20 '24 16:08 ppanchad-amd