SGEMM_CUDA
Solve bank conflict
As I understand it, when data is loaded from global memory into shared memory (i.e. when shared memory is written) with vectorized access, the transposition can make threads within a warp write to the same column of shared memory.
For example, thread 0 reads `A[0][0]` to `A[0][3]` and thread 1 reads `A[0][4]` to `A[0][7]`. So thread 0 writes `As[0][0]` to `As[3][0]`, and thread 1 writes `As[4][0]` to `As[7][0]`. For an `As` tile of size `BM(=128) * BK(=8)`, `As[0][0]` and `As[4][0]` are 4 * 128 = 512 floats apart when the transposed tile has a row stride of `BM`. Since 512 is a multiple of 32, both land on the same bank, which causes a bank conflict.
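To make the example above easier to check, here is a minimal host-side sketch that computes which bank each lane of a warp hits on its transposed store into `As`. It is not code from the repository. It assumes 32 banks of 4-byte words, one `float4` load per thread, and the thread-to-element mapping from my example (thread 0 takes `A[0][0..3]`, thread 1 takes `A[0][4..7]`). The function names and the padded stride of 132 are hypothetical and only there for comparison.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;  // threads per warp
constexpr int kNumBanks = 32;  // assumed: 32 shared memory banks, 4 bytes wide
constexpr int kBK = 8;         // BK from the example above

// Bank hit by thread `lane` on its k-th transposed store (k in 0..3).
// `row_stride` is the row length, in floats, of the transposed As tile.
int as_store_bank(int lane, int k, int row_stride) {
  assert(k >= 0 && k < 4 && row_stride > 0);
  const int inner_row = lane / (kBK / 4);  // row of A this thread loads
  const int inner_col = lane % (kBK / 4);  // which float4 within that row
  const int offset = (inner_col * 4 + k) * row_stride + inner_row;
  return offset % kNumBanks;
}

// Largest number of lanes in one warp that hit the same bank during the
// first store (k = 0). Every lane writes a distinct word in this layout,
// so any count above 1 is a real conflict, not a same-address access.
int max_conflict_degree(int row_stride) {
  std::array<int, kNumBanks> hits{};
  int worst = 0;
  for (int lane = 0; lane < kWarpSize; ++lane) {
    const int bank = as_store_bank(lane, 0, row_stride);
    worst = std::max(worst, ++hits[bank]);
  }
  return worst;
}
```

With `row_stride = BM = 128`, lanes 0 and 1 both map to bank 0, and `max_conflict_degree(128)` returns 2, i.e. a 2-way conflict on the write to `As`. With the hypothetical padded stride of 132, the two column groups land on banks 0..15 and 16..31, and the function returns 1.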
So I think bank conflicts will only occur when writing `As`, not `Bs`. But in kernels v7 and v8, it looks like you are trying to optimize the writes to `Bs`:
https://github.com/siboehm/SGEMM_CUDA/blob/60cba6f9b20a198116c76f18de8047f44df8c8b8/src/kernels/8_kernel_bank_extra_col.cuh#L56-L60
Did I misunderstand something?