SGEMM_CUDA
SGEMM_CUDA copied to clipboard
Issues with row-major vs col-major in the repo and blog.
The naive kernel uses x for the x dimension and y for the y dimension.
https://github.com/siboehm/SGEMM_CUDA/blob/60cba6f9b20a198116c76f18de8047f44df8c8b8/src/kernels/1_naive.cuh#L17-L18
Looking at the CUDA programming guide (e.g., here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#shared-memory) shows that x thus corresponds to the col and y to the row. This is also what the figures in the blogpost use.
The kernel itself, however, then proceeds to assume column-major formatting: https://github.com/siboehm/SGEMM_CUDA/blob/60cba6f9b20a198116c76f18de8047f44df8c8b8/src/kernels/1_naive.cuh#L20-L28
C[4, 0], (x = 4, y = 0, first row, fifth column, 10x10 matrix so N = 10), would be the fifth element in linear memory for row-major formatting, but C[x * N + y = 40] makes it the 41st element, which corresponds to column-major formatting. The programming guide assumes row-major formatting in the kernel examples.
Why the choice to make the kernels in this repo use column-major formatting? Perhaps because cuBLAS also uses column-major instead of row-major storage?
This also causes the if statement for the tile quantization to be wrong I think:
https://github.com/siboehm/SGEMM_CUDA/blob/60cba6f9b20a198116c76f18de8047f44df8c8b8/src/kernels/1_naive.cuh#L10-L11
https://github.com/siboehm/SGEMM_CUDA/blob/60cba6f9b20a198116c76f18de8047f44df8c8b8/src/kernels/1_naive.cuh#L21
It checks whether x (which is the column index) is smaller than M (which is the number of rows of C). Similarly y < N, which compares the row index to the number of columns...
The figure just before the introduction of kernel 2 on the blog also seems wrong to me. It states that "A, B, C, are stored in row-major order. This means that the last index (here y) is the one that iterates continuous through memory (=has stride 1)."
An L x R two dimensional matrix indexed by (x, y) uses x for columns and y for rows. In row-major format the x then iterates continuously while the y only changes after each full row. So x has a stride of 1 while y has a stride of R if I'm not mistaken.
Why the choice to make the kernels in this repo use column-major formatting? Perhaps because cuBLAS also uses column-major instead of row-major storage?
@KillianMelsen See https://github.com/siboehm/SGEMM_CUDA/issues/7#issuecomment-2057023404.
Essentially writing it in this way was to motivate kernel 2 demonstrating global memory coalescing.
Why the choice to make the kernels in this repo use column-major formatting? Perhaps because cuBLAS also uses column-major instead of row-major storage?
@KillianMelsen See #7 (comment).
Essentially writing it in this way was to motivate the kernel 2 demonstrating global memory coaleseing.
Ah okay that makes sense indeed. Thanks!