SGEMM_CUDA Issues with row-major vs col-major in the repo and blog.

trafficstars

The naive kernel uses x for the x dimension and y for the y dimension. https://github.com/siboehm/SGEMM_CUDA/blob/60cba6f9b20a198116c76f18de8047f44df8c8b8/src/kernels/1_naive.cuh#L17-L18 Looking at the CUDA programming guide (e.g., here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#shared-memory) shows that x thus corresponds to the col and y to the row. This is also what the figures in the blogpost use.

The kernel itself, however, then proceeds to assume column-major formatting: https://github.com/siboehm/SGEMM_CUDA/blob/60cba6f9b20a198116c76f18de8047f44df8c8b8/src/kernels/1_naive.cuh#L20-L28

C[4, 0], (x = 4, y = 0, first row, fifth column, 10x10 matrix so N = 10), would be the fifth element in linear memory for row-major formatting, but C[x * N + y = 40] makes it the 41st element, which corresponds to column-major formatting. The programming guide assumes row-major formatting in the kernel examples.

Why the choice to make the kernels in this repo use column-major formatting? Perhaps because cuBLAS also uses column-major instead of row-major storage?

This also causes the if statement for the tile quantization to be wrong I think: https://github.com/siboehm/SGEMM_CUDA/blob/60cba6f9b20a198116c76f18de8047f44df8c8b8/src/kernels/1_naive.cuh#L10-L11 https://github.com/siboehm/SGEMM_CUDA/blob/60cba6f9b20a198116c76f18de8047f44df8c8b8/src/kernels/1_naive.cuh#L21 It checks whether x (which is the column index) is smaller than M (which is the number of rows of C). Similarly y < N, which compares the row index to the number of columns...

The figure just before the introduction of kernel 2 on the blog also seems wrong to me. It states that "A, B, C, are stored in row-major order. This means that the last index (here y) is the one that iterates continuous through memory (=has stride 1)."

An L x R two dimensional matrix indexed by (x, y) uses x for columns and y for rows. In row-major format the x then iterates continuously while the y only changes after each full row. So x has a stride of 1 while y has a stride of R if I'm not mistaken.

Mar 07 '25 10:03 KillianMelsen

Why the choice to make the kernels in this repo use column-major formatting? Perhaps because cuBLAS also uses column-major instead of row-major storage?

@KillianMelsen See https://github.com/siboehm/SGEMM_CUDA/issues/7#issuecomment-2057023404.

Essentially writing it in this way was to motivate kernel 2 demonstrating global memory coalescing.

Mar 14 '25 10:03 cudawarped

Why the choice to make the kernels in this repo use column-major formatting? Perhaps because cuBLAS also uses column-major instead of row-major storage?

@KillianMelsen See #7 (comment).

Essentially writing it in this way was to motivate the kernel 2 demonstrating global memory coaleseing.

Ah okay that makes sense indeed. Thanks!

Mar 14 '25 12:03 KillianMelsen

SGEMM_CUDA SGEMM_CUDA copied to clipboard

Issues with row-major vs col-major in the repo and blog.

SGEMM_CUDA
SGEMM_CUDA copied to clipboard