Wrong parameter configuration
Line 376 in the file "cudaTensorCoreGemm.cu", "float *tile_ptr = shmem_warp_tile_ptr + i * SHMEM_STRIDE * K + j * N;", should be modified to "float *tile_ptr = shmem_warp_tile_ptr + i * SHMEM_STRIDE * M + j * N;". The same applies to the tf32 and double-precision versions. When streaming the result matrix from the fragments to shared memory, the K dimension is not involved at all.
Hi @Zeyu-W, thanks for reporting this issue. I agree that multiplying by K here is incorrect. But SHMEM_STRIDE is already defined as N * BLOCK_ROW_TILES, so I think also multiplying by M would result in an incorrect stride. Should it not just be
float *tile_ptr = shmem_warp_tile_ptr + i * SHMEM_STRIDE + j * N;
?
Oh my god! I didn't expect you actually to reply to me! Such an honor!
But I'm sorry, I don't think so. The pointer "shmem_warp_tile_ptr" indicates the starting address of the 8 tiles each warp has to copy, and a warp copies only one tile at a time, so "tile_ptr" must point at the starting address of the tile currently being copied; each tile is 16×16. As you mentioned, SHMEM_STRIDE is defined as N * BLOCK_ROW_TILES, but that is only the leading dimension of the 128×128 sub-matrix in shared memory. When locating each warp's current tile relative to "shmem_warp_tile_ptr", we need to jump by whole tile rows (M element rows at a time), not by single element rows, so I still believe that
float *tile_ptr = shmem_warp_tile_ptr + i * SHMEM_STRIDE * M + j * N;
is the correct answer.
Thanks for your reply - let me take another look at it. This isn't originally my code, so I may have misread it when I first looked at it.