Wrong parameter configuration
Line 376 in the file "cudaTensorCoreGemm.cu", "float *tile_ptr = shmem_warp_tile_ptr + i * SHMEM_STRIDE * K + j * N;", should be modified to "float *tile_ptr = shmem_warp_tile_ptr + i * SHMEM_STRIDE * M + j * N;". The same applies to the tf32 and double-precision versions. When streaming the result matrix from the fragments to shared memory, the K dimension is not involved at all.
Hi @Zeyu-W, thanks for reporting this issue. I agree that multiplying by K here is incorrect. But SHMEM_STRIDE is already defined as N * BLOCK_ROW_TILES, so I think also multiplying by M would result in an incorrect stride. Should it not just be
float *tile_ptr = shmem_warp_tile_ptr + i * SHMEM_STRIDE + j * N;
?
Oh my god! I didn't expect you actually to reply to me! Such an honor!
But I'm sorry, I don't think so. The pointer "shmem_warp_tile_ptr" indicates the starting address of the 8 tiles each warp has to copy, and a warp copies only one tile at a time, so "tile_ptr" must point at the starting address of the tile currently being copied; each tile is 16×16. As you mentioned, SHMEM_STRIDE is defined as N * BLOCK_ROW_TILES, but that is only the leading dimension of the 128×128 sub-matrix in shared memory. When locating each warp's current tile relative to "shmem_warp_tile_ptr", we need to jump by whole tile rows (M element rows at a time), not by single element rows, so I still believe that
float *tile_ptr = shmem_warp_tile_ptr + i * SHMEM_STRIDE * M + j * N;
is the correct answer.
Thanks for your reply - let me take another look at it. This isn't originally my code, so I may have misread it when I first looked at it.