[BUG] CUDA Error CUresult.CUDA_ERROR_ILLEGAL_ADDRESS when using cutlass_tensorop_s1688tf32gemm op
Describe the bug
I am trying to perform a GEMM between two FP32 arrays using the Python API to produce an FP32 output, and I would like to leverage tensor cores for this operation.
I modified the Python example (example 40) to use cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1, but it fails with RuntimeError: CUDA Error CUresult.CUDA_ERROR_ILLEGAL_ADDRESS. As mentioned here, that kernel should "Use 1xTF32. FP32 in, converted to one TF32 internally, accumulated in FP32, FP32 out." However, my modified example works correctly with the SIMT SGEMM kernel cutlass_simt_sgemm_128x128_8x4_nt_align1.
My modified script: https://gist.github.com/rkindi/ac9396c23ad9912879e772948b44f075
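To make sure I am describing the intended configuration, here is a minimal standalone C++ sketch (my own, not part of the script; the layouts, leading-dimension handling, and the run_tf32_gemm helper are illustrative assumptions) of the same 1xTF32 tensor-op GEMM expressed through the device-level cutlass::gemm::device::Gemm API, mirroring the template parameters of the generated kernel shown further below:

// Minimal sketch (not my actual script): the same 1xTF32 tensor-op
// configuration expressed through the C++ device-level API, mirroring the
// generated kernel (128x128x16 threadblock, 64x64x16 warp, 16x8x8 MMA,
// 3 stages, alignment 1, OpMultiplyAdd). Layouts and the helper signature
// are assumptions for illustration only.
#include "cutlass/cutlass.h"
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/threadblock/threadblock_swizzle.h"

using GemmTf32 = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,                               // A
    float, cutlass::layout::RowMajor,                                  // B
    float, cutlass::layout::RowMajor,                                  // C / D
    float,                                                             // accumulator
    cutlass::arch::OpClassTensorOp,                                    // tensor cores
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 16>,                            // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 16>,                              // warp tile
    cutlass::gemm::GemmShape<16, 8, 8>,                                // MMA instruction
    cutlass::epilogue::thread::LinearCombination<float, 1, float, float>,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
    3,                                                                 // pipeline stages
    1, 1,                                                              // alignment of A and B
    false,                                                             // no serial split-K
    cutlass::arch::OpMultiplyAdd>;                                     // 1xTF32

// FP32 in, converted to TF32 on the tensor cores, accumulated in FP32, FP32 out.
cutlass::Status run_tf32_gemm(int M, int N, int K,
                              float const *A, int lda, float const *B, int ldb,
                              float const *C, int ldc, float *D, int ldd,
                              float alpha, float beta) {
  GemmTf32 gemm_op;
  GemmTf32::Arguments args({M, N, K}, {A, lda}, {B, ldb}, {C, ldc}, {D, ldd},
                           {alpha, beta});
  return gemm_op(args);  // launches on the default stream
}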
I also printed out the emitted kernels:
cutlass_simt_sgemm_128x128_8x4_nt_align1 (works correctly)
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/arch/arch.h"
#include "cutlass/arch/mma.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/default_gemm_universal.h"
// Gemm operator cutlass_simt_sgemm_128x128_8x4_nt_align1
using cutlass_simt_sgemm_128x128_8x4_nt_align1_base =
  typename cutlass::gemm::kernel::DefaultGemmUniversal<
    float, cutlass::layout::ColumnMajor, cutlass::ComplexTransform::kNone, 1,    // transposed B operand
    float, cutlass::layout::RowMajor, cutlass::ComplexTransform::kNone, 1,       // transposed A operand
    float, cutlass::layout::RowMajor,
    float,
    cutlass::arch::OpClassSimt,
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 8>,
    cutlass::gemm::GemmShape<32, 64, 8>,
    cutlass::gemm::GemmShape<1, 1, 1>,
    cutlass::epilogue::thread::LinearCombination<
      float,
      1,
      float,
      float
    >,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
    4,
    cutlass::arch::OpMultiplyAdd
  >::GemmKernel;

// Define named type
struct cutlass_simt_sgemm_128x128_8x4_nt_align1_type :
  public cutlass_simt_sgemm_128x128_8x4_nt_align1_base { };

extern "C"
__global__ void
cutlass_simt_sgemm_128x128_8x4_nt_align1(cutlass_simt_sgemm_128x128_8x4_nt_align1_type::Params params) {
  // Dynamic shared memory base pointer
  extern __shared__ int SharedStorageBase[];

  // Declare pointer to dynamic shared memory.
  cutlass_simt_sgemm_128x128_8x4_nt_align1_type::SharedStorage *shared_storage =
      reinterpret_cast<cutlass_simt_sgemm_128x128_8x4_nt_align1_type::SharedStorage *>(SharedStorageBase);

  cutlass_simt_sgemm_128x128_8x4_nt_align1_type op;

  op(params, *shared_storage);
}
cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1 (illegal address error)
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/arch/arch.h"
#include "cutlass/arch/mma.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/default_gemm_universal.h"
// Gemm operator cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1
using cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_base =
  typename cutlass::gemm::kernel::DefaultGemmUniversal<
    float, cutlass::layout::ColumnMajor, cutlass::ComplexTransform::kNone, 1,    // transposed B operand
    float, cutlass::layout::RowMajor, cutlass::ComplexTransform::kNone, 1,       // transposed A operand
    float, cutlass::layout::RowMajor,
    float,
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 16>,
    cutlass::gemm::GemmShape<64, 64, 16>,
    cutlass::gemm::GemmShape<16, 8, 8>,
    cutlass::epilogue::thread::LinearCombination<
      float,
      1,
      float,
      float
    >,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
    3,
    cutlass::arch::OpMultiplyAdd
  >::GemmKernel;

// Define named type
struct cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_type :
  public cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_base { };

extern "C"
__global__ void
cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1(cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_type::Params params) {
  // Dynamic shared memory base pointer
  extern __shared__ int SharedStorageBase[];

  // Declare pointer to dynamic shared memory.
  cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_type::SharedStorage *shared_storage =
      reinterpret_cast<cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_type::SharedStorage *>(SharedStorageBase);

  cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_type op;

  op(params, *shared_storage);
}
Steps/Code to reproduce bug
I ran the script with CUDA_LAUNCH_BLOCKING=1 python3 test-cutlass-py.py. When running with cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1, we get the following output:
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7f7c33400000
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7f7c33410000
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7f7c33420000
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7f7c33430000
Traceback (most recent call last):
File "examples/40_cutlass_py/test-cutlass-py.py", line 159, in <module>
raise RuntimeError('CUDA Error %s' % str(err))
RuntimeError: CUDA Error CUresult.CUDA_ERROR_ILLEGAL_ADDRESS
Expected behavior
No illegal address error should occur, and the printed delta should be close to zero. The script works correctly when run with cutlass_simt_sgemm_128x128_8x4_nt_align1:
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7ff5cb400000
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7ff5cb410000
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7ff5cb420000
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7ff5cb430000
Array A: [[ 1. -2. 0. ... -2. -4. -1.]
[ 2. -3. -1. ... 0. -3. -3.]
[ 2. 1. -1. ... -1. 0. -2.]
...
[-2. 1. -3. ... -1. -4. -4.]
[ 1. -4. -4. ... 0. -2. -3.]
[-1. -2. -1. ... -3. 2. -2.]]
Array B: [[-2. -1. -3. ... -3. -3. -4.]
[-1. 1. 2. ... -4. 1. 2.]
[ 2. 0. 2. ... 1. -4. 2.]
...
[ 3. 2. 1. ... 1. 2. -3.]
[-4. 3. 0. ... 1. -3. 2.]
[ 0. -4. -2. ... 0. -3. -4.]]
Array C: [[-4. 2. -1. ... 2. -1. -1.]
[-1. -4. 1. ... -1. 1. -2.]
[ 0. -3. 1. ... 1. -4. -1.]
...
[ 3. -1. 0. ... 0. -4. 3.]
[-2. 1. -4. ... 3. -2. -4.]
[ 2. 3. -3. ... 1. -4. 3.]]
Array D: [[ 44. 0. 38. ... -15. -44. -55.]
[ -46. -26. 162. ... 1. 6. -59.]
[ 12. 63. 80. ... 38. 65. -14.]
...
[ -5. -10. 22. ... 67. 51. 43.]
[ 109. -56. -27. ... 79. 28. 17.]
[-114. 144. 76. ... 42. 102. -54.]]
Reference: [[ 44. 0. 38. ... -15. -44. -55.]
[ -46. -26. 162. ... 1. 6. -59.]
[ 12. 63. 80. ... 38. 65. -14.]
...
[ -5. -10. 22. ... 67. 51. 43.]
[ 109. -56. -27. ... 79. 28. 17.]
[-114. 144. 76. ... 42. 102. -54.]]
Delta: [[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
Environment details:
- Environment location: Docker, AWS, SM80
Additional context
Upon trying some other problem sizes, I found that the SIMT kernel also fails in many cases (https://gist.github.com/rkindi/9a25a6d1cbcb5a96167f38ed2fc6b3cc); it does not fail with an illegal address error, but it produces incorrect output.
In the linked gist I tried several problem shapes and observed the following (a standalone C++ sweep sketch follows the list):
# Works
# M, N, K = (8, 8, 8)
# M, N, K = (8, 32, 16)
# M, N, K = (16, 32, 16)
# M, N, K = (16, 128, 16)
# M, N, K = (16, 1024, 16)
# M, N, K = (32, 1024, 16)
# Does not work
# M, N, K = (32, 32, 32)
# M, N, K = (64, 1024, 16) - upper half of result is correct, but lower half is wrong
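To check whether this shape dependence is specific to the Python launch path, here is a standalone C++ sweep sketch (my own diagnostic, not part of the gist; the data initialization, layouts, and leading dimensions are illustrative assumptions) that runs the listed shapes through the device-level SIMT SGEMM and compares against a plain host loop:

// Hypothetical standalone C++ sweep (my own sketch, not from the gist): run the
// same shapes through a device-level SIMT SGEMM and compare against a plain
// host reference loop, to see whether the shape-dependent failures reproduce
// outside the Python launch path. Layouts and fill values are assumptions.
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include "cutlass/cutlass.h"
#include "cutlass/gemm/device/gemm.h"

// Column-major A, row-major B, row-major C/D, default SIMT FP32 configuration.
using SimtSgemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,
    float, cutlass::layout::RowMajor,
    float, cutlass::layout::RowMajor,
    float, cutlass::arch::OpClassSimt, cutlass::arch::Sm80>;

float max_abs_delta(int M, int N, int K) {
  // Small-integer operands so the FP32 result is exact and any mismatch is a bug.
  std::vector<float> A(size_t(M) * K), B(size_t(K) * N), C(size_t(M) * N), D(C.size());
  for (size_t i = 0; i < A.size(); ++i) A[i] = float(int(i % 7) - 3);
  for (size_t i = 0; i < B.size(); ++i) B[i] = float(int(i % 5) - 2);
  for (size_t i = 0; i < C.size(); ++i) C[i] = float(int(i % 3) - 1);

  float *dA, *dB, *dC, *dD;
  cudaMalloc(&dA, A.size() * sizeof(float));
  cudaMalloc(&dB, B.size() * sizeof(float));
  cudaMalloc(&dC, C.size() * sizeof(float));
  cudaMalloc(&dD, D.size() * sizeof(float));
  cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dC, C.data(), C.size() * sizeof(float), cudaMemcpyHostToDevice);

  // D = 1 * A @ B + 1 * C; lda = M (column-major A), ldb = ldc = ldd = N (row-major).
  SimtSgemm gemm_op;
  SimtSgemm::Arguments args({M, N, K}, {dA, M}, {dB, N}, {dC, N}, {dD, N}, {1.0f, 1.0f});
  cutlass::Status status = gemm_op(args);
  cudaDeviceSynchronize();
  cudaMemcpy(D.data(), dD, D.size() * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dD);
  if (status != cutlass::Status::kSuccess) return -1.0f;  // flag a launch failure

  // Plain host reference: ref[m][n] = sum_k A[m,k] * B[k,n] + C[m,n].
  float worst = 0.0f;
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      double acc = 0.0;
      for (int k = 0; k < K; ++k)
        acc += double(A[size_t(k) * M + m]) * double(B[size_t(k) * N + n]);
      acc += double(C[size_t(m) * N + n]);
      worst = std::fmax(worst, std::fabs(float(acc) - D[size_t(m) * N + n]));
    }
  }
  return worst;
}

int main() {
  int shapes[][3] = {{8, 8, 8},      {8, 32, 16},    {16, 32, 16}, {16, 128, 16},
                     {16, 1024, 16}, {32, 1024, 16}, {32, 32, 32}, {64, 1024, 16}};
  for (auto &s : shapes) {
    printf("M=%d N=%d K=%d  max |delta| = %f\n", s[0], s[1], s[2],
           max_abs_delta(s[0], s[1], s[2]));
  }
  return 0;
}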
CUTLASS 2.10 reimplemented pycutlass. Please give it a try.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
@rkindi, has your issue been solved?