[BUG] CUDA Error CUresult.CUDA_ERROR_ILLEGAL_ADDRESS when using cutlass_tensorop_s1688tf32gemm op
Describe the bug
I am trying to perform a GEMM between two FP32 arrays using the Python API to produce an FP32 output, and I would like to leverage tensor cores for this operation.
I modified the Python example (example 40) to use cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1, but it fails with RuntimeError: CUDA Error CUresult.CUDA_ERROR_ILLEGAL_ADDRESS. As mentioned here, that kernel should "Use 1xTF32. FP32 in, converted to one TF32 internally, accumulated in FP32, FP32 out." However, my modified example works correctly with the SIMT SGEMM kernel cutlass_simt_sgemm_128x128_8x4_nt_align1.
My modified script: https://gist.github.com/rkindi/ac9396c23ad9912879e772948b44f075
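To make sure I am describing the intended configuration, here is a minimal standalone C++ sketch (my own, not part of the script; the layouts, leading-dimension handling, and the run_tf32_gemm helper are illustrative assumptions) of the same 1xTF32 tensor-op GEMM expressed through the device-level cutlass::gemm::device::Gemm API, mirroring the template parameters of the generated kernel shown further below:

// Minimal sketch (not my actual script): the same 1xTF32 tensor-op
// configuration expressed through the C++ device-level API, mirroring the
// generated kernel (128x128x16 threadblock, 64x64x16 warp, 16x8x8 MMA,
// 3 stages, alignment 1, OpMultiplyAdd). Layouts and the helper signature
// are assumptions for illustration only.
#include "cutlass/cutlass.h"
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/threadblock/threadblock_swizzle.h"

using GemmTf32 = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,                               // A
    float, cutlass::layout::RowMajor,                                  // B
    float, cutlass::layout::RowMajor,                                  // C / D
    float,                                                             // accumulator
    cutlass::arch::OpClassTensorOp,                                    // tensor cores
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 16>,                            // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 16>,                              // warp tile
    cutlass::gemm::GemmShape<16, 8, 8>,                                // MMA instruction
    cutlass::epilogue::thread::LinearCombination<float, 1, float, float>,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
    3,                                                                 // pipeline stages
    1, 1,                                                              // alignment of A and B
    false,                                                             // no serial split-K
    cutlass::arch::OpMultiplyAdd>;                                     // 1xTF32

// FP32 in, converted to TF32 on the tensor cores, accumulated in FP32, FP32 out.
cutlass::Status run_tf32_gemm(int M, int N, int K,
                              float const *A, int lda, float const *B, int ldb,
                              float const *C, int ldc, float *D, int ldd,
                              float alpha, float beta) {
  GemmTf32 gemm_op;
  GemmTf32::Arguments args({M, N, K}, {A, lda}, {B, ldb}, {C, ldc}, {D, ldd},
                           {alpha, beta});
  return gemm_op(args);  // launches on the default stream
}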
I also printed out the emitted kernels:
cutlass_simt_sgemm_128x128_8x4_nt_align1 (works correctly)
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/arch/arch.h"
#include "cutlass/arch/mma.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/default_gemm_universal.h"
// Gemm operator cutlass_simt_sgemm_128x128_8x4_nt_align1
using cutlass_simt_sgemm_128x128_8x4_nt_align1_base =
  typename cutlass::gemm::kernel::DefaultGemmUniversal<
    float, cutlass::layout::ColumnMajor, cutlass::ComplexTransform::kNone, 1,    // transposed B operand
    float, cutlass::layout::RowMajor, cutlass::ComplexTransform::kNone, 1,       // transposed A operand
    float, cutlass::layout::RowMajor,
    float,
    cutlass::arch::OpClassSimt,
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 8>,
    cutlass::gemm::GemmShape<32, 64, 8>,
    cutlass::gemm::GemmShape<1, 1, 1>,
    cutlass::epilogue::thread::LinearCombination<
      float,
      1,
      float,
      float
    >,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
    4,
    cutlass::arch::OpMultiplyAdd
  >::GemmKernel;

// Define named type
struct cutlass_simt_sgemm_128x128_8x4_nt_align1_type :
  public cutlass_simt_sgemm_128x128_8x4_nt_align1_base { };

extern "C"
__global__ void
cutlass_simt_sgemm_128x128_8x4_nt_align1(cutlass_simt_sgemm_128x128_8x4_nt_align1_type::Params params) {
  // Dynamic shared memory base pointer
  extern __shared__ int SharedStorageBase[];

  // Declare pointer to dynamic shared memory.
  cutlass_simt_sgemm_128x128_8x4_nt_align1_type::SharedStorage *shared_storage =
      reinterpret_cast<cutlass_simt_sgemm_128x128_8x4_nt_align1_type::SharedStorage *>(SharedStorageBase);

  cutlass_simt_sgemm_128x128_8x4_nt_align1_type op;

  op(params, *shared_storage);
}
cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1 (illegal address error)
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/arch/arch.h"
#include "cutlass/arch/mma.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/default_gemm_universal.h"
// Gemm operator cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1
using cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_base =
  typename cutlass::gemm::kernel::DefaultGemmUniversal<
    float, cutlass::layout::ColumnMajor, cutlass::ComplexTransform::kNone, 1,    // transposed B operand
    float, cutlass::layout::RowMajor, cutlass::ComplexTransform::kNone, 1,       // transposed A operand
    float, cutlass::layout::RowMajor,
    float,
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 16>,
    cutlass::gemm::GemmShape<64, 64, 16>,
    cutlass::gemm::GemmShape<16, 8, 8>,
    cutlass::epilogue::thread::LinearCombination<
      float,
      1,
      float,
      float
    >,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<8>,
    3,
    cutlass::arch::OpMultiplyAdd
  >::GemmKernel;

// Define named type
struct cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_type :
  public cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_base { };

extern "C"
__global__ void
cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1(cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_type::Params params) {
  // Dynamic shared memory base pointer
  extern __shared__ int SharedStorageBase[];

  // Declare pointer to dynamic shared memory.
  cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_type::SharedStorage *shared_storage =
      reinterpret_cast<cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_type::SharedStorage *>(SharedStorageBase);

  cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1_type op;

  op(params, *shared_storage);
}
Steps/Code to reproduce bug
I ran the script with CUDA_LAUNCH_BLOCKING=1 python3 test-cutlass-py.py. When running with cutlass_tensorop_s1688tf32gemm_128x128_16x3_nt_align1, we get the following output:
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7f7c33400000
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7f7c33410000
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7f7c33420000
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7f7c33430000
Traceback (most recent call last):
File "examples/40_cutlass_py/test-cutlass-py.py", line 159, in <module>
raise RuntimeError('CUDA Error %s' % str(err))
RuntimeError: CUDA Error CUresult.CUDA_ERROR_ILLEGAL_ADDRESS
Expected behavior
No illegal address error should occur, and the printed delta should be close to zero. The script works correctly when run with cutlass_simt_sgemm_128x128_8x4_nt_align1:
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7ff5cb400000
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7ff5cb410000
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7ff5cb420000
Tensor has dimensions: 16384 (4 bytes)
updating tensor in device memory 0x7ff5cb430000
Array A: [[ 1. -2. 0. ... -2. -4. -1.]
[ 2. -3. -1. ... 0. -3. -3.]
[ 2. 1. -1. ... -1. 0. -2.]
...
[-2. 1. -3. ... -1. -4. -4.]
[ 1. -4. -4. ... 0. -2. -3.]
[-1. -2. -1. ... -3. 2. -2.]]
Array B: [[-2. -1. -3. ... -3. -3. -4.]
[-1. 1. 2. ... -4. 1. 2.]
[ 2. 0. 2. ... 1. -4. 2.]
...
[ 3. 2. 1. ... 1. 2. -3.]
[-4. 3. 0. ... 1. -3. 2.]
[ 0. -4. -2. ... 0. -3. -4.]]
Array C: [[-4. 2. -1. ... 2. -1. -1.]
[-1. -4. 1. ... -1. 1. -2.]
[ 0. -3. 1. ... 1. -4. -1.]
...
[ 3. -1. 0. ... 0. -4. 3.]
[-2. 1. -4. ... 3. -2. -4.]
[ 2. 3. -3. ... 1. -4. 3.]]
Array D: [[ 44. 0. 38. ... -15. -44. -55.]
[ -46. -26. 162. ... 1. 6. -59.]
[ 12. 63. 80. ... 38. 65. -14.]
...
[ -5. -10. 22. ... 67. 51. 43.]
[ 109. -56. -27. ... 79. 28. 17.]
[-114. 144. 76. ... 42. 102. -54.]]
Reference: [[ 44. 0. 38. ... -15. -44. -55.]
[ -46. -26. 162. ... 1. 6. -59.]
[ 12. 63. 80. ... 38. 65. -14.]
...
[ -5. -10. 22. ... 67. 51. 43.]
[ 109. -56. -27. ... 79. 28. 17.]
[-114. 144. 76. ... 42. 102. -54.]]
Delta: [[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
Environment details:
- Environment location: Docker, AWS, SM80
Additional context
Upon trying some other problem sizes, I found that the SIMT kernel also fails in many cases (https://gist.github.com/rkindi/9a25a6d1cbcb5a96167f38ed2fc6b3cc); it does not fail with an illegal address error, but it produces incorrect output.
In the linked gist I tried several problem shapes and observed the following (a standalone C++ sweep sketch follows the list):
# Works
# M, N, K = (8, 8, 8)
# M, N, K = (8, 32, 16)
# M, N, K = (16, 32, 16)
# M, N, K = (16, 128, 16)
# M, N, K = (16, 1024, 16)
# M, N, K = (32, 1024, 16)
# Does not work
# M, N, K = (32, 32, 32)
# M, N, K = (64, 1024, 16) - upper half of result is correct, but lower half is wrong
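To check whether this shape dependence is specific to the Python launch path, here is a standalone C++ sweep sketch (my own diagnostic, not part of the gist; the data initialization, layouts, and leading dimensions are illustrative assumptions) that runs the listed shapes through the device-level SIMT SGEMM and compares against a plain host loop:

// Hypothetical standalone C++ sweep (my own sketch, not from the gist): run the
// same shapes through a device-level SIMT SGEMM and compare against a plain
// host reference loop, to see whether the shape-dependent failures reproduce
// outside the Python launch path. Layouts and fill values are assumptions.
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>
#include "cutlass/cutlass.h"
#include "cutlass/gemm/device/gemm.h"

// Column-major A, row-major B, row-major C/D, default SIMT FP32 configuration.
using SimtSgemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,
    float, cutlass::layout::RowMajor,
    float, cutlass::layout::RowMajor,
    float, cutlass::arch::OpClassSimt, cutlass::arch::Sm80>;

float max_abs_delta(int M, int N, int K) {
  // Small-integer operands so the FP32 result is exact and any mismatch is a bug.
  std::vector<float> A(size_t(M) * K), B(size_t(K) * N), C(size_t(M) * N), D(C.size());
  for (size_t i = 0; i < A.size(); ++i) A[i] = float(int(i % 7) - 3);
  for (size_t i = 0; i < B.size(); ++i) B[i] = float(int(i % 5) - 2);
  for (size_t i = 0; i < C.size(); ++i) C[i] = float(int(i % 3) - 1);

  float *dA, *dB, *dC, *dD;
  cudaMalloc(&dA, A.size() * sizeof(float));
  cudaMalloc(&dB, B.size() * sizeof(float));
  cudaMalloc(&dC, C.size() * sizeof(float));
  cudaMalloc(&dD, D.size() * sizeof(float));
  cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dC, C.data(), C.size() * sizeof(float), cudaMemcpyHostToDevice);

  // D = 1 * A @ B + 1 * C; lda = M (column-major A), ldb = ldc = ldd = N (row-major).
  SimtSgemm gemm_op;
  SimtSgemm::Arguments args({M, N, K}, {dA, M}, {dB, N}, {dC, N}, {dD, N}, {1.0f, 1.0f});
  cutlass::Status status = gemm_op(args);
  cudaDeviceSynchronize();
  cudaMemcpy(D.data(), dD, D.size() * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dD);
  if (status != cutlass::Status::kSuccess) return -1.0f;  // flag a launch failure

  // Plain host reference: ref[m][n] = sum_k A[m,k] * B[k,n] + C[m,n].
  float worst = 0.0f;
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      double acc = 0.0;
      for (int k = 0; k < K; ++k)
        acc += double(A[size_t(k) * M + m]) * double(B[size_t(k) * N + n]);
      acc += double(C[size_t(m) * N + n]);
      worst = std::fmax(worst, std::fabs(float(acc) - D[size_t(m) * N + n]));
    }
  }
  return worst;
}

int main() {
  int shapes[][3] = {{8, 8, 8},      {8, 32, 16},    {16, 32, 16}, {16, 128, 16},
                     {16, 1024, 16}, {32, 1024, 16}, {32, 32, 32}, {64, 1024, 16}};
  for (auto &s : shapes) {
    printf("M=%d N=%d K=%d  max |delta| = %f\n", s[0], s[1], s[2],
           max_abs_delta(s[0], s[1], s[2]));
  }
  return 0;
}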
CUTLASS 2.10 reimplemented pycutlass. Please give it a try.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
@rkindi, has your issue been solved?