cutlass icon indicating copy to clipboard operation
cutlass copied to clipboard

[BUG] Memory corruption/undefined behavior on GemmUniversal in 3.4.0 - 3.6.0 🐛

Open warpuv opened this issue 2 months ago • 5 comments

Description of the bug:

Affected versions are 3.4.0 and 3.6.0 and in between.

When using example cutlass/examples/36_gather_scatter_fusion/gather_scatter_fusion.cu, and linking with some other code (attached to this report) I've got the error:

/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=507904
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=1048576, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(1048576)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(1048576)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=2097152
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=240, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(240)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(240)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=960
/workspace/src/cutlass/include/cutlass/gemm/kernel/params_universal_base.h:95  GemmUniversal::Arguments::Arguments() - problem_size: cutlass::gemm::GemmCoord {248,240,1024}
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:191  GemmUniversal::Arguments::Arguments() - problem_size: cutlass::gemm::GemmCoord {248,240,1024}
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:310  GemmUniversalBase::get_workspace_size()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132  GemmUniversalBase::init_device_props()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:190    device_ordinal: (0), device_sms: (108), sm_occupancy: (2) smem_size: (81920) GemmKernel::kThreadCount: (128)
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:321    workspace_bytes: 0
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=0
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:271  GemmUniversalBase::can_implement()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:329  GemmUniversalBase::get_grid_shape()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132  GemmUniversalBase::init_device_props()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:342    tiled_shape: cutlass::gemm::GemmCoord {2,2,1}
  grid_dims: {2, 2, 1}
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:368  GemmUniversal::can_implement()
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:438    returning kSuccess
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:406  GemmUniversalBase::initialize() - workspace 0, stream: null
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132  GemmUniversalBase::init_device_props()
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181  cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189  cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197  cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77  cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:434  GemmUniversalBase::run()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:444    grid: (2, 2, 1), block: (128, 1, 1), SMEM: (81920)
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:466    grid launch failed with error invalid argument
Got cutlass error: Error Internal at: 387

Steps to reproduce:

  1. I've taken the official example cutlass/examples/36_gather_scatter_fusion/gather_scatter_fusion.cu, this file remains unchanged.
  2. Just use attached gather_scatter_fusion1.cu or: 2.1) Made a copy of the example into gather_scatter_fusion1.cu. 2.2) Remove the main function and rename run function to run2 in gather_scatter_fusion1.cu. 2.3) Remove some code from the ending of the function run2 until status = gemm_op(); statement, not keeping it.
  3. Compile both .cu files into one executable & run.

IMPORTANT: gather_scatter_fusion1.cu MUST be the first in nvcc command to reproduce the error, otherwise everything works fine!

nvcc -O0 -arch=native -ccbin=clang --expt-relaxed-constexpr -I./cutlass/include -I./cutlass/tools/util/include gather_scatter_fusion1.cu gather_scatter_fusion.cu -o out
./out

Output result:

Got cutlass error: Error Internal at: 387

But expected:

Passed!
Runtime: 0.0388416 ms
 GFLOPs: 3138.31

Additional notes:

  1. IMPORTANT: Only in case when the template parameters of cutlass::gemm::device::GemmUniversal are exactly the same in both .cu files the error is occurring (that means the sass code exactly the same in both .o files).
  2. The error occurs on clang with -O2 or -O0 flags, and on gcc with -O0. (gcc with -O2 runs as expected at least with this version of code)
  3. None of the functions are called from gather_scatter_fusion1.cu during the test, existence of the function “run2” is enough to break the program.
  4. The last CUDA API call is the cudaLaunchKernel, in debugger the arguments to it looks reasonable.
  5. cudaGetLastError() returns cudaErrorInvalidValue
  6. I've found the problematic commit using git bisect, it is 8236f30675bbe98f81d11c05764b77bfcb25b8cc (this is release of 3.4.0 version),
  7. Since the source code of the individual commits of this huge PR is not available I cannot investigate the error further.

Environment:

GPU: A100 nvidia-smi: 470.161.03 CUDA Version: 11.4

gcc version: 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04) clang version: 18.1.3 (1ubuntu1)

Docker container: nvcr.io/nvidia/cuda:12.6.3-devel-ubuntu24.04

Also reproduced on: nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04 with corresponding default versions of tools/compilers.

gather_scatter_fusion1.cu.txt gather_scatter_fusion.cu.txt

cc: @IonThruster

warpuv avatar Dec 28 '24 17:12 warpuv