cutlass [BUG] Cutlass cannot be compiled with the `-G` flag for some of the later examples.

Describe the bug

I managed to compile example 12 without issue, but now that I am trying to compile examples 59 and 79 with the `-G` flag, either the compilation takes forever (for the former) or it outright crashes (for the latter). As I am writing this, example 59 just finished compiling and the resulting executable is 99 MB. Here is the error I get when I try to compile example 79c.

(main) mrakgr@Marko:~/cutlass_studies$ pwsh build_example.ps1 
Compiling 'cutlass/examples/79_blackwell_geforce_gemm/79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm.cu' into: bin/79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm
warning: Register name %rd131879 is too large, generated debug information may be inaccurate.

ptxas info    : (C7505) Potential Performance Loss: 'setmaxnreg' ignored to allow debugging.
Killed

The purpose of this exercise is to let me step through the examples and get a sense of how the library works. Over a year ago, I spent a couple of days just reading the code and couldn't get anywhere, as it's too large. And without a debugger, I cannot possibly figure out where all of those overloaded templates are pointing to.

Steps/Code to reproduce bug

Here is the PowerShell script I am using to compile the examples. I got the compiler arguments by running cmake and then extracting them from the generated makefile. Let me reprint it here:

# $example_file = "cutlass/examples/12_gemm_bias_relu/gemm_bias_relu.cu"
# $example_file = "cutlass/examples/59_ampere_gather_scatter_conv/ampere_gather_scatter_conv.cu"
$example_file = "cutlass/examples/79_blackwell_geforce_gemm/79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm.cu"
$output_file = "bin/$($example_file | Split-Path -LeafBase)"
$output_dir = $output_file | Split-Path -Parent
Write-Host "Compiling '$example_file' into: $output_file"
if (-not (Test-Path $output_dir)) {
    New-Item $output_dir -ItemType Directory
}
nvcc `
    -I"cutlass/include" `
    -I"cutlass/examples/common" `
    -I"cutlass/build/include" `
    -I"cutlass/tools/util/include" `
    -isystem /usr/local/cuda/include `
    -isystem /usr/local/cuda/include/cccl `
    -g -G `
    -arch=sm_120a `
    -std=c++17 `
    --expt-relaxed-constexpr `
    -DCUTLASS_VERSIONS_GENERATED `
    -DNDEBUG `
    -Xcompiler=-fPIE `
    -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 `
    -DCUTLASS_ENABLE_GDC_FOR_SM100=1 `
    -ftemplate-backtrace-limit=0 `
    -DCUTLASS_TEST_LEVEL=0 `
    -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 `
    -DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1 `
    -DCUTLASS_DEBUG_TRACE_LEVEL=0 `
    -Xcompiler=-Wconversion `
    -Xcompiler=-fno-strict-aliasing `
    -o $output_file `
    $example_file

if ($?) { # Runs the file if the compilation was successful.
    Write-Host "Done Compiling: $output_file"
    Write-Host "Running: $output_file" && . $output_file
}

This requires CUTLASS and PowerShell to be installed, which can be done by running the following commands in the repo directory.

bash install_powershell.sh
git clone https://github.com/NVIDIA/cutlass/

Expected behavior

The program should compile with debug info.

Environment details (please complete the following information):

Environment location: Local with CUDA 12.9 SDK installed. 32 GB RAM.

Jun 16 '25 15:06 mrakgr

Please try using CMAKE_BUILD_TYPE=RelWithDebInfo which will use -g -lineinfo options. CUTLASS kernels need to compile with many complex optimizations to make full use of the hardware and have not been tuned/fixed to work with a debug build.

With line info, you should at least be able to step through and see where the code is coming from to give an idea of where to look for more details.
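
As a rough sketch of that suggestion, the manual nvcc invocation from the script above could swap `-G` for `-lineinfo`. This assumes the same include paths and example file as the original script; `RUN_BUILD` is a hypothetical switch added here so the command is only printed by default, and the remaining `-D` and `-Xcompiler` flags from the script are left out for brevity.

```shell
# Sketch only: paths are taken from the PowerShell script in this thread.
example_file="cutlass/examples/79_blackwell_geforce_gemm/79c_blackwell_geforce_mixed_mxfp8_mxfp6_bf16_gemm.cu"
output_file="bin/$(basename "$example_file" .cu)"

# -g keeps host debug info; -lineinfo emits device line tables without
# disabling device optimizations the way -G does.
nvcc_args="-Icutlass/include -Icutlass/examples/common -Icutlass/build/include -Icutlass/tools/util/include -g -lineinfo -arch=sm_120a -std=c++17 --expt-relaxed-constexpr"

# Print the command so it can be inspected before running it.
echo "nvcc $nvcc_args -o $output_file $example_file"

# RUN_BUILD is a hypothetical opt-in switch; the build runs only if nvcc is on PATH.
if [ "${RUN_BUILD:-0}" = "1" ] && command -v nvcc >/dev/null 2>&1; then
  mkdir -p "$(dirname "$output_file")"
  # $nvcc_args is intentionally unquoted so it splits into separate flags.
  nvcc $nvcc_args -o "$output_file" "$example_file"
fi

# CMake route (assumes the usual CUTLASS_NVCC_ARCHS option from the CUTLASS quick start):
#   cmake -S cutlass -B cutlass/build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCUTLASS_NVCC_ARCHS=120a
```

Running with `RUN_BUILD=1` set performs the actual build. Note that line info maps instructions back to source lines for profilers and disassembly; it does not by itself guarantee working device breakpoints.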

Jun 16 '25 17:06 d-k-b

I didn't know about CMAKE_BUILD_TYPE=RelWithDebInfo, but I did try -g -lineinfo. The problem with -lineinfo is that breakpoints didn't work inside the device kernel, and neither did stepping through the code. Whether the optimizations work in a debug build isn't really important, but being able to step through the code is, as CUTLASS makes such heavy use of template overloading.

Jun 17 '25 07:06 mrakgr

Yes, this is a known issue. Over time the compiler team makes improvements, but getting full debug builds to work without compromising on optimizations is still a work in progress.

Jul 02 '25 16:07 d-k-b

@mrakgr The issue is fixed in 4.3.0 dev release.

Oct 22 '25 03:10 jwu1980

This is a pleasant surprise, thanks for taking care of it. I didn't think this would be worked on.

Oct 22 '25 07:10 mrakgr

> @mrakgr The issue is fixed in 4.3.0 dev release.

Hello, I have upgraded CUTLASS to version 4.3.1, but 72a_blackwell_nvfp4_bf16_gemm still cannot be compiled with the -G flag. Here are the details:

Use the following command(s):

nvcc -o 72a_blackwell_nvfp4_bf16_gemm 72a_blackwell_nvfp4_bf16_gemm.cu \
    -I/home/tmp/cutlass \
    -I/home/tmp/cutlass/include \
    -I/home/tmp/cutlass/examples/common \
    -I/home/tmp/cutlass/tools/util/include \
    --expt-relaxed-constexpr \
    -g -G \
    --keep --keep-dir=nvcc_keep1 \
    --verbose \
    -Xptxas -v \
    -arch=sm_100a \
    -Xcompiler "-v" 2>&1 | tee nvcc_build.log

The nvcc compilation gets stuck at ptxas -arch=sm_100a XXX.

'-mtune=generic' '-march=x86-64' '-dumpdir' 'nvcc_keep1/72a_blackwell_nvfp4_bf16_gemm.compute_100a.cpp1.'
#$ "$CICC_PATH/cicc" --c++17 --static-host-stub --device-hidden-visibility --gnu_version=130300 --display_error_number --orig_src_file_name "72a_blackwell_nvfp4_bf16_gemm.cu" --orig_src_path_name "/home/tmp/cutlass/examples/72_blackwell_narrow_precision_gemm/72a_blackwell_nvfp4_bf16_gemm.cu" --allow_managed --debug_mode --relaxed_constexpr   -arch compute_100a -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "72a_blackwell_nvfp4_bf16_gemm.fatbin.c" -g -O0 -tused --module_id_file_name "nvcc_keep1/72a_blackwell_nvfp4_bf16_gemm.module_id" --gen_c_file_name "nvcc_keep1/72a_blackwell_nvfp4_bf16_gemm.compute_100a.cudafe1.c" --stub_file_name "nvcc_keep1/72a_blackwell_nvfp4_bf16_gemm.compute_100a.cudafe1.stub.c" --gen_device_file_name "nvcc_keep1/72a_blackwell_nvfp4_bf16_gemm.compute_100a.cudafe1.gpu"  "nvcc_keep1/72a_blackwell_nvfp4_bf16_gemm.compute_100a.cpp1.ii" -o "nvcc_keep1/72a_blackwell_nvfp4_bf16_gemm.compute_100a.ptx"
warning: Register name %rd225910 is too large, generated debug information may be inaccurate.

#$ ptxas -arch=sm_100a -m64 -v  -g --dont-merge-basicblocks --return-at-end "nvcc_keep1/72a_blackwell_nvfp4_bf16_gemm.compute_100a.ptx"  -o "nvcc_keep1/72a_blackwell_nvfp4_bf16_gemm.compute_100a.sm_100a.cubin"

Dec 02 '25 08:12 linshuijin

@linshuijin Sorry for the confusion. Getting the whole debug build to work is a known issue for our compiler team. It actually wasn't fixed in 4.3.

Dec 03 '25 00:12 Junkai-Wu