
[BUG] Illegal instruction on H100 TMA


Hi,

I am trying to do a very simple TMA load on H100 and my code is producing an illegal instruction exception.

Small reproducer:

import torch
import cutlass
import cutlass.cute as cute
import cutlass.utils as utils
from cutlass.cute.runtime import from_dlpack

M, K = 128, 64

@cute.kernel
def copy_kernel(tma_atom_a: cute.CopyAtom,
                mA_tma: cute.Tensor,
                sA_layout: cute.ComposedLayout):
    tidx, _, _ = cute.arch.thread_idx()

    smem = cutlass.utils.SmemAllocator()
    sA = smem.allocate_tensor(cutlass.BFloat16, sA_layout.outer, 16, sA_layout.inner)
    mbar = smem.allocate_array(cutlass.Uint64, 1)

    if tidx == 0:
        cute.arch.mbarrier_init_arrive_cnt(mbar, 1)
        cute.arch.mbarrier_init_fence()
        cute.arch.mbarrier_init_tx_bytes(mbar, M * K * 2)

    cute.arch.sync_threads()

    tAsA, tAgA = cute.nvgpu.cpasync.tma_partition(
        atom=tma_atom_a,
        cta_coord=(0, 0),
        cta_layout=cute.make_layout((1, 1)),
        smem_tensor=sA,
        gmem_tensor=mA_tma
    )

    if tidx == 0:
        # Causes illegal instruction exception; goes away if commented out
        cute.copy(
            tma_atom_a,
            tAgA,
            tAsA,
            tma_bar_ptr=mbar
        )

    # Hangs indefinitely if this is uncommented
    # cute.arch.mbarrier_wait(mbar, 0)

@cute.jit
def launch_copy(mA: cute.Tensor):
    sw128_k_atom = cute.nvgpu.warpgroup.make_smem_layout_atom(
        kind=cute.nvgpu.warpgroup.SmemLayoutAtomKind.K_SW128,
        element_type=cutlass.BFloat16
    )

    sA_layout = cute.tile_to_shape(sw128_k_atom, (M, K), (0, 1))

    basic_tma_op = cute.nvgpu.cpasync.CopyBulkTensorTileG2SOp()
    tma_atom_a, tma_tensor_a = cute.nvgpu.cpasync.make_tma_tile_atom(
        op=basic_tma_op,
        gmem_tensor=mA,
        smem_layout=sA_layout,
        cta_tiler=(M, K)
    )

    smem_size = cute.size_in_bytes(cutlass.BFloat16, sA_layout) + 8

    copy_kernel(
        tma_atom_a,
        tma_tensor_a,
        sA_layout
    ).launch(
        grid=(1, 1, 1),
        block=(128, 1, 1),
        smem=smem_size
    )

A = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
A_tensor = from_dlpack(A, assumed_align=16)

cutlass.cuda.initialize_cuda_context()
copy = cute.compile(launch_copy, A_tensor)
copy(A)
print("launched")

torch.cuda.synchronize()

It's pretty likely that I am doing something wrong here, but it's a bit tricky to debug an illegal instruction exception. I've tried to look at https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/hopper/wgmma_tma_sm90.cu and replicate the same behavior in the Python DSL, but something seems to be going wrong. Any ideas?

Thanks!

axelfeldmann avatar Jun 04 '25 20:06 axelfeldmann

Two things should be fixed (see the sketch below):

  1. The tile_to_shape order should be (1, 0), meaning the 2nd mode K is the contiguous dimension.
  2. tma_partition needs cute.group_modes(sA, 0, 2) for both sA and mA, because it partitions off the copy atom's tile from the rest/stage/K dimensions.
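
A minimal sketch of what these two changes look like against the reproducer above, keeping the other arguments exactly as posted:

# Fix 1 (in launch_copy): tile the swizzle atom with K as the contiguous mode.
sA_layout = cute.tile_to_shape(sw128_k_atom, (M, K), (1, 0))

# Fix 2 (in copy_kernel): group the (M, K) modes of both tensors into a single
# tile mode before handing them to tma_partition.
tAsA, tAgA = cute.nvgpu.cpasync.tma_partition(
    atom=tma_atom_a,
    cta_coord=(0, 0),
    cta_layout=cute.make_layout((1, 1)),
    smem_tensor=cute.group_modes(sA, 0, 2),
    gmem_tensor=cute.group_modes(mA_tma, 0, 2)
)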

lijingticy22 avatar Jun 06 '25 02:06 lijingticy22

A 3rd thing needs to change to get mbarrier_wait to pass: change "if tidx == 0:" to "if tidx < 32:". Internally, the cute.copy implementation for a TMA copy contains a nested if that selects a single thread to issue the TMA copy instruction, and nesting that inside another single-thread if can sometimes cause bugs.
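
Applied to the reproducer, that change is just the guard around cute.copy; a minimal sketch, keeping everything else as posted:

# Let a full warp reach the copy; cute.copy's TMA path internally elects a
# single thread to issue the actual TMA instruction.
if tidx < 32:
    cute.copy(
        tma_atom_a,
        tAgA,
        tAsA,
        tma_bar_ptr=mbar
    )

cute.arch.mbarrier_wait(mbar, 0)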

lijingticy22 avatar Jun 06 '25 02:06 lijingticy22

Thanks! I think this works (at least it doesn't crash 😄 )

Question: are there any docs on these things? I particularly don't understand tma_partition, where the docs seem to be missing entirely: https://docs.nvidia.com/cutlass/media/docs/pythonDSL/cute_dsl_api/cute_nvgpu_cpasync.html#cutlass.cute.nvgpu.cpasync.tma_partition

axelfeldmann avatar Jun 07 '25 13:06 axelfeldmann

Another question:

It seems like sA_layout = cute.tile_to_shape(sw128_k_atom, (M, K), (1, 0)) and sA_wrong = cute.tile_to_shape(sw128_k_atom, (M, K), (0, 1)) do the same exact thing? They both seem to be:

S<3,4,3> o 0 o ((8,16),(64,1)):((64,512),(1,0))
S<3,4,3> o 0 o ((8,16),(64,1)):((64,512),(1,0))

axelfeldmann avatar Jun 07 '25 13:06 axelfeldmann

Question: are there any docs on these things?

Sorry, we do not yet have documentation for tma_partition; we will work on it in upcoming releases.

To answer your other question: the smem tensor and the global memory tensor passed to the tma_partition op need to be grouped into (TMATile, Rest...) modes, where the 1st mode is the tile that you want the TMA copy operation to copy. In the GEMM case, the TMATile is the blockM x blockK shape, and the Rest modes are RestK and RestL (the steps along the K and L dimensions).
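
As an illustration only (hypothetical tensor names and mode shapes, not taken from the reproducer), the grouping looks like this:

# Suppose a GEMM A-operand view has modes (bM, bK, RestK, RestL).
# Grouping the first two modes yields ((bM, bK), RestK, RestL), so the
# 1st mode is exactly the tile that one TMA copy moves per step.
gA_grouped = cute.group_modes(gA_local, 0, 2)   # gA_local: illustrative name
sA_grouped = cute.group_modes(sA_staged, 0, 2)  # sA_staged: illustrative name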

lijingticy22 avatar Jun 09 '25 05:06 lijingticy22

It seems like sA_layout = cute.tile_to_shape(sw128_k_atom, (M, K), (1, 0)) and sA_wrong = cute.tile_to_shape(sw128_k_atom, (M, K), (0, 1)) do the exact same thing?

This is because the contiguous dimension K in your case is exactly 64, which matches the sw128_k_atom's contiguous dimension (S<3,4,3> o 0 o (8,64):(64,1)). If you change the contiguous dimension K from 64 to 128, you can see the difference: sA_layout goes along the K/64 dimension first and then M/8, while sA_wrong goes along the M/8 dimension first and then K/64.

sA_layout: S<3,4,3> o 0 o ((8,16),(64,2)):((64,1024),(1,512))
sA_wrong: S<3,4,3> o 0 o ((8,16),(64,2)):((64,512),(1,8192))
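
For reference, a small sketch of how to reproduce this comparison inside launch_copy with K widened to 128 (assuming a plain trace-time print of the layout objects is what produced the strings above):

# Inside launch_copy (sketch), after building sw128_k_atom, with M, K = 128, 128:
sA_layout = cute.tile_to_shape(sw128_k_atom, (M, K), (1, 0))  # K/64 advances first, then M/8
sA_wrong = cute.tile_to_shape(sw128_k_atom, (M, K), (0, 1))   # M/8 advances first, then K/64
print(sA_layout)  # assumed to show the layout string quoted above
print(sA_wrong)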

lijingticy22 avatar Jun 09 '25 05:06 lijingticy22

Closing the issue since the problem has been identified in the user's program. @axelfeldmann please reopen the issue if needed.

jwu1980 avatar Jul 07 '25 09:07 jwu1980