[BUG] Illegal instruction on H100 TMA
Hi,
I am trying to do a very simple TMA load on H100 and my code is producing an illegal instruction exception.
Small reproducer:
import torch
import cutlass
import cutlass.cute as cute
import cutlass.utils as utils
from cutlass.cute.runtime import from_dlpack
M, K = 128, 64
@cute.kernel
def copy_kernel(tma_atom_a: cute.CopyAtom,
mA_tma: cute.Tensor,
sA_layout: cute.ComposedLayout):
tidx, _, _ = cute.arch.thread_idx()
smem = cutlass.utils.SmemAllocator()
sA = smem.allocate_tensor(cutlass.BFloat16, sA_layout.outer, 16, sA_layout.inner)
mbar = smem.allocate_array(cutlass.Uint64, 1)
if tidx == 0:
cute.arch.mbarrier_init_arrive_cnt(mbar, 1)
cute.arch.mbarrier_init_fence()
cute.arch.mbarrier_init_tx_bytes(mbar, M * K * 2)
cute.arch.sync_threads()
tAsA, tAgA = cute.nvgpu.cpasync.tma_partition(
atom=tma_atom_a,
cta_coord=(0, 0),
cta_layout=cute.make_layout((1, 1)),
smem_tensor=sA,
gmem_tensor=mA_tma
)
if tidx == 0:
# Causes illegal isntruction exception, goes away if commented out
cute.copy(
tma_atom_a,
tAgA,
tAsA,
tma_bar_ptr=mbar
)
# Hangs indefinitely if this is uncommented
# cute.arch.mbarrier_wait(mbar, 0)
@cute.jit
def launch_copy(mA: cute.Tensor):
sw128_k_atom = cute.nvgpu.warpgroup.make_smem_layout_atom(
kind=cute.nvgpu.warpgroup.SmemLayoutAtomKind.K_SW128,
element_type=cutlass.BFloat16
)
sA_layout = cute.tile_to_shape(sw128_k_atom, (M, K), (0, 1))
basic_tma_op = cute.nvgpu.cpasync.CopyBulkTensorTileG2SOp()
tma_atom_a, tma_tensor_a = cute.nvgpu.cpasync.make_tma_tile_atom(
op=basic_tma_op,
gmem_tensor=mA,
smem_layout=sA_layout,
cta_tiler=(M, K)
)
smem_size = cute.size_in_bytes(cutlass.BFloat16, sA_layout) + 8
copy_kernel(
tma_atom_a,
tma_tensor_a,
sA_layout
).launch(
grid=(1, 1, 1),
block=(128, 1, 1),
smem=smem_size
)
A = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
A_tensor = from_dlpack(A, assumed_align=16)
cutlass.cuda.initialize_cuda_context()
copy = cute.compile(launch_copy, A_tensor)
copy(A)
print("launched")
torch.cuda.synchronize()
It's pretty likely that I am doing something wrong here, but it's a bit tricky to debug an illegal instruction exception. I've tried to look at this: https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/hopper/wgmma_tma_sm90.cu and replicate similar behavior but in the Python DSL, but something seems to be going wrong. Any ideas?
Thanks!
2 things should be fixed:
- tile_to_shape order should be (1,0), meaning 2nd mode K is contiguous dimensio
- tma_partition need use cute.group_modes(sA, 0, 2) for both sA and mA, because it is marking atom of a copy out of the rest/stage/K dimensions.
3rd thing need be changed to make you pass mbarrier_wait is, change "if tidx == 0:" to "if tidx < 32:", internally in cute.copy implementation for tma_copy, we would have a nested if to select a single thread to issue tma copy instruction. Having nested if to select a single thread sometimes cause bugs.
Thanks! I think this works (at least it doesn't crash 😄 )
Question: are there any docs on these things? I particularly don't understand tma_partition where the docs seem pretty completely missing: https://docs.nvidia.com/cutlass/media/docs/pythonDSL/cute_dsl_api/cute_nvgpu_cpasync.html#cutlass.cute.nvgpu.cpasync.tma_partition
Another question:
It seems like sA_layout = cute.tile_to_shape(sw128_k_atom, (M, K), (1, 0)) and
sA_wrong = cute.tile_to_shape(sw128_k_atom, (M, K), (0, 1))
do the same exact thing? They both seem to be:
S<3,4,3> o 0 o ((8,16),(64,1)):((64,512),(1,0))
S<3,4,3> o 0 o ((8,16),(64,1)):((64,512),(1,0))
Question: are there any docs on these things? Sorry, we do not yet have doc for
tma_partition, we will work on it in next releases.
For your question, the smem tensor and global memory tensor passed to tma_partition op need to grouped to be (TMATile, Rest...) dimensions, the 1st mode need be the tile that you want the tma_copy operation to copy. In gemm case, the TMATile is the blockMxBlockK shape, and the Rest is the RestK and RestL (step on K and L dimensions).
It seems like sA_layout = cute.tile_to_shape(sw128_k_atom, (M, K), (1, 0)) and sA_wrong = cute.tile_to_shape(sw128_k_atom, (M, K), (0, 1))
This is because the contiguous dimension K in your case is exactly 64, which matches the sw128_k_atom's contiguous dimension ( S<3,4,3> o 0 o (8,64):(64,1)). If you change the contigous dimension K from 64 to 128, you can see the difference: sA_layout goes along K/64 dimension first then M/8, sA_wrong goes along M/8 dimension first then K/64.
sA_layout: S<3,4,3> o 0 o ((8,16),(64,2)):((64,1024),(1,512))
sA_wrong: S<3,4,3> o 0 o ((8,16),(64,2)):((64,512),(1,8192))
Close the issue due to the issue is identified in user's program. @axelfeldmann please reopen the issue if needed.