[QST] Strange bank conflicts in the CuTeDSL Python GEMM demo
Code
https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/ampere/tensorop_gemm.py
Env
pip list | grep -i cutlass
nvidia-cutlass-dsl 4.3.0
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
Question
I found bank conflicts with ncu while running the demo:
ncu --metrics "l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st" \
python notebooks/hello_world/7.mem_test/baseline.py \
--mnkl 2048,2048,1024,1 \
--atom_layout_mnk 2,2,1 \
--ab_dtype Float16 \
--c_dtype Float16 \
--acc_dtype Float32 \
--skip_ref_check \
--iterations 1 \
--warmup_iterations 0 \
--a_major m --b_major k --c_major n \
--use_cold_l2 \
2>&1 | tee log.txt
To debug, I limited the grid size to (1, 1, 1):
rasterization_remap_grid_dim = (1, 1, 1)
self.kernel(
    mA,
    mB,
    mC,
    sA_layout,
    sB_layout,
    sC_layout,
    tiled_copy_A,
    tiled_copy_B,
    tiled_copy_C,
    tiled_mma,
    raster_factor,
    epilogue_op,
).launch(
    grid=rasterization_remap_grid_dim,
    block=[self.num_threads, 1, 1],
    smem=smem_size,
)
I found that with --atom_layout_mnk 1,1,1 no bank conflicts occur, but with --atom_layout_mnk 2,1,1 or --atom_layout_mnk 2,2,1 bank conflicts do occur.
Then I found that with this debug grid dim
rasterization_remap_grid_dim = (cute.size(grid_dim[0]) * raster_factor, 1, 1)
even --atom_layout_mnk 1,1,1 produces bank conflicts.
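As a sanity check, the bank behavior can be modeled offline. Below is a small plain-Python helper (not CuTeDSL) that counts the shared-memory wavefronts one access phase needs, given the byte addresses a warp touches; the addresses in the example are made up for illustration, and the intent is to plug in the offsets the actual sA/sB layout produces for a given atom layout.

def wavefronts_for_phase(byte_addrs, num_banks=32, bank_width=4):
    """Number of shared-memory wavefronts needed to service one access phase.

    Accesses to the same 4-byte word in a bank are broadcast; accesses to
    different words in the same bank serialize, one wavefront per word.
    """
    words_per_bank = {}
    for addr in byte_addrs:
        word = addr // bank_width          # 4-byte word index
        bank = word % num_banks            # bank that word lives in
        words_per_bank.setdefault(bank, set()).add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)

# Hypothetical example: 8 threads each load 4 bytes from row starts 128 B
# apart. Every address maps to bank 0, so the phase serializes into 8
# wavefronts (7 of them are conflicts).
naive = [t * 128 for t in range(8)]
print(wavefronts_for_phase(naive))      # -> 8

# XOR-swizzling 16 B-granular row bits into the addresses spreads the rows
# across banks 0, 4, 8, ..., 28, so the same phase takes a single wavefront.
swizzled = [(t * 128) ^ (t * 16) for t in range(8)]
print(wavefronts_for_phase(swizzled))   # -> 1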
Logs
--atom_layout_mnk 1,1,1
ncu --metrics "l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st" \
python notebooks/hello_world/7.mem_test/baseline.py \
--mnkl 2048,2048,1024,1 \
--atom_layout_mnk 1,1,1 \
--ab_dtype Float16 \
--c_dtype Float16 \
--acc_dtype Float32 \
--skip_ref_check \
--iterations 1 \
--warmup_iterations 0 \
--a_major m --b_major k --c_major n \
--use_cold_l2 \
2>&1 | tee log.txt
==PROF== Connected to process 204017 (/usr/bin/python3.12)
==PROF== Profiling "kernel_cutlass_kernel___main_..." - 0: 0%....50%....100% - 1 pass
Running Ampere tensor core GEMM example:
mnkl: (2048, 2048, 1024, 1)
A dtype: Float16, B dtype: Float16, C dtype: Float16, Acc dtype: Float32
Matrix majors - A: m, B: k, C: n
Atoms layout: (1, 1, 1)
Warmup iterations: 0
Iterations: 1
Skip reference checking: True
Use cold L2: True
Compiling kernel with cute.compile ...
Executing GEMM kernel...
PASS
==PROF== Disconnected from process 204017
[204017] [email protected]
kernel_cutlass_kernel___main__TensorOpGemm_object_at__tensorptrf16gmemalign16odiv81i64div8i64div8_tensorptrf16gmemalign16odiv8i64div81i64div8_tensorptrf16gmemalign16odiv8i64div81i64div8_S_0 (1, 1, 1)x(32, 1, 1), Context 1, Stream 7, Device 0, CC 11.0
Warning: Data collection happened without fixed GPU frequencies. Profiling results may be inconsistent.
Section: Command line profiler metrics
-------------------------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
-------------------------------------------------------- ----------- ------------
l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum 4416
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.avg 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.max 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.min 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.avg 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.max 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.min 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum 0
-------------------------------------------------------- ----------- ------------
--atom_layout_mnk 2,1,1
ncu --metrics "l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st" \
python notebooks/hello_world/7.mem_test/baseline.py \
--mnkl 2048,2048,1024,1 \
--atom_layout_mnk 2,1,1 \
--ab_dtype Float16 \
--c_dtype Float16 \
--acc_dtype Float32 \
--skip_ref_check \
--iterations 1 \
--warmup_iterations 0 \
--a_major m --b_major k --c_major n \
--use_cold_l2 \
2>&1 | tee log.txt
==PROF== Connected to process 204577 (/usr/bin/python3.12)
==PROF== Profiling "kernel_cutlass_kernel___main_..." - 0: 0%....50%....100% - 1 pass
Running Ampere tensor core GEMM example:
mnkl: (2048, 2048, 1024, 1)
A dtype: Float16, B dtype: Float16, C dtype: Float16, Acc dtype: Float32
Matrix majors - A: m, B: k, C: n
Atoms layout: (2, 1, 1)
Warmup iterations: 0
Iterations: 1
Skip reference checking: True
Use cold L2: True
Compiling kernel with cute.compile ...
Executing GEMM kernel...
PASS
==PROF== Disconnected from process 204577
[204577] [email protected]
kernel_cutlass_kernel___main__TensorOpGemm_object_at__tensorptrf16gmemalign16odiv81i64div8i64div8_tensorptrf16gmemalign16odiv8i64div81i64div8_tensorptrf16gmemalign16odiv8i64div81i64div8_S_0 (1, 1, 1)x(64, 1, 1), Context 1, Stream 7, Device 0, CC 11.0
Warning: Data collection happened without fixed GPU frequencies. Profiling results may be inconsistent.
Section: Command line profiler metrics
-------------------------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
-------------------------------------------------------- ----------- ------------
l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum 6508
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.avg 0.60
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.max 12
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.min 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum 12
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.avg 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.max 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.min 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum 0
-------------------------------------------------------- ----------- ------------
Swizzling is enabled in the demo code, and in these runs it seems capable of avoiding all bank conflicts, yet conflicts still show up under certain configurations. What is causing them in those cases?
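For reference, the swizzle itself is easy to model. The sketch below implements a CuTe-style Swizzle<BBits, MBase, SShift> for a positive shift and applies it to the row starts of a hypothetical 64-element-wide fp16 tile; the <3,3,3> parameters and the tile width are assumptions for illustration, not values read out of the demo.

def swizzle(offset, bbits=3, mbase=3, sshift=3):
    """CuTe-style Swizzle<BBits, MBase, SShift> for a positive shift: XOR the
    BBits bits starting at (mbase + sshift) down onto the bits at mbase."""
    bit_mask = (1 << bbits) - 1
    yyy_mask = bit_mask << (mbase + sshift)
    return offset ^ ((offset & yyy_mask) >> sshift)

elem_bytes = 2      # fp16
row_elems = 64      # hypothetical K-extent of the sA tile (128 B pitch)
for row in range(8):
    elem = swizzle(row * row_elems)            # swizzled element offset
    bank = (elem * elem_bytes // 4) % 32       # 32 banks, 4 bytes each
    print(f"row {row}: starts in bank {bank}")

With these assumed parameters the eight row starts land in banks 0, 4, 8, ..., 28, which would explain the conflict-free runs; I have not yet worked out which access pattern in the 2,1,1 / 2,2,1 atom layouts (or the enlarged debug grid) escapes the swizzle.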