
[QST] Why does the cutlass profiler not profile all kernels?

Open YSF-A opened this issue 7 months ago • 11 comments

Hello. With some modifications (ElementC, LayoutA, LayoutB), I can run the example https://github.com/NVIDIA/cutlass/blob/main/examples/70_blackwell_gemm/70_blackwell_fp8_gemm.cu successfully. But for the same problem size, the cutlass profiler does not profile the kernel that the modified 70_blackwell_fp8_gemm.cu executes. How can I profile all possible kernels?

Similarly, I ran the following command, which I believe corresponds to the kernel in the modified 70_blackwell_fp8_gemm.cu, and it returned nothing:

cutlass_profiler --operation=Gemm --m=${m} --n=${n} --k=${k} --alpha=1.0 --beta=0.0 --A=f8:row --B=f8:row --C=f16:row --D=f16:row --batch_count=1 --raster_order=heuristic --accum=f32 --profiling-iterations=100 --cluster_m=2 --cluster_n=2 --cluster_k=1 --inst_m=256 --inst_n=128 --inst_k=64

By the way, I compiled the cutlass profiler with -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON.

Thanks

YSF-A avatar May 30 '25 07:05 YSF-A

please post your full repro steps starting with git clone as well as your env details like GPU, CTK, OS, compiler etc

thakkarV avatar May 30 '25 13:05 thakkarV

Hi @thakkarV , thanks for your reply.

I cloned the cutlass repo and checked out 3.9.2. I modified 70_blackwell_fp8_gemm.cu as shown in modified_70_blackwell_fp8_gemm.txt.

Then I compiled cutlass_profiler and the modified 70_blackwell_fp8_gemm, and copied build/tools/library/*.so, cutlass_profiler and 70_blackwell_fp8_gemm to the 101a device:

CUDACXX=${nvcc_path} cmake .. \
  -DCMAKE_C_COMPILER=${cross_compile_c_compiler_path} \
  -DCMAKE_CXX_COMPILER=${cross_compile_cpp_compiler_path} \
  -DCMAKE_TOOLCHAIN_FILE=${cross_compile_toolchain_path} \
  -DCUTLASS_NVCC_ARCHS=101a \
  -DCUTLASS_LIBRARY_KERNELS=all \
  -DCUTLASS_UNITY_BUILD_ENABLED=ON
make 70_blackwell_fp8_gemm
make cutlass_profiler

On the 101a device, 70_blackwell_fp8_gemm runs successfully. I then ran cutlass_profiler with what I believe is the same problem size as 70_blackwell_fp8_gemm. I added --ignore-kernels because the listed kernels failed during profiling:

LD_LIBRARY_PATH=${cutlass_profiler_lib_path} ./cutlass_profiler --operation=Gemm --m=1024 --n=512 --k=1024 --alpha=1.0 --beta=0.0 --A=f8:row --B=f8:row --C=f16:row --D=f16:row --batch_count=1 --raster_order=heuristic --accum=f32 --profiling-iterations=100 --ignore-kernels=cutlass3x_sm100_tensorop_s64x128x32gemm_f8_f8_f32_f16_f16_256x512x128_4x4x1_0_ttt_align16_1sm,cutlass3x_sm100_tensorop_s128x128x32gemm_f8_f8_f32_f16_f16_512x512x128_4x4x1_0_ttt_align16_1sm,cutlass3x_sm100_tensorop_s128x128x32gemm_f8_f8_f32_f16_f16_256x512x128_4x4x1_0_ttt_align16_2sm_epi_tm

Only the following kernels appear in the output:

cutlass3x_sm100_tensorop_s64x128x32gemm_f8_f8_f32_f16_f16_64x256x128_1x2x1_0_ttt_align16_1sm
cutlass3x_sm100_tensorop_s64x128x32gemm_f8_f8_f32_f16_f16_128x128x128_2x1x1_0_ttt_align16_1sm
cutlass3x_sm100_tensorop_s64x128x32gemm_f8_f8_f32_f16_f16_64x128x128_1x1x1_0_ttt_align16_1sm
cutlass3x_sm100_tensorop_s64x128x32gemm_f8_f8_f32_f16_f16_64x512x128_1x4x1_0_ttt_align16_1sm
cutlass3x_sm100_tensorop_s128x128x32gemm_f8_f8_f32_f16_f16_128x256x128_1x2x1_0_ttt_align16_1sm
cutlass3x_sm100_tensorop_s128x128x32gemm_f8_f8_f32_f16_f16_256x128x128_2x1x1_0_ttt_align16_1sm
cutlass3x_sm100_tensorop_s128x128x32gemm_f8_f8_f32_f16_f16_128x128x128_1x1x1_0_ttt_align16_1sm
cutlass3x_sm100_tensorop_s128x128x32gemm_f8_f8_f32_f16_f16_128x512x128_1x4x1_0_ttt_align16_1sm
cutlass3x_sm100_tensorop_s128x128x32gemm_f8_f8_f32_f16_f16_128x128x128_2x1x1_0_ttt_align16_2sm_epi_tma
cutlass3x_sm100_tensorop_s128x128x32gemm_f8_f8_f32_f16_f16_128x256x128_2x2x1_0_ttt_align16_2sm_epi_tma
cutlass3x_sm100_tensorop_s128x128x32gemm_f8_f8_f32_f16_f16_128x512x128_2x4x1_0_ttt_align16_2sm_epi_tma
cutlass3x_sm100_tensorop_s128x128x32gemm_f8_f8_f32_f16_f16_256x128x128_4x1x1_0_ttt_align16_2sm_epi_tma
cutlass3x_sm100_tensorop_s128x128x32gemm_f8_f8_f32_f16_f16_256x256x128_4x2x1_0_ttt_align16_2sm_epi_tma

It seems that cutlass_profiler does not profile the configuration used by 70_blackwell_fp8_gemm.
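One way to check this by script rather than by eye is to split each kernel name in the profiler output into its fields and compare them with the example's configuration. The sketch below assumes a field order inferred only from the names listed above (architecture, op class, instruction shape, element types, tile, cluster, stages, layouts, alignment, suffix); `parse_kernel_name` and `find_matching` are hypothetical helpers, not part of CUTLASS.

```python
import re

# Field order is an assumption inferred from the kernel names in this thread,
# not taken from the CUTLASS naming code.
NAME_RE = re.compile(
    r"^cutlass3x_sm(?P<arch>\d+)_(?P<opclass>[a-z]+)_"
    r"(?P<inst>[a-z]\d+x\d+x\d+)gemm_"
    r"(?P<types>[a-z0-9]+(?:_[a-z0-9]+)*?)_"
    r"(?P<tile>\d+x\d+x\d+)_(?P<cluster>\d+x\d+x\d+)_"
    r"(?P<stages>\d+)_(?P<layout>[tn]+)_align(?P<align>\d+)"
    r"(?:_(?P<suffix>.+))?$"
)


def _dims(text):
    """Turn '128x256x128' into (128, 256, 128)."""
    return tuple(int(v) for v in text.split("x"))


def parse_kernel_name(name):
    """Split a profiler kernel name into fields; return None if it does not match."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    return {
        "arch": int(m["arch"]),
        "inst": _dims(m["inst"][1:]),      # drop the leading letter, e.g. 's'
        "types": m["types"].split("_"),    # e.g. ['f8', 'f8', 'f32', 'f16', 'f16']
        "tile": _dims(m["tile"]),
        "cluster": _dims(m["cluster"]),
        "layout": m["layout"],
        "suffix": m["suffix"] or "",
    }


def find_matching(names, tile, cluster):
    """Return the names whose tile and cluster fields equal the requested ones."""
    hits = []
    for name in names:
        info = parse_kernel_name(name)
        if info and info["tile"] == tile and info["cluster"] == cluster:
            hits.append(name)
    return hits
```

Calling `find_matching(profiled_names, tile, cluster)` with the tile and cluster values from the example returns an empty list when the profiler output contains no kernel with those two fields, which makes the mismatch explicit.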

By the way, I can reproduce a similar problem with the unmodified 70_blackwell_fp8_gemm. I am not sure how to profile it correctly.

Thank you for your help.

YSF-A avatar Jun 03 '25 06:06 YSF-A

what do you see if you remove -DCUTLASS_LIBRARY_KERNELS=all and then try to build again?

also CC @Junkai-Wu for an additional set of eyes. sorry, my responses may be delayed

thakkarV avatar Jun 03 '25 14:06 thakkarV

Do you see the same kernel configuration as your modifications to the example in the cutlass library? The examples are separate from the library/profiler and thus the same kernels may not exist (by default at least) in both.

https://github.com/NVIDIA/cutlass/blob/9d165a3b8ef446a7ff3db198413f82bcb83f46fe/python/cutlass_library/generator.py#L7188

d-k-b avatar Jun 03 '25 17:06 d-k-b

Hi, if you don't modify any code, do any errors occur?

AnnaTrainingG avatar Jun 04 '25 10:06 AnnaTrainingG

> Hi, if you don't modify any code, do any errors occur?

Yes, I hit a similar problem even when I modify nothing.

YSF-A avatar Jun 04 '25 10:06 YSF-A

> Do you see the same kernel configuration as your modifications to the example in the cutlass library? The examples are separate from the library/profiler and thus the same kernels may not exist (by default at least) in both.
>
> cutlass/python/cutlass_library/generator.py, line 7188 in 9d165a3:
>
> def GenerateSM100_TensorOp_fp8_UMMA_gemm(manifest, cuda_version, gemm_kind=GemmKind.Universal3x):

Hi @d-k-b , thanks for your reply.

There is no matching kernel configuration in generator.py.

I noticed that the cutlass profiler does not cover all possible kernels, which is also discussed in https://github.com/NVIDIA/cutlass/issues/809#issuecomment-1424262071

Is there a way to profile all kernels? Adding every possible kernel to generator.py seems difficult for me, because some configurations may be invalid and I may miss others. Or do the default kernels profiled by the cutlass profiler already perform well, with the others being slower, so that there is no need to profile them?

By the way, I would like to clear up another point of confusion. To profile a kernel with ElementA = cutlass::float_e4m3_t, should I use --A=f8 or --A=fe4m3 when running cutlass_profiler?

Thank you.

YSF-A avatar Jun 04 '25 10:06 YSF-A

--A=fe4m3 should be used, not f8

thakkarV avatar Jun 04 '25 13:06 thakkarV

@YSF-A You need to specify --cluster_m_fallback, --cluster_n_fallback and --cluster_k_fallback in your profiler command; otherwise they are initialized to 0, which causes issues for dynamic cluster kernels. You can add --cluster_m_fallback=1 --cluster_n_fallback=1 --cluster_k_fallback=1 to your command to see whether it profiles the kernel correctly.
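Putting the two suggestions together (fe4m3 instead of f8, plus the fallback cluster flags), a corrected command line can be assembled as in the sketch below. It is only an illustration: `profiler_command` is a hypothetical helper, the binary path is a placeholder, and using fe4m3 for B as well as A is an assumption, since only --A was discussed here. The remaining flags are copied from the command posted earlier in this thread.

```python
import shlex


def profiler_command(m, n, k, a="fe4m3:row", b="fe4m3:row",
                     cluster=(2, 2, 1), fallback=(1, 1, 1),
                     binary="./cutlass_profiler"):
    """Build a cutlass_profiler command line as a single shell-quoted string.

    The default for `b` and the `binary` path are assumptions for illustration.
    """
    args = [
        binary,
        "--operation=Gemm",
        f"--m={m}", f"--n={n}", f"--k={k}",
        f"--A={a}", f"--B={b}",
        "--C=f16:row", "--D=f16:row",
        "--accum=f32",
        # Preferred cluster shape, as in the original command.
        f"--cluster_m={cluster[0]}",
        f"--cluster_n={cluster[1]}",
        f"--cluster_k={cluster[2]}",
        # Fallback cluster shape; per the comment above these default to 0
        # when omitted, which breaks dynamic cluster kernels.
        f"--cluster_m_fallback={fallback[0]}",
        f"--cluster_n_fallback={fallback[1]}",
        f"--cluster_k_fallback={fallback[2]}",
    ]
    return shlex.join(args)
```

For the problem size used earlier, `print(profiler_command(1024, 512, 1024))` prints a command that can be pasted into a shell on the target device.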

Junkai-Wu avatar Jun 04 '25 14:06 Junkai-Wu

> --A=fe4m3 should be used, not f8

> @YSF-A You need to specify --cluster_m_fallback, --cluster_n_fallback and --cluster_k_fallback in your profiler command; otherwise they are initialized to 0, which causes issues for dynamic cluster kernels. You can add --cluster_m_fallback=1 --cluster_n_fallback=1 --cluster_k_fallback=1 to your command to see whether it profiles the kernel correctly.

Hi, sorry for my late reply. With the help above, I can now profile correctly and use the same configuration as cutlass_profiler in my code. But I am still unsure how to profile all possible kernels. Thank you all for your help.

YSF-A avatar Jun 06 '25 14:06 YSF-A

I ran into the same problem. On my NVIDIA Jetson AGX Orin, I cloned CUTLASS:

git clone https://github.com/NVIDIA/cutlass.git

and checked out v3.1.0:

cd cutlass && git checkout v3.1.0

Following the instructions in media/docs/profiler.md, I built cutlass_profiler with:

#!/bin/bash
mkdir -p build && cd build

cmake .. -DCUTLASS_NVCC_ARCHS="80;86;87" \
  -DCUTLASS_LIBRARY_KERNELS=all \
  -DCUTLASS_UNITY_BUILD_ENABLED=ON \
  -DCUTLASS_LIBRARY_OPERATIONS=gemm \
  -DCUTLASS_ENABLE_TESTS=OFF \
  -DCUTLASS_ENABLE_PROFILER=ON

make cutlass_profiler -j$(nproc)

Then I ran

cd build/tools/profiler
./cutlass_profiler --operation=Gemm --D=*:row

and got no output. I ran

echo $?

and got 0.
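A run that exits with status 0 but prints nothing is easy to miss in a script. The sketch below is a small wrapper that flags that case; `run_and_check` is a hypothetical helper. Reading an empty, successful run as "no kernel matched the given filters" is an interpretation on my side, not something the profiler reports.

```python
import subprocess
import sys


def run_and_check(argv):
    """Run a command and report (exit_code, silent).

    `silent` is True when the command exits with status 0 but writes
    nothing (other than whitespace) to stdout.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    silent = result.returncode == 0 and not result.stdout.strip()
    if silent:
        # Warn on stderr so the message does not mix with captured output.
        print("warning: command succeeded but produced no output:",
              " ".join(argv), file=sys.stderr)
    return result.returncode, silent
```

It could be called as `run_and_check(["./cutlass_profiler", "--operation=Gemm", "--D=*:row"])`; passing the arguments as a list also keeps the shell from touching the `*`.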

I try to remove -DCUTLASS_LIBRARY_KERNELS=all, and

In CUTLASS I can use a GEMM kernel whose C (D) layout is row major, but cutlass_profiler doesn't profile any GEMM kernels of this kind. Why?

Like @YSF-A, I also wonder how to profile all possible kernels.

liuyao0 avatar Jun 09 '25 02:06 liuyao0

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Jul 09 '25 03:07 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Oct 07 '25 04:10 github-actions[bot]