[QST] Why does the profiler run so many kernels?
Hi all, I'm new to CUTLASS.
I ran the CUTLASS profiler like this:
./tools/profiler/cutlass_profiler --op_class=tensorop --m=4352 --n=4096 --k=4096
It seems that multiple CUTLASS kernels are run. Why is that? Is there a detailed explanation?
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_c1688gemm_128x64_16x3_nn_align1
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=universal --m=4352 --n=4096 --k=4096 --A=cf32:column --B=cf32:column --C=cf32:column --D=cf32:column \
--alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
--op_class=tensorop --accum=cf32 --cta_m=128 --cta_n=64 --cta_k=16 --cluster_m=1 --cluster_n=1 --cluster_k=1 \
--stages=3 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=80 --max_cc=1024 \
Bytes: 419430400 bytes
FLOPs: 584258158592 flops
FLOPs/Byte: 1392
Runtime: 46.3135 ms
Memory: 8.43436 GiB/s
Math: 12615.3 GFLOP/s
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_c1688gemm_128x64_16x3_cn_align1
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=universal --m=4352 --n=4096 --k=4096 --A=cf32:column --B=cf32:column --C=cf32:column --D=cf32:column \
--alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
--op_class=tensorop --accum=cf32 --cta_m=128 --cta_n=64 --cta_k=16 --cluster_m=1 --cluster_n=1 --cluster_k=1 \
--stages=3 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=80 --max_cc=1024 \
Bytes: 419430400 bytes
FLOPs: 584258158592 flops
FLOPs/Byte: 1392
Runtime: 45.6831 ms
Memory: 8.55075 GiB/s
Math: 12789.4 GFLOP/s
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_c1688gemm_128x64_16x3_nc_align1
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=universal --m=4352 --n=4096 --k=4096 --A=cf32:column --B=cf32:column --C=cf32:column --D=cf32:column \
--alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
--op_class=tensorop --accum=cf32 --cta_m=128 --cta_n=64 --cta_k=16 --cluster_m=1 --cluster_n=1 --cluster_k=1 \
--stages=3 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=80 --max_cc=1024 \
Bytes: 419430400 bytes
FLOPs: 584258158592 flops
FLOPs/Byte: 1392
Runtime: 46.3142 ms
Memory: 8.43423 GiB/s
Math: 12615.1 GFLOP/s
...
The profiler runs every kernel capable of implementing the given problem requirements. In this case the constraints are very general (only the operator class and the problem size are given), so many kernels can support the problem and all of them are run. You can add constraints on data types and other features to narrow down the set of kernels further. See cutlass_profiler --help for more information!
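As a rough sketch of what "adding constraints" could look like, the command below reuses flags that appear in the Arguments line of the output above (data types, layouts, threadblock tile, stages). `--operation=Gemm` is not in the original post and is an assumption taken from the profiler's help text. The variable names are only for illustration, and the set of kernels that actually matches depends on which kernels were compiled into your build.

```shell
# Path to the profiler binary, as in the original post; adjust for your build tree.
PROFILER=./tools/profiler/cutlass_profiler

# Same problem size as before, plus constraints copied from the Arguments
# line of the first result above (complex f32 operands, column-major layouts,
# 128x64x16 threadblock tile, 3 stages). Fewer kernels should match these.
ARGS="--operation=Gemm --op_class=tensorop --m=4352 --n=4096 --k=4096"
ARGS="$ARGS --A=cf32:column --B=cf32:column --C=cf32:column --accum=cf32"
ARGS="$ARGS --cta_m=128 --cta_n=64 --cta_k=16 --stages=3"

# Print the full command so it can be inspected or copied.
echo "$PROFILER $ARGS"

# Run it only if the profiler binary is actually present.
if [ -x "$PROFILER" ]; then
  # ARGS is intentionally unquoted so that it splits into separate flags.
  "$PROFILER" $ARGS
fi
```

Note that the nn, cn, and nc results above all report the same layout arguments, so type and layout constraints alone may still leave several kernels. To target a single kernel, the profiler's `--kernels` filter can be given an operation name from the output, for example `--kernels=cutlass_tensorop_c1688gemm_128x64_16x3_nn_align1`.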
Thank you! I see, I did not specify the data types or the layouts.