[QST] Why does the profiler run so many kernels?

Open sleepwalker2017 opened this issue 1 year ago • 2 comments

Hi all, I'm new to CUTLASS. I ran the CUTLASS profiler like this:

./tools/profiler/cutlass_profiler --op_class=tensorop --m=4352 --n=4096 --k=4096

It seems that multiple CUTLASS kernels are run. Why is that? Is there a detailed explanation?

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_c1688gemm_128x64_16x3_nn_align1

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=4352 --n=4096 --k=4096 --A=cf32:column --B=cf32:column --C=cf32:column --D=cf32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --op_class=tensorop --accum=cf32 --cta_m=128 --cta_n=64 --cta_k=16 --cluster_m=1 --cluster_n=1 --cluster_k=1  \
                  --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=80 --max_cc=1024  \


           Bytes: 419430400  bytes
           FLOPs: 584258158592  flops
           FLOPs/Byte: 1392

         Runtime: 46.3135  ms
          Memory: 8.43436 GiB/s

            Math: 12615.3 GFLOP/s


=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_c1688gemm_128x64_16x3_cn_align1

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=4352 --n=4096 --k=4096 --A=cf32:column --B=cf32:column --C=cf32:column --D=cf32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --op_class=tensorop --accum=cf32 --cta_m=128 --cta_n=64 --cta_k=16 --cluster_m=1 --cluster_n=1 --cluster_k=1  \
                  --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=80 --max_cc=1024  \


           Bytes: 419430400  bytes
           FLOPs: 584258158592  flops
           FLOPs/Byte: 1392

         Runtime: 45.6831  ms
          Memory: 8.55075 GiB/s

            Math: 12789.4 GFLOP/s



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_c1688gemm_128x64_16x3_nc_align1

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=4352 --n=4096 --k=4096 --A=cf32:column --B=cf32:column --C=cf32:column --D=cf32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --op_class=tensorop --accum=cf32 --cta_m=128 --cta_n=64 --cta_k=16 --cluster_m=1 --cluster_n=1 --cluster_k=1  \
                  --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=80 --max_cc=1024  \


           Bytes: 419430400  bytes
           FLOPs: 584258158592  flops
           FLOPs/Byte: 1392

         Runtime: 46.3142  ms
          Memory: 8.43423 GiB/s

            Math: 12615.1 GFLOP/s

...

sleepwalker2017 avatar Jun 13 '24 08:06 sleepwalker2017

The profiler runs every kernel capable of implementing the given problem requirements. In this case the constraints are very general (only the operator class and the problem size are given), so many kernels qualify and all of them are run. You can add constraints on data types, layouts, and other features to narrow down the set of kernels. See cutlass_profiler --help for more information!
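As a rough sketch of what "additional constraints" could look like, the script below builds two narrower invocations from the original command. The first pins operand types, layouts, and the accumulator type; the f16/f32 values are illustrative assumptions, not taken from the question. The second selects one kernel by name with --kernels, using a name copied from the output above. Whether either one matches anything depends on which kernels were compiled into your build, so treat this as a starting point and confirm the flags with cutlass_profiler --help.

```shell
#!/bin/sh
# Profiler path as used in the question; adjust to your build tree.
PROFILER=./tools/profiler/cutlass_profiler

# Operator class and problem size from the original command.
BASE="--op_class=tensorop --m=4352 --n=4096 --k=4096"

# Option 1: pin operand types/layouts and the accumulator type.
# f16 operands with f32 accumulation are illustrative choices (assumption).
BY_TYPE="$BASE --operation=Gemm --A=f16:column --B=f16:column --C=f16:column --accum=f32"

# Option 2: select a single kernel by name (name copied from the output above).
BY_NAME="$BASE --kernels=cutlass_tensorop_c1688gemm_128x64_16x3_nn_align1"

for ARGS in "$BY_TYPE" "$BY_NAME"; do
  if [ -x "$PROFILER" ]; then
    # $ARGS is intentionally unquoted so it splits into separate flags.
    "$PROFILER" $ARGS
  else
    echo "profiler not found; would run: $PROFILER $ARGS"
  fi
done
```

With the type and layout flags set, the profiler should report far fewer "Problem ID" blocks than the unconstrained run, since only kernels matching all of the given constraints remain candidates.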

d-k-b avatar Jun 13 '24 15:06 d-k-b

Thank you! I see: I did not specify the data types or the layouts.

sleepwalker2017 avatar Jun 14 '24 02:06 sleepwalker2017
