[QST] What is the difference between `TensorOp` and `WmmaTensorOp`?
I'm reading this documentation: https://github.com/NVIDIA/cutlass/blob/main/media/docs/functionality.md and I see this:
- `TensorOp` - Use Tensor Core MMA
- `SpTensorOp` - Use Sparse Tensor Core MMA
- `WmmaTensorOp` - Use WMMA abstraction to use Tensor Core MMA
What is the difference between `TensorOp` and `WmmaTensorOp`?
I don't work for NVIDIA, but I can give my unofficial answer.
First of all, they are two different ways to use tensor cores. Take a look at this. PTX provides two interfaces to tensor cores: wmma and mma. My understanding of the key difference is that wmma abstracts away the loading of matrix elements into registers in preparation for the tensor core instructions, whereas the mma API gives you control over that loading. This control can be helpful for reducing shared memory bank conflicts when the data is coming from shared memory. The other difference is that wmma is part of CUDA C++ (here), whereas if you want to use the mma API you need to inline PTX into your kernel. Basically, wmma is more user-friendly, portable, and potentially less performant; mma is less abstracted and gives you more fine-grained control.
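To make the two paths concrete, here is a minimal sketch, not CUTLASS code: the kernel/function names and the trivial one-warp, one-tile shape are made up for illustration, and I'm assuming sm_70+ for the wmma kernel and sm_80+ for the mma snippet.

```cuda
#include <mma.h>
#include <cstdint>
using namespace nvcuda;

// Path 1: the CUDA C++ wmma API. The fragment types and
// load/store_matrix_sync calls hide how matrix elements are
// distributed across the warp's registers.
__global__ void wmma_gemm_16x16x16(const __half* A, const __half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);   // wmma decides the register layout
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Path 2: the PTX mma instruction via inline asm (m16n8k16 shape).
// Here the caller must itself place operands into a[], b[], c[]
// following the thread-to-element mapping documented in the PTX ISA;
// that explicit control is what lets you tune how data moves out of
// shared memory (e.g. to avoid bank conflicts).
__device__ void mma_m16n8k16(uint32_t const (&a)[4], uint32_t const (&b)[2],
                             float const (&c)[4], float (&d)[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```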
But I think part of why CUTLASS exists is to prevent you from having to worry about this type of distinction, so if you are choosing which kernel to run, I'd just pick the fastest!
Avoid using WMMA if you can
Hello @thakkarV, when running cutlass_profiler, I found that *_sptensorop_* is generally faster than *_tensorop_* when running a 4Kx4Kx4K GEMM. For example, I get 860 TFLOPS at best using *_tensorop_* but 960 TFLOPS at best using *_sptensorop_*. Why is SpTensorOp faster in dense GEMM computation? Is it safe to always choose *_sptensorop_* for a dense 4096x4096x4096 GEMM computation?
sptensorop uses the structured sparse MMA, which is why you see it being faster
Thanks, that's reasonable if some region of the GEMM inputs is sparse. But for a dense GEMM whose two inputs are standard random data without any sparse region, can *_sptensorop_* still be faster than *_tensorop_*?
Sparse GEMM forces structured sparsity. It's a totally different kernel and has implications for your workload characteristics.
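For context on what "structured sparsity" implies here: the sparse tensor cores require the A operand to be pruned to a 2:4 pattern, i.e. at most two nonzeros in every group of four consecutive elements along K. A small sketch of what that constraint means, with a hypothetical helper `is_2_4_sparse` that is not part of CUTLASS:

```cuda
#include <cstdio>

// Hedged sketch (hypothetical helper, not a CUTLASS API): checks whether
// a row-major M x K matrix satisfies the 2:4 structured-sparsity pattern
// required by the sparse tensor core MMAs: in every group of 4
// consecutive elements along K, at most 2 may be nonzero.
bool is_2_4_sparse(const float* A, int M, int K) {
    for (int m = 0; m < M; ++m) {
        for (int k = 0; k + 4 <= K; k += 4) {
            int nonzeros = 0;
            for (int i = 0; i < 4; ++i)
                nonzeros += (A[m * K + k + i] != 0.0f);
            if (nonzeros > 2) return false;
        }
    }
    return true;
}

int main() {
    // Dense random data essentially never satisfies 2:4 sparsity, so a
    // dense GEMM cannot simply be handed to *_sptensorop_* kernels without
    // first pruning (zeroing 2 of every 4 values) and accepting the
    // resulting change to the computation.
    float dense[8]  = {1, 2, 3, 4, 5, 6, 7, 8};   // 4 nonzeros per group
    float pruned[8] = {1, 0, 3, 0, 0, 6, 0, 8};   // 2 nonzeros per group
    printf("dense  2:4? %d\n", is_2_4_sparse(dense, 1, 8));   // prints 0
    printf("pruned 2:4? %d\n", is_2_4_sparse(pruned, 1, 8));  // prints 1
}
```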
OK, does that mean a fully random GEMM (e.g. torch.matmul(x, y)) without any data sparsity cannot benefit from using *_sptensorop_*? In other words, cutlass_profiler reporting 960 TFLOPS for a 4Kx4Kx4K GEMM using SpTensorOp isn't a fair comparison against *_tensorop_*, which reaches just 860 TFLOPS?
right
Thank you. Then it looks like 860 TFLOPS is the peak that CUTLASS can achieve for dense GEMM.
Yes, on Hopper that sounds about right.