cutlass
CUDA Templates for Linear Algebra Subroutines
**What is your question?** I recently tried to change the type tags on the [DGEMM examples](https://github.com/NVIDIA/cutlass/blob/main/examples/45_dual_gemm/dual_gemm.cu) to `cutlass::arch::Sm90`, which caused a load of compile errors. This is primarily because there's...
I'm adding (PR [here](https://github.com/pytorch/pytorch/pull/119986)) CUTLASS kernels as an auto-tune option for the PyTorch compiler, and it would be nice to have these additional configurations available. This is not urgent, and more...
**What is your question?** Hello, I found that many epilogues are element-wise. I wondered if it could be customized to sum up a `2*2` tile instead of an element-wise operation....
I have implemented a basic sample code to convolve a 2D image with a row filter. It works, but when the dst image has some stride, it seems ignored by...
Is s8 * s8 = {s32, s8} supported in CuTe?
I am benchmarking sparse and dense GEMMs through the cutlass profiler. I am seeing that sparse GEMMs run **slower** than dense GEMMs in the same scenario. For example, compare the...
The CUTLASS profiler has a great set of flags to perform shmoos across different matrix shapes and sizes. While benchmarking GEMMs using the CUTLASS profiler, one can use cuBLAS as a...
Dynamic offsets in `DefaultEpilogue` allow pointer arithmetic to be moved to the device, shifting the `C` and `D` pointers by offsets stored in device memory. Depends on https://github.com/NVIDIA/cutlass/pull/1273
As it stands, when a runtime assert is triggered on CUDA platforms, your program just explodes with no stack trace and no mention of the error that was encountered. I just...