cutlass
CUDA Templates for Linear Algebra Subroutines
`-I/usr/local/cuda/include/cccl` in CUDA 13 (related #2543); remove duplicated `--cuda-gpu-arch`; `_LIBCUDACXX_STD_VER` is deprecated and not used by the project.
Ports over some of the LaTeX printing functionality from C++ and adds an example.
### Which component has the problem? CuTe DSL ### Bug Report **Describe the bug** **Steps/Code to reproduce bug** ``` import cutlass.cute as cute @cute.jit def test(): layoutA = cute.make_layout((4, 4),...
**What is your question?** I am trying to use CUTLASS on the Ampere architecture to multiply two rectangular matrices, MxK and KxN, where M and N are small (say 16) and...
### Which component has the problem? CuTe DSL ### Bug Report Building nvidia-cutlass-dsl with dynamic versioning always produces a wheel with version 0.0.0 due to missing VERSION.EDITABLE. Suggest using setuptools-scm...
Summary ------- Implements dual-GEMM examples for SM90 (Hopper) and SM120 (Blackwell) using CUTLASS 3.x. The dual-GEMM operation implemented is: ``` D0 = epilogue0(X @ B0, C0) D1 = epilogue1(X @...
I need to fix blockM and blockN to ensure batch invariance. Can the CUTLASS GEMM interface control this? `using GemmKernel = cutlass::gemm::kernel::GemmUniversal< cute::Shape, CollectiveMainloop, CollectiveEpilogue>;` Thanks in advance!
### Which component has the problem? CuTe DSL ### Bug Report **Describe the bug** with nvidia-cutlass and nvidia-cutlass-dsl 4.2.0.0 ``` python cutlass/examples/python/CuTeDSL/blackwell/tutorial_gemm/fp16_gemm_1.py nvidia_cutlass_dsl/python_packages/cutlass/cute/nvgpu/tcgen05/mma.py", line 153, in __post_init__ raise OpError( cutlass.cute.nvgpu.common.OpError:...
### Which component has the problem? CuTe DSL ### Bug Report **Steps/Code to reproduce bug** ``` import torch import cutlass import cutlass.cute as cute from cutlass.cute.runtime import from_dlpack @cute.jit def...