cutlass
CUDA Templates for Linear Algebra Subroutines
From my naive understanding, the second arrow (output "?") is correct
Tensors in CuTe DSL use int32 strides by default. This causes illegal memory accesses (IMA) for large tensors. Is there a way to force strides to be int64? **Steps/Code to reproduce bug**...
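To illustrate why 32-bit strides can break on large tensors, here is a minimal sketch in plain Python. It does not use the CuTe DSL API. The helper names (`wrap_int32`, `linear_offset`) and the example shape are hypothetical, and the two's-complement wraparound is an assumption about how an overflowing 32-bit offset would behave, not a statement about what CuTe DSL actually emits.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reinterpret a Python int as a signed 32-bit two's-complement value."""
    return (value + 2**31) % 2**32 - 2**31


def linear_offset(coord, stride, wrap=False):
    """Dot product of coordinate and stride, optionally wrapped to int32."""
    offset = sum(c * s for c, s in zip(coord, stride))
    return wrap_int32(offset) if wrap else offset


# Hypothetical row-major tensor with 2**32 elements.
shape = (65536, 65536)
stride = (65536, 1)
last = (shape[0] - 1, shape[1] - 1)

exact = linear_offset(last, stride)               # arbitrary-precision result
wrapped = linear_offset(last, stride, wrap=True)  # what 32-bit math would give

print("exact offset:  ", exact)
print("wrapped offset:", wrapped)
print("fits in int32: ", exact <= INT32_MAX)
```

Once the exact offset exceeds `INT32_MAX`, the wrapped value no longer points at the intended element, which is consistent with the out-of-bounds access the issue describes.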
**Is your feature request related to a problem? Please describe.** It would be nice to have a utility function in `CuTeDSL` like `print_latex` in the `C++` API. **Describe the solution you'd like**...
**What is your question?** When profiling CUDA/CUTLASS, the profiler can provide line-by-line profiling for user code, in addition to PTX and SASS. Triton can also do this, likely because its...
Hello! This MR provides two things: 1) Zero points for default mode 2) GPT-Q [semantics](https://pytorch.org/blog/accelerating-triton/) Closes #2261
At least 4.1 and `3.9.0.0` are missing from PyPI. cuda-python 12.6.2 is required for CUDA 12.6, and it has API deprecations (`import cuda.bindings.cuda` instead of `import cuda.cuda`) nvidia-cutlass 4.1...
## code https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/ampere/tensorop_gemm.py ## env ``` pip list | grep -i cutlass nvidia-cutlass-dsl 4.3.0 nvcc -V nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2025 NVIDIA Corporation Built on Wed_Aug_20_01:57:39_PM_PDT_2025...