cutlass
CUDA Templates for Linear Algebra Subroutines
### Which component has the problem?
CuTe DSL
### Bug Report
**Describe the bug**
`pip install -e .` still creates 4.2.0.0. However, `pip install -e .` in `python/CuTeDSL` creates 4.3.0.dev0...
In `include/cutlass/arch/mma_sm90.h`, the PTX instruction `mma.sync.aligned.m16n8k16` has a typo in the variable for `%5`: it should be `A[1]` but is currently `A[2]` (and hence `A[2]` is used twice and `A[1]` not...
Update `CuTeDSL/requirements.txt` so that [prep_editable_install.py](https://github.com/NVIDIA/cutlass/blob/main/python/CuTeDSL/prep_editable_install.py) downloads the correct wheel. Otherwise, `import cutlass` fails after an editable install because of the refactor of `_mlir_libs` to `_cutlass_ir` in `python/CuTeDSL/__init__.py`.
On Blackwell, most dispatch policies rely on auto. This change adds auto-mode dispatch validation for mx data types.
Performance:
- 52.1 TFLOPS on NVIDIA L4 (Ada, SM 8.9)
- 1.74× faster than CUTLASS 4.3.0 baseline (~30 TFLOPS)
- 63× faster than cuSPARSE (0.87 TFLOPS)
- 83% efficiency vs...
I have noticed that CuTe DSL currently supports fp16/bf16 warp MMA only with shapes `m16n8k16` and `m16n8k8`. https://github.com/NVIDIA/cutlass/blob/ec8daf642d69fc31352ac6fa6e14a0de9019604b/python/CuTeDSL/cutlass/cute/nvgpu/warp/mma.py Are there plans to support others in the future, such as - Turing...
### Which component has the problem?
CuTe DSL
### Bug Report
**Describe the bug**
When running [grouped_blockscaled_gemm.py](https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/blackwell/grouped_blockscaled_gemm.py) with `use_cold_l2 == True`, I run into an IMA with `warmup_iterations + iterations...
`-Wimplicit-fallthrough` is a very high-signal warning. In internal testing we found that 30-40% of flagged instances were bugs of some sort. CUTLASS currently passes `-Wimplicit-fallthrough`, but doesn't enforce it....
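For readers unfamiliar with the warning, here is a minimal sketch of the kind of code `-Wimplicit-fallthrough` flags and how an intentional fallthrough is annotated in C++17. The `Op` enum and both functions are hypothetical, written only for illustration; they are not taken from CUTLASS.

```cpp
#include <cassert>

// Hypothetical operation kinds, for illustration only (not CUTLASS code).
enum class Op { kLoad, kStore, kFence };

// Buggy variant: the `break` after kLoad is missing, so control silently
// falls into the kStore case. -Wimplicit-fallthrough warns at `case Op::kStore`.
int cost_buggy(Op op) {
  int cost = 0;
  switch (op) {
    case Op::kLoad:
      cost += 1;
      // no break and no annotation here
    case Op::kStore:
      cost += 2;
      break;
    case Op::kFence:
      cost += 4;
      break;
  }
  return cost;
}

// Fixed variant: the accidental fallthrough is closed with `break`, and the
// one intentional fallthrough is marked with the C++17 [[fallthrough]]
// attribute, which silences the warning for that label only.
int cost_fixed(Op op) {
  int cost = 0;
  switch (op) {
    case Op::kLoad:
      cost += 1;
      break;
    case Op::kStore:
      cost += 2;
      [[fallthrough]];  // intentional: a store also pays the fence cost
    case Op::kFence:
      cost += 4;
      break;
  }
  return cost;
}
```

With the missing `break`, `cost_buggy(Op::kLoad)` also picks up the store cost. Building with `-Werror=implicit-fallthrough` (accepted by GCC and Clang) would turn that silent behavior into a compile error.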
The term "synchronize" may cause confusion, since users may take it to mean a stream sync. It simply means that we pass the correct current stream as the env stream.