cutlass issues

[QST] Cross compile (compute capability) using CuTeDSL + TVM-FFI

**What is your question?** Hello, I am testing the AOT feature using CuTeDSL with TVM-FFI. Does AOT compilation support cross-compilation for a different compute capability? For example, for the `examples/python/CuTeDSL/cute/tvm_ffi/aot_export.py`...

ktaebum

question

? - Needs Triage

[BUG] [Python DSL] BlockScaledMmaOp restricts FP4 operations to sm_100a only, blocks sm_120/sm_121

1

### Which component has the problem? CuTe DSL ### Bug Report Summary CUTLASS 4.2+ added SM120 and SM121 kernel support for Blackwell GeForce (RTX 50-series) and DGX Spark...

huangyucbr-hub

bug

? - Needs Triage

CuTe DSL

Unit tests for Kernels that perform BF16 x BF16 = MXFP8 and MXFP8 x MXFP8 = BF16

Shreya-gaur

[QST] CuteDSL in memory pass

1

Here I wrote a simple CuTe DSL program to cast an fp32 tensor to a bf16 tensor: ``` import argparse import math import torch import triton from typing import...

Dingjifeng

question

? - Needs Triage

[QST] How to stop unroll in cute.copy in cute dsl?

1

**What is your question?** `cute.copy` always fully unrolls its inner load/store. In some cases, this unrolling causes serious register spilling. So I wonder how to...

monellz

question

? - Needs Triage

[QST] cutedsl usage

1

When I compile CuTe DSL from source and run `import cutlass`, I get the error "No module named 'cutlass._mlir'". I'd like to know what operations need to be performed on the...

yangjianfengo1

question

? - Needs Triage

[FEA] Add Windows support in CuTe wheels on PyPI

7

### Which component requires the feature? CuTe DSL ### Feature Request Hi, `pip install nvidia-cutlass-dsl` fails on Windows because the latest release, 4.1.0 (https://pypi.org/project/nvidia-cutlass-dsl/#files), only supports manylinux, so I am requesting Windows support...

oscarbg

feature request

? - Needs Triage

CuTe DSL

Remove redundant "from" from comment

crcrpar

[BUG] Build cutlass with arch=100a failed

5

### Which component has the problem? CUTLASS C++ ### Bug Report **Describe the bug** [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm100_bf16_gemm_grouped_e2m1_objs.dir/generated/gemm/100/bf16_gemm_grouped_e2m1/cutlass3x_sm100_bstensorop_gemm_grouped_ue8m0xe2m1_ue8m0xe2m1_f32_bf16_bf16_256x64x256_0x0x1_0_tnt_align32_o_vs32_2sm_epi_tma.cu.o cd /workspace/cutlass/build/tools/library && /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler --options-file CMakeFiles/cutlass_library_gemm_sm100_bf16_gemm_grouped_e2m1_objs.dir/includes_CUDA.rsp -DCUTLASS_VERSIONS_GENERATED -O3 -DNDEBUG...

Peakulorain

bug

? - Needs Triage

inactive-30d

CUTLASS C++

[QST] About Example 23: Ampere GEMM Operand Reduction Fusion with kGemmSplitKParallel

2

I’m looking at Example 23 and noticed something when using kGemmSplitKParallel mode that I’d like clarified: in this mode, the example explicitly allocates a block of...

kitecats

question

? - Needs Triage

inactive-30d
