
[BUG] [Python DSL] BlockScaledMmaOp restricts FP4 operations to sm_100a only, blocks sm_120/sm_121

Open huangyucbr-hub opened this issue 1 month ago • 1 comment

Which component has the problem?

CuTe DSL

Bug Report

Summary

CUTLASS 4.2+ added SM120 and SM121 kernel support for Blackwell GeForce (RTX 50-series) and DGX Spark (GB10) GPUs, per the changelog (https://docs.nvidia.com/cutlass/4.2.1/CHANGELOG.html). However, the Python DSL's BlockScaledMmaOp class restricts FP4 operations to sm_100a only, preventing their use on sm_120 and sm_121 hardware.

Environment

  • Hardware: NVIDIA DGX Spark GB10 (Compute Capability 12.1, sm_121)
  • Package: nvidia-cutlass-dsl version 4.3.0 (latest from PyPI)
  • Python: 3.13
  • CUDA: 13.0.1
  • PyTorch: 2.10.0.dev20251118+cu130

Bug Location

File: python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py
Class: BlockScaledMmaOp
Lines: 303-305

@dataclass(frozen=True)
class BlockScaledMmaOp(Tcgen05MmaOp):
    # ... other fields ...

    admissible_archs = [
        Arch.sm_100a,  # ← Only sm_100a allowed
    ]

    def __post_init__(self) -> None:
        arch = CuTeDSL._get_dsl().get_arch_enum()
        if arch not in self.admissible_archs:
            raise OpError(
                self,
                f"expects arch to be one of {self.admissible_archs}, but got {arch}",
                suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
            )

Reproduction

Minimal Test Case

from cutlass import Float8E8M0FNU, Arch
from cutlass.cute.nvgpu.tcgen05 import MmaMXF4NVF4Op, CtaGroup, OperandSource

On sm_121 (GB10) or sm_120 (RTX 5090) hardware:

mma_op = MmaMXF4NVF4Op(
    sf_dtype=Float8E8M0FNU,
    instruction_shape=(16, 16, 64),
    cta_group=CtaGroup.ONE,
    a_src=OperandSource.TMEM,
)
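For completeness, here is a hedged variant of the repro that pins the CUTE_DSL_ARCH environment variable named in the OpError suggestion; the value strings are an assumption based on the Arch enum names, and the constructor still fails the same way on sm_120/sm_121:

# Hedged sketch: pin CUTE_DSL_ARCH before the DSL initializes; the value
# strings ("sm_120a"/"sm_121a") are an assumption based on the Arch enum names.
import os
os.environ["CUTE_DSL_ARCH"] = "sm_121a"  # or "sm_120a" on RTX 50-series

from cutlass import Float8E8M0FNU
from cutlass.cute.nvgpu.tcgen05 import MmaMXF4NVF4Op, CtaGroup, OperandSource

mma_op = MmaMXF4NVF4Op(
    sf_dtype=Float8E8M0FNU,
    instruction_shape=(16, 16, 64),
    cta_group=CtaGroup.ONE,
    a_src=OperandSource.TMEM,
)  # still raises the OpError shown below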

Error Output

OpError: expects arch to be one of [Arch.sm_100a], but got Arch.sm_121a

Traceback (most recent call last):
  File "test_fp4.py", line 6, in <module>
    mma_op = MmaMXF4NVF4Op(...)
  File ".../mma.py", line 311, in __post_init__
    raise OpError(
        self,
        f"expects arch to be one of {self.admissible_archs}, but got {arch}",
        suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
    )

Expected Behavior

Based on the CUTLASS 4.2 changelog: "Support for Blackwell SM121 kernels for DGX Spark GPUs. Share the major codes with Blackwell SM120 kernels."

The Python DSL should allow FP4 operations on sm_120 (RTX 5090) and sm_121 (GB10) architectures, consistent with C++ API support.

Proposed Fix

admissible_archs = [
    Arch.sm_100a,  # B200/B100 datacenter Blackwell
    Arch.sm_120a,  # RTX 5090 GeForce Blackwell (compute 12.0)
    Arch.sm_121a,  # GB10 DGX Spark Blackwell (compute 12.1)
]

Validation

  • ✅ Hardware verified: GB10 has 5th-generation Tensor Cores with FP4 support (1 PFLOPS peak)
  • ✅ C++ API works: vLLM successfully uses CUTLASS FP4 on sm_120 via C++ API (https://github.com/vllm-project/vllm/pull/21309)
  • ✅ Patch tested: Applying the proposed fix eliminates the architecture error on GB10 (a smoke-test sketch follows this list)
  • ⚠️ Kernel availability: Pre-compiled sm_121 kernels may still be unavailable (separate issue)
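
A minimal smoke-test sketch corresponding to the "Patch tested" bullet above; the exact check run on GB10 is an assumption, not quoted from the test run:

# Hedged smoke test: construct the FP4 MMA op and report whether the arch
# check passes. On an unpatched install this prints the OpError; after the
# proposed fix it should print success.
from cutlass import Float8E8M0FNU
from cutlass.cute.nvgpu.tcgen05 import MmaMXF4NVF4Op, CtaGroup, OperandSource

try:
    MmaMXF4NVF4Op(
        sf_dtype=Float8E8M0FNU,
        instruction_shape=(16, 16, 64),
        cta_group=CtaGroup.ONE,
        a_src=OperandSource.TMEM,
    )
    print("arch check passed: FP4 MMA op constructed")
except Exception as exc:  # OpError on an unpatched install
    print(f"arch check failed: {exc}")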

Impact

Affected Users:

  • ❌ All RTX 5090 users (sm_120)
  • ❌ All DGX Spark GB10 users (sm_121)
  • ❌ Potentially RTX 5080/5070/5060 users (also sm_120)

Workaround: Users can manually patch the installed package (or monkey-patch it at runtime; see the sketch after this list), but manual patching:

  • Requires editing system packages (not ideal)
  • Gets overwritten on package upgrades
  • Isn't discoverable to most users
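
A hedged runtime monkey-patch that avoids editing installed files, assuming admissible_archs is a plain class attribute (as in the excerpt above), that Arch.sm_120a/Arch.sm_121a exist in the enum, and that the FP4 ops inherit the arch check from BlockScaledMmaOp:

# Hedged sketch, not an official workaround: widen admissible_archs at runtime.
# Assumes the module path from the bug location above and that MmaMXF4NVF4Op
# inherits BlockScaledMmaOp's __post_init__ arch check.
from cutlass import Arch
from cutlass.cute.nvgpu.tcgen05.mma import BlockScaledMmaOp

BlockScaledMmaOp.admissible_archs = [
    Arch.sm_100a,
    Arch.sm_120a,  # assumed enum member, as in the proposed fix above
    Arch.sm_121a,
]
# Subsequent MmaMXF4NVF4Op(...) construction should then pass the arch check,
# though sm_121 kernel availability remains a separate question (see above).

This relies on internal attributes and may break across releases, so a proper fix in the package is still preferable.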

Additional Context

CUTLASS Changelog References

  • https://docs.nvidia.com/cutlass/4.2.1/CHANGELOG.html: "Support for Blackwell SM121 kernels for DGX Spark GPUs"
  • https://docs.nvidia.com/cutlass/CHANGELOG.html: "Blockscaled ultra fp4 dense GEMM" and "SM120 mixed input blockscaled grouped GEMM"

Related Work

  • vLLM PR #21309: Demonstrates CUTLASS NVFP4 working on sm_120 via C++ API
  • cuBLAS 12.9: Supports FP4 block-scaled operations on Blackwell

Hardware Specifications

  • sm_100: B200/B100 datacenter GPUs
  • sm_120: RTX 5090/5080/5070/5060 consumer GPUs
  • sm_121: GB10 superchip (DGX Spark, Project DIGITS)

All have 5th-generation Tensor Cores with hardware FP4 support.

Request

Please update BlockScaledMmaOp.admissible_archs to include Arch.sm_120a and Arch.sm_121a to match the C++ API's architecture support.

This will enable the Python DSL FP4 functionality on the full Blackwell GPU family, not just datacenter variants.

Thank you for maintaining this excellent library!
