tilelang icon indicating copy to clipboard operation
tilelang copied to clipboard

[Language] support `T.gemm_sp_v2` on sm80 and sm89

Open botbw opened this issue 2 months ago β€’ 2 comments

Roadmap:

  • [ ] Clear TODOs

SM8x

  • [x] bf16/fp16
  • [x] customized metadata layout
  • [x] tf32
    • [ ] precision issue due to using fp32 as tf32
  • [x] int8
  • [x] fp8
  • [ ] different scopes
    • [ ] sss
    • [ ] srs
    • [ ] rss
    • [ ] rrs
    • [ ] metadata in register (?)

Misc

  • [x] provide compression utils example
  • [ ] perf optimization
    • [ ] metadata ldsm
  • [x] Doc

botbw avatar Oct 17 '25 09:10 botbw

πŸ‘‹ Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! πŸš€

github-actions[bot] avatar Oct 17 '25 09:10 github-actions[bot]

Walkthrough

Adds a new gemm_sp_v2 TileLang API and TL operator (GemmSPPy and GemmSPWarpPolicy), a SparseTensorCore intrinsics emitter and many SP layout helpers, refactors metadata layout API to make_cutlass_metadata_layout, updates examples/benchmarks/tests to use the new API, and exposes related utilities and tests.

Changes

Cohort / File(s) Summary
FFI / C++ op & reflection
src/op/gemm_sp.cc, src/op/gemm_sp.h
Register tl.GemmSPWarpPolicy, add reflection and FFI entry GemmSPWarpPolicyComputeWarpPartition; add reflection registration function.
C++ TileOp wiring
src/op/gemm_sp_py.cc, src/op/gemm_sp_py.h
New GemmSPPy TileOperator: arg deserialization, Clone, GetGemmInst/CheckWGMMA, Lower/InferLayout, TL op registration and static reflection init.
TileLang public API
tilelang/language/experimental/gemm_sp.py, tilelang/language/__init__.py
Add and export gemm_sp_v2 (argument legalization, strides/offsets, dispatch to tl.gemm_sp_py).
TileOp layer (Python)
tilelang/tileop/gemm_sp/*, tilelang/tileop/__init__.py
New GemmSPPy export; add GemmSPBase (accessors) and GemmSPMMA (infer_layout and lower for ss/sr/rs/rr).
Sparse emitter & layouts
tilelang/intrinsics/mma_sp_macro_generator.py, tilelang/intrinsics/mma_sp_layout.py, tilelang/intrinsics/mma_layout.py, tilelang/intrinsics/mma_macro_generator.py
Add SparseTensorCoreIntrinEmitter (ldmatrix/loads/mma_sp/stmatrix), many SP layout helpers, and 32x8β†’16x16 load layouts for non-ldmatrix fallbacks.
Layout API refactor
tilelang/layout/gemm_sp.py, tilelang/layout/__init__.py
Replace make_metadata_layout with make_cutlass_metadata_layout, remove backend arg, add SM90/SM8x creators and arch dispatch; update callers.
TileLang IR / policy
tilelang/ir.py
Add GemmSPWarpPolicy object and compute_warp_partition(..., bits) delegating to FFI.
TileOp typing
tilelang/tileop/gemm/__init__.py
Add type hints to gemm_py_infer_layout and gemm_py_lower signatures.
Utils: tensor & sparse
tilelang/utils/tensor.py, tilelang/utils/sparse.py
Add is_float8, fp8_remove_negative_zeros_, extend TensorSupplyType; add randint_semi_sparse and dtype-aware compress/randn behavior.
Templates / Debug / Common
src/tl_templates/cuda/debug.h, src/tl_templates/cuda/common.h
Add debug_print_buffer_value<uint16_t> specialization and new make_int4(short...) overload.
Benchmarks / Examples
benchmark/matmul/benchmark_matmul_sp.py, examples/gemm_sp/*, examples/sparse_tensorcore/tilelang_example_sparse_tensorcore.py, examples/gemm_sp/example_custom_compress.py
Update matmul_sp signature to accept in_dtype and call T.gemm_sp_v2; switch to make_cutlass_metadata_layout; add example custom compressor and config constants; adjust imports and CLI defaults.
Tests
examples/gemm_sp/test_example_gemm_sp.py, testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp.py, testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
Add example tests; add/refactor comprehensive gemm_sp and gemm_sp_v2 tests with dtype-aware input generators and dense-reference comparisons.
Docs / Index
docs/deeplearning_operators/matmul_sparse.md, docs/index.md
Add new matmul_sparse documentation and register it in docs index.
Profiler exports
tilelang/profiler/__init__.py
Import and re-export is_float8 from tilelang.utils.tensor (replace local impl).

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant TileLang as gemm_sp_v2
    participant TL_Op as GemmSPPy
    participant TileOp as GemmSPMMA
    participant Emitter as SparseTensorCoreIntrinEmitter

    User->>TileLang: call gemm_sp_v2(A_sparse, E, B, C, ...)
    TileLang->>TL_Op: construct GemmSPPy node (buffers, args)
    TL_Op->>TileOp: infer_layout(target, thread_nums)
    TileOp->>Emitter: build emitter for pattern (ss/sr/rs/rr)
    Emitter->>Emitter: ldmatrix / make_mma_load_layout / mma_sp / stmatrix
    Emitter-->>TileOp: fragment/layout map
    TileOp-->>TL_Op: layout_map
    TL_Op->>TileOp: lower(target, thread_nums, thread_var)
    TileOp->>Emitter: emit lowering -> prim_func
    TileOp-->>TL_Op: lowered kernel
    TL_Op-->>User: compiled kernel handle

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

  • Attention points:
    • tilelang/intrinsics/mma_sp_macro_generator.py β€” large emitter with dense index/thread math and dtype-conditional branches.
    • tilelang/tileop/gemm_sp/gemm_sp_mma.py β€” multiple kernel variants, warp partitioning, lowering complexity.
    • src/op/gemm_sp_py.cc / src/op/gemm_sp_py.h β€” FFI deserialization, pointer/stride handling and lowering hooks.
    • tilelang/layout/gemm_sp.py β€” metadata layout math, arch dispatch, FP8 handling.
    • benchmark/examples/tests β€” confirm API updates (matmul_sp signature, in_dtype propagation, make_cutlass_metadata_layout, T.gemm_sp_v2).

Possibly related issues

  • tile-ai/tilelang#935 β€” New gemm_sp_v2 API and Cutlass metadata layout migration align with this issue's objectives.

Possibly related PRs

  • tile-ai/tilelang#691 β€” Modifies the same benchmark matmul path and metadata/layout handling; likely overlap.
  • tile-ai/tilelang#783 β€” Related GEMM-SP operator, warp-policy, and FFI binding changes.
  • tile-ai/tilelang#1129 β€” Overlaps low-level CUDA template integer packing helpers (related make_int4 additions).

Suggested reviewers

  • LeiWang1999
  • chengyupku

Poem

πŸ‡ I hopped through fragments, layouts bright,
I packed int4s and tuned the thread's flight.
gemm_sp_v2 sings sparse delight,
Metadata mapped, kernels take flight.
A rabbit cheers β€” sparse math feels light!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.18% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
βœ… Passed checks (2 passed)
Check name Status Explanation
Description Check βœ… Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check βœ… Passed The title clearly and specifically describes the main change: adding support for T.gemm_sp_v2 on SM80 and SM89 architectures, which aligns with the core objective documented in the PR.
✨ Finishing touches
  • [ ] πŸ“ Generate docstrings
πŸ§ͺ Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment

[!TIP]

πŸ“ Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests β€” including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. πŸ“ Description β€” Summarize the main change in 50–60 words, explaining what was done.
  2. πŸ““ References β€” List relevant issues, discussions, documentation, or related PRs.
  3. πŸ“¦ Dependencies & Requirements β€” Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. πŸ“Š Contributor Summary β€” Include a Markdown table showing contributions: | Contributor | Lines Added | Lines Removed | Files Changed |
  5. βœ”οΈ Additional Notes β€” Add any extra reviewer context. Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❀️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Oct 17 '25 09:10 coderabbitai[bot]

we're good to go if we can resolve the conflict and I think then we can let this pr in.

LeiWang1999 avatar Nov 13 '25 16:11 LeiWang1999