[Language] support `T.gemm_sp_v2` on sm80 and sm89
Roadmap:
- [ ] Clear TODOs

SM8x
- [x] bf16/fp16
- [x] customized metadata layout
- [x] tf32
  - [ ] precision issue due to using fp32 as tf32
- [x] int8
- [x] fp8
- [ ] different scopes
  - [ ] sss
  - [ ] srs
  - [ ] rss
  - [ ] rrs
- [ ] metadata in register (?)

Misc
- [x] provide compression utils example
- [ ] perf optimization
  - [ ] metadata ldsm
- [x] Doc
Hi! Thank you for contributing to the TileLang project.
Please remember to run `pre-commit run --all-files` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.
We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!
Walkthrough
Adds a new gemm_sp_v2 TileLang API and TL operator (GemmSPPy and GemmSPWarpPolicy), a SparseTensorCore intrinsics emitter and many SP layout helpers, refactors metadata layout API to make_cutlass_metadata_layout, updates examples/benchmarks/tests to use the new API, and exposes related utilities and tests.
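As background for the compression and metadata pieces of this walkthrough: sparse tensor cores on SM8x consume 2:4 structured sparsity, i.e. two non-zeros in every group of four elements, with the kept positions recorded as 2-bit indices in a metadata buffer. A minimal pure-Python sketch of that idea (illustrative only; the in-tree utilities live in `tilelang/utils/sparse.py` and use the Cutlass metadata encoding, and the function name and packing below are made up for this example):

```python
def compress_2to4(row):
    """Keep the 2 largest-magnitude values in each group of 4.

    Returns (values, metadata): the compressed values, and one
    metadata byte per group packing the two kept column indices
    as 2-bit fields (toy encoding, not the Cutlass layout).
    """
    assert len(row) % 4 == 0
    values, metadata = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        # Indices of the two largest-magnitude entries, kept in column order.
        keep = sorted(sorted(range(4), key=lambda i: -abs(group[i]))[:2])
        values.extend(group[i] for i in keep)
        metadata.append(keep[0] | (keep[1] << 2))
    return values, metadata

vals, meta = compress_2to4([0.0, 1.5, 0.0, -2.0, 3.0, 0.0, 0.5, 0.0])
# group 0 keeps columns 1 and 3 -> values [1.5, -2.0], meta 0b1101
# group 1 keeps columns 0 and 2 -> values [3.0, 0.5],  meta 0b1000
```

The real compressor additionally has to reorder the metadata into the hardware-expected swizzled layout, which is what the `make_cutlass_metadata_layout` refactor below addresses.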
Changes
| Cohort / File(s) | Summary |
|---|---|
| FFI / C++ op & reflection: `src/op/gemm_sp.cc`, `src/op/gemm_sp.h` | Register `tl.GemmSPWarpPolicy`, add reflection and FFI entry `GemmSPWarpPolicyComputeWarpPartition`; add reflection registration function. |
| C++ TileOp wiring: `src/op/gemm_sp_py.cc`, `src/op/gemm_sp_py.h` | New `GemmSPPy` TileOperator: arg deserialization, `Clone`, `GetGemmInst`/`CheckWGMMA`, `Lower`/`InferLayout`, TL op registration and static reflection init. |
| TileLang public API: `tilelang/language/experimental/gemm_sp.py`, `tilelang/language/__init__.py` | Add and export `gemm_sp_v2` (argument legalization, strides/offsets, dispatch to `tl.gemm_sp_py`). |
| TileOp layer (Python): `tilelang/tileop/gemm_sp/*`, `tilelang/tileop/__init__.py` | New `GemmSPPy` export; add `GemmSPBase` (accessors) and `GemmSPMMA` (`infer_layout` and `lower` for ss/sr/rs/rr). |
| Sparse emitter & layouts: `tilelang/intrinsics/mma_sp_macro_generator.py`, `tilelang/intrinsics/mma_sp_layout.py`, `tilelang/intrinsics/mma_layout.py`, `tilelang/intrinsics/mma_macro_generator.py` | Add `SparseTensorCoreIntrinEmitter` (ldmatrix/loads/mma_sp/stmatrix), many SP layout helpers, and 32x8→16x16 load layouts for non-ldmatrix fallbacks. |
| Layout API refactor: `tilelang/layout/gemm_sp.py`, `tilelang/layout/__init__.py` | Replace `make_metadata_layout` with `make_cutlass_metadata_layout`, remove backend arg, add SM90/SM8x creators and arch dispatch; update callers. |
| TileLang IR / policy: `tilelang/ir.py` | Add `GemmSPWarpPolicy` object and `compute_warp_partition(..., bits)` delegating to FFI. |
| TileOp typing: `tilelang/tileop/gemm/__init__.py` | Add type hints to `gemm_py_infer_layout` and `gemm_py_lower` signatures. |
| Utils: tensor & sparse: `tilelang/utils/tensor.py`, `tilelang/utils/sparse.py` | Add `is_float8`, `fp8_remove_negative_zeros_`, extend `TensorSupplyType`; add `randint_semi_sparse` and dtype-aware compress/randn behavior. |
| Templates / Debug / Common: `src/tl_templates/cuda/debug.h`, `src/tl_templates/cuda/common.h` | Add `debug_print_buffer_value<uint16_t>` specialization and new `make_int4(short...)` overload. |
| Benchmarks / Examples: `benchmark/matmul/benchmark_matmul_sp.py`, `examples/gemm_sp/*`, `examples/sparse_tensorcore/tilelang_example_sparse_tensorcore.py`, `examples/gemm_sp/example_custom_compress.py` | Update `matmul_sp` signature to accept `in_dtype` and call `T.gemm_sp_v2`; switch to `make_cutlass_metadata_layout`; add example custom compressor and config constants; adjust imports and CLI defaults. |
| Tests: `examples/gemm_sp/test_example_gemm_sp.py`, `testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp.py`, `testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py` | Add example tests; add/refactor comprehensive gemm_sp and gemm_sp_v2 tests with dtype-aware input generators and dense-reference comparisons. |
| Docs / Index: `docs/deeplearning_operators/matmul_sparse.md`, `docs/index.md` | Add new matmul_sparse documentation and register it in docs index. |
| Profiler exports: `tilelang/profiler/__init__.py` | Import and re-export `is_float8` from `tilelang.utils.tensor` (replace local impl). |
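One utility worth a note: `fp8_remove_negative_zeros_` exists because FP8 has a signed zero, and in both e4m3 and e5m2 the byte with only the sign bit set (`0x80`) encodes `-0.0`, which can trip up bit-level comparisons against a dense reference. A byte-level sketch of the idea (the in-tree helper presumably operates in place on tensors; this standalone version with a hypothetical name works on raw encoding bytes):

```python
def remove_negative_zeros(fp8_bytes):
    """Map the FP8 encoding of -0.0 (0x80) to +0.0 (0x00).

    Works for both e4m3 and e5m2: in either format, the byte with
    only the sign bit set encodes negative zero.
    """
    return bytes(0x00 if b == 0x80 else b for b in fp8_bytes)

cleaned = remove_negative_zeros(bytes([0x80, 0x3F, 0x00, 0x80]))
# -> b'\x00\x3f\x00\x00'
```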
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant User
    participant TileLang as gemm_sp_v2
    participant TL_Op as GemmSPPy
    participant TileOp as GemmSPMMA
    participant Emitter as SparseTensorCoreIntrinEmitter
    User->>TileLang: call gemm_sp_v2(A_sparse, E, B, C, ...)
    TileLang->>TL_Op: construct GemmSPPy node (buffers, args)
    TL_Op->>TileOp: infer_layout(target, thread_nums)
    TileOp->>Emitter: build emitter for pattern (ss/sr/rs/rr)
    Emitter->>Emitter: ldmatrix / make_mma_load_layout / mma_sp / stmatrix
    Emitter-->>TileOp: fragment/layout map
    TileOp-->>TL_Op: layout_map
    TL_Op->>TileOp: lower(target, thread_nums, thread_var)
    TileOp->>Emitter: emit lowering -> prim_func
    TileOp-->>TL_Op: lowered kernel
    TL_Op-->>User: compiled kernel handle
```
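The `compute_warp_partition` step in this flow decides how the available warps are split across the M and N tile dimensions; the real logic sits in C++ behind the `GemmSPWarpPolicyComputeWarpPartition` FFI entry. As a rough illustration of what such a policy computes, here is a toy version assuming the FullRow/FullCol/Square policy names used by TileLang's dense `GemmWarpPolicy` (a simplified sketch, not the in-tree algorithm):

```python
def compute_warp_partition(policy, num_warps):
    """Toy warp partition: split num_warps into (m_warps, n_warps)."""
    if policy == "FullRow":   # all warps tile the M dimension
        return num_warps, 1
    if policy == "FullCol":   # all warps tile the N dimension
        return 1, num_warps
    if policy == "Square":    # most balanced divisor pair
        m = max(d for d in range(1, num_warps + 1)
                if num_warps % d == 0 and d * d <= num_warps)
        return m, num_warps // m
    raise ValueError(f"unknown policy: {policy}")

print(compute_warp_partition("Square", 8))  # -> (2, 4)
```

The actual policy additionally takes the operand bit width (`bits`) into account, which this sketch ignores.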
Estimated code review effort
🎯 5 (Critical) | ⏱️ ~120 minutes
- Attention points:
  - `tilelang/intrinsics/mma_sp_macro_generator.py`: large emitter with dense index/thread math and dtype-conditional branches.
  - `tilelang/tileop/gemm_sp/gemm_sp_mma.py`: multiple kernel variants, warp partitioning, lowering complexity.
  - `src/op/gemm_sp_py.cc` / `src/op/gemm_sp_py.h`: FFI deserialization, pointer/stride handling and lowering hooks.
  - `tilelang/layout/gemm_sp.py`: metadata layout math, arch dispatch, FP8 handling.
  - benchmark/examples/tests: confirm API updates (`matmul_sp` signature, `in_dtype` propagation, `make_cutlass_metadata_layout`, `T.gemm_sp_v2`).
Possibly related issues
- tile-ai/tilelang#935: New gemm_sp_v2 API and Cutlass metadata layout migration align with this issue's objectives.
Possibly related PRs
- tile-ai/tilelang#691: Modifies the same benchmark matmul path and metadata/layout handling; likely overlap.
- tile-ai/tilelang#783: Related GEMM-SP operator, warp-policy, and FFI binding changes.
- tile-ai/tilelang#1129: Overlaps low-level CUDA template integer-packing helpers (related `make_int4` additions).
Suggested reviewers
- LeiWang1999
- chengyupku
Poem
I hopped through fragments, layouts bright,
I packed int4s and tuned the thread's flight.
gemm_sp_v2 sings sparse delight,
Metadata mapped, kernels take flight.
A rabbit cheers; sparse math feels light!
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.18%, which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly and specifically describes the main change: adding support for T.gemm_sp_v2 on SM80 and SM89 architectures, which aligns with the core objective documented in the PR. |
✨ Finishing touches
- [ ] 📝 Generate docstrings
🧪 Generate unit tests (beta)
- [ ] Create PR with unit tests
- [ ] Post copyable unit tests in a comment
We're good to go once the conflict is resolved; I think we can then let this PR in.