
[DOC]: How does cuTile support multistage/warp specialization in GEMM?

Open irasin opened this issue 2 weeks ago • 5 comments

How would you describe the priority of this documentation request?

High

Please provide a link or source to the relevant docs

https://github.com/NVIDIA/cutile-python/blob/main/samples/MatMul.py

Describe the problems in the documentation

Referring to the sample matmul code:

    for k in range(num_tiles_k):
        # Load tile from matrix A.
        # The `index=(bidx, k)` specifies which (M-tile, K-tile) to load
        # from global memory A. `shape=(tm, tk)` defines the size of this tile.
        a = ct.load(A, index=(bidx, k), shape=(tm, tk), padding_mode=zero_pad).astype(dtype)

        # Load tile from matrix B.
        # The `index=(k, bidy)` specifies which (K-tile, N-tile) to load
        # from global memory B. `shape=(tk, tn)` defines the size of this tile.
        b = ct.load(B, index=(k, bidy), shape=(tk, tn), padding_mode=zero_pad).astype(dtype)

        # Perform Matrix Multiplication for the current tiles.
        # `ct.mma` computes the product of the two loaded tiles and accumulates the result.
        accumulator = ct.mma(a, b, accumulator)
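
For reference, here is a plain NumPy sketch of what this loop computes. It only illustrates the tile-indexing semantics described in the excerpt's comments, not the cuTile API; the even-divisibility assumption is mine, for brevity:

    import numpy as np

    def tiled_matmul_reference(A, B, tm, tn, tk):
        # NumPy mirror of the tile loop above: block (bidx, bidy) accumulates
        # over the K-tiles, like one ct.load + ct.mma per iteration.
        M, K = A.shape
        _, N = B.shape
        assert M % tm == 0 and N % tn == 0 and K % tk == 0
        C = np.zeros((M, N), dtype=A.dtype)
        for bidx in range(M // tm):
            for bidy in range(N // tn):
                acc = np.zeros((tm, tn), dtype=A.dtype)
                for k in range(K // tk):
                    a = A[bidx*tm:(bidx+1)*tm, k*tk:(k+1)*tk]    # ct.load(A, index=(bidx, k))
                    b = B[k*tk:(k+1)*tk, bidy*tn:(bidy+1)*tn]    # ct.load(B, index=(k, bidy))
                    acc = acc + a @ b                            # ct.mma(a, b, acc)
                C[bidx*tm:(bidx+1)*tm, bidy*tn:(bidy+1)*tn] = acc
        return C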

  1. It seems no multistage/warp-specialization strategy is applied for GEMM?
  2. If so, I wonder whether cuTile's performance is comparable to CUTLASS/CuTe DSL or Triton, or whether there are relevant optimization passes in cuTile.
  3. I would also like to know how to use cp.async/TMA/mbarrier in cuTile.

(Optional) Propose a correction

No response

Contributing Guidelines

  • [x] I agree to follow cuTile Python's contributing guidelines
  • [x] I have searched the open documentation and have found no duplicates for this documentation request

irasin avatar Dec 07 '25 07:12 irasin

Thanks for the interest in cuTile and for the great questions!

  1. Pipelining & warp specialization strategies are not exposed in the DSL. Instead, they are automatically chosen and applied by the Tile IR compiler.

  2. We encourage you to evaluate the performance on your specific workloads. To help you get started with your own comparison, please check out the samples linked in this comment: https://github.com/NVIDIA/cutile-python/issues/9#issuecomment-3620673607 (a minimal timing sketch follows this list).

  3. Direct control over these primitives is not exposed in the DSL. The compiler automatically selects asynchronous loads and stores, and decides whether or not to use TMA.
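
As a minimal timing sketch for such a comparison: CUDA-event timing works the same for any backend. This assumes PyTorch is available, and `launch_matmul` below is a placeholder for however you wrap the sample kernel launch, not an actual cuTile API:

    import torch

    def time_gpu_ms(fn, iters=100, warmup=10):
        # Rough CUDA-event timing for any callable that enqueues work
        # on the current CUDA stream; returns milliseconds per call.
        for _ in range(warmup):
            fn()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

    M, N, K = 4096, 4096, 4096
    a = torch.randn(M, K, device="cuda", dtype=torch.float16)
    b = torch.randn(K, N, device="cuda", dtype=torch.float16)
    print("torch.matmul:", time_gpu_ms(lambda: torch.matmul(a, b)), "ms")
    # Same harness for a cuTile kernel, e.g.:
    # print("cuTile:", time_gpu_ms(lambda: launch_matmul(a, b)), "ms")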

I hope this helps clarify the design.

blinxt avatar Dec 08 '25 19:12 blinxt

Hi @blinxt, Thank you for your reply.

According to what you said, all related optimizations (pipelining, async loads, compute overlap, etc.) are applied automatically by the compiler. In other words, tileiras includes many hardware-specific optimizations, mapping the same algorithm written in the frontend Python DSL to different hardware-specific execution paradigms, like multistage on Ampere or warp specialization on Hopper, right?

If so, does cutile-python's func IR simply provide a hardware-independent IR representation? (I'm not sure whether it should be called cuTile IR or cutile-python IR.)

Also, I wonder whether the bytecode format is only used internally by tileiras, since we can see some hardcoded mappings, like MmaFOp to opcode 73: https://github.com/NVIDIA/cutile-python/blob/main/src/cuda/tile/_bytecode/encodings.py#L1112-L1152

def encode_MmaFOp(
    code_builder: CodeBuilder,
    result_type: TypeId,
    lhs: Value,
    rhs: Value,
    acc: Value,
) -> Value:
    _buf = code_builder.buf
    # Opcode
    encode_varint(73, _buf)
    # Result types
    encode_typeid(result_type, _buf)
    # Operands
    encode_operand(lhs, _buf)
    encode_operand(rhs, _buf)
    encode_operand(acc, _buf)
    return code_builder.new_op()
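
(For context on `encode_varint`: the sketch below is a standard unsigned-LEB128 varint, a common scheme in bytecode formats. I'm assuming the Tile IR wire format is similar, but it may differ.)

    def encode_uleb128(value: int, buf: bytearray) -> None:
        # Unsigned LEB128: 7 payload bits per byte, MSB set while more follow.
        # Illustration only; the actual Tile IR varint encoding may differ.
        assert value >= 0
        while True:
            byte = value & 0x7F
            value >>= 7
            if value:
                buf.append(byte | 0x80)
            else:
                buf.append(byte)
                return

    buf = bytearray()
    encode_uleb128(73, buf)        # the MmaFOp opcode from the excerpt
    assert buf == bytearray([73])  # values < 128 fit in a single byte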

I'm curious whether this bytecode encoding is the standard LLVM IR or MLIR bytecode format. Or is this bytecode what's known as Tile IR?

irasin avatar Dec 09 '25 02:12 irasin

@irasin That's correct. The compiler decides what hardware-specific optimizations to apply, though we do expose some hints to help you guide the compiler for specific workloads. See https://docs.nvidia.com/cuda/cutile-python/performance.html.

Today, cutile-python lowers to TileIR bytecode, which is then compiled by tileiras into binaries. More tooling to dump MLIR from TileIR bytecode is coming in the future.

blinxt avatar Dec 09 '25 19:12 blinxt

@blinxt Thanks. So you mean the `bytecode_buf` generated here in cutile-python is what you call Tile IR?

    # Compile MLIR module and generate cubin
    with tempfile.NamedTemporaryFile(suffix='.bytecode', prefix=func_ir.qualname,
                                     dir=context.config.temp_dir, delete=False) as f:
        f.write(bytecode_buf)
        f.flush()

        try:
            cubin_file = compile_cubin(f.name, compiler_options, sm_arch,
                                       timeout_sec=context.config.compiler_timeout_sec)
        except TileCompilerError as e:
            if context.config.enable_crash_dump:
                _compiler_crash_dump(func_ir, bytecode_generator, e.message,
                                     e.compiler_flags, e.compiler_version)

            raise e

irasin avatar Dec 10 '25 09:12 irasin

@irasin Yes, bytecode_buf contains Tile IR Bytecode.

blinxt avatar Dec 10 '25 18:12 blinxt