[DOC]: How does cuTile support multistage/warp specialization in GEMM?
How would you describe the priority of this documentation request?
High
Please provide a link or source to the relevant docs
https://github.com/NVIDIA/cutile-python/blob/main/samples/MatMul.py
Describe the problems in the documentation
Referring to the matmul sample code:
```python
for k in range(num_tiles_k):
    # Load a tile from matrix A.
    # `index=(bidx, k)` specifies which (M-tile, K-tile) to load from
    # global memory A. `shape=(tm, tk)` defines the size of this tile.
    a = ct.load(A, index=(bidx, k), shape=(tm, tk), padding_mode=zero_pad).astype(dtype)
    # Load a tile from matrix B.
    # `index=(k, bidy)` specifies which (K-tile, N-tile) to load from
    # global memory B. `shape=(tk, tn)` defines the size of this tile.
    b = ct.load(B, index=(k, bidy), shape=(tk, tn), padding_mode=zero_pad).astype(dtype)
    # Perform matrix multiplication for the current tiles.
    # `ct.mma` computes the product of the two loaded tiles and
    # accumulates the result.
    accumulator = ct.mma(a, b, accumulator)
```
- It seems no multistage/warp-specialization strategy is applied to the GEMM? (See the sketch after this list for what I mean by multistage.)
- If so, I wonder whether cuTile's performance is comparable to CUTLASS/CuTe DSL or Triton, or whether there are relevant optimization passes in cuTile.
- I would also like to know how to use cp.async/TMA/mbarrier in cuTile.
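For reference, by multistage I mean a software-pipelined K-loop that prefetches the next tiles while computing on the current ones, i.e. classic double buffering. A minimal NumPy sketch of that access pattern (purely illustrative; the slicing and the `@` matmul stand in for async copies and MMAs, this is not cuTile API):

```python
import numpy as np

def gemm_double_buffered(A, B, tm, tn, tk):
    # 2-stage (double-buffered) K-loop: prefetch tile k+1 while
    # "computing" on tile k. On a GPU the prefetch would be an async
    # copy (cp.async/TMA) overlapped with the MMA; here it only
    # illustrates the dependency structure.
    M, K = A.shape
    _, N = B.shape
    num_tiles_k = K // tk
    C = np.zeros((M, N), dtype=A.dtype)
    for m0 in range(0, M, tm):
        for n0 in range(0, N, tn):
            acc = np.zeros((tm, tn), dtype=A.dtype)
            a_next = A[m0:m0 + tm, 0:tk]      # prologue: prefetch tile 0
            b_next = B[0:tk, n0:n0 + tn]
            for k in range(num_tiles_k):
                a_cur, b_cur = a_next, b_next
                if k + 1 < num_tiles_k:       # prefetch tile k+1
                    k0 = (k + 1) * tk
                    a_next = A[m0:m0 + tm, k0:k0 + tk]
                    b_next = B[k0:k0 + tk, n0:n0 + tn]
                acc += a_cur @ b_cur          # "mma" on the current tiles
            C[m0:m0 + tm, n0:n0 + tn] = acc
    return C

A, B = np.random.rand(64, 32), np.random.rand(32, 48)
assert np.allclose(gemm_double_buffered(A, B, 16, 16, 8), A @ B)
```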
(Optional) Propose a correction
No response
Contributing Guidelines
- [x] I agree to follow cuTile Python's contributing guidelines
- [x] I have searched the open documentation and have found no duplicates for this documentation request
Thanks for the interest in cuTile and for the great questions!
- Pipelining & warp specialization strategies are not exposed in the DSL. Instead, they are automatically chosen and applied by the Tile IR compiler (see the conceptual sketch below this list).
- We encourage you to evaluate the performance on your specific workloads. To help you get started with your own comparison, please check out the samples linked in this comment: https://github.com/NVIDIA/cutile-python/issues/9#issuecomment-3620673607.
- Direct control over these primitives is not exposed in the DSL. The compiler automatically selects asynchronous loads and stores, and decides whether or not to use TMA.
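For intuition: warp specialization conceptually dedicates some warps to data movement and others to math, i.e. a producer/consumer pipeline over a fixed number of shared-memory buffers. A rough host-side Python analogy (purely illustrative, not cuTile API; the bounded queue plays the role of mbarrier-guarded buffers):

```python
import queue
import threading

tiles_in = queue.Queue(maxsize=2)  # queue depth ~ number of pipeline stages

def producer(num_tiles: int) -> None:
    # Producer "warps" only move data. On Hopper this role would issue
    # TMA copies into shared memory and arrive on an mbarrier.
    for k in range(num_tiles):
        tiles_in.put(f"tile-{k}")
    tiles_in.put(None)  # sentinel: no more tiles

def consumer(out: list) -> None:
    # Consumer "warps" only compute. On Hopper this role would wait on
    # the mbarrier, then issue MMAs on the buffered tile.
    while (tile := tiles_in.get()) is not None:
        out.append(f"mma({tile})")

out: list = []
t = threading.Thread(target=producer, args=(8,))
t.start()
consumer(out)
t.join()
print(out)  # the compute overlaps with production of later tiles
```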
I hope this helps clarify the design.
Hi @blinxt, thank you for your reply.
According to what you said, all related optimizations (pipelining, async loads, compute overlap, etc.) are applied automatically by the compiler. In other words, tileiras includes many hardware-specific optimizations, mapping the same algorithm written in the frontend Python DSL to different hardware-specific execution paradigms, like multistage on Ampere or warp specialization on Hopper, right?
If so, does cutile-python's func IR simply provide a hardware-independent IR representation? (I'm not sure whether it should be called cuTile IR or cutile-python IR.)
Also, I wonder whether the bytecode format is used only internally by tileiras, since we can see some hardcoded values, like mapping MmaFOp to opcode 73: https://github.com/NVIDIA/cutile-python/blob/main/src/cuda/tile/_bytecode/encodings.py#L1112-L1152
```python
def encode_MmaFOp(
    code_builder: CodeBuilder,
    result_type: TypeId,
    lhs: Value,
    rhs: Value,
    acc: Value,
) -> Value:
    _buf = code_builder.buf
    # Opcode
    encode_varint(73, _buf)
    # Result types
    encode_typeid(result_type, _buf)
    # Operands
    encode_operand(lhs, _buf)
    encode_operand(rhs, _buf)
    encode_operand(acc, _buf)
    return code_builder.new_op()
```
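(Side note: I assume `encode_varint` here writes a standard unsigned LEB128-style varint, 7 payload bits per byte plus a continuation bit. A minimal sketch of that scheme, as my assumption rather than the confirmed cutile-python format:)

```python
def leb128_encode(value: int, buf: bytearray) -> None:
    # Unsigned LEB128: low 7 bits per byte; a set high bit means
    # "more bytes follow". Small opcodes like 73 fit in one byte.
    assert value >= 0
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            buf.append(byte | 0x80)
        else:
            buf.append(byte)
            return

buf = bytearray()
leb128_encode(73, buf)           # the MmaFOp opcode
assert buf == bytearray([0x49])  # 73 < 128, so a single byte
```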
I'm curious whether this bytecode encoding is the standard LLVM IR or MLIR bytecode format, or whether this bytecode is what's known as Tile IR.
@irasin That's correct. The compiler decides what hardware-specific optimizations to apply, though we do expose some hints to help you guide the compiler for specific workloads. See https://docs.nvidia.com/cuda/cutile-python/performance.html.
Today, cutile-python lowers to TileIR bytecode, which is then compiled by tileiras into binaries. More tooling to dump MLIR from TileIR bytecode is coming in the future.
@blinxt Thanks.
So you mean the `bytecode_buf` generated here in cutile-python is what you call Tile IR?
```python
# Compile MLIR module and generate cubin
with tempfile.NamedTemporaryFile(suffix='.bytecode', prefix=func_ir.qualname,
                                 dir=context.config.temp_dir, delete=False) as f:
    f.write(bytecode_buf)
    f.flush()
    try:
        cubin_file = compile_cubin(f.name, compiler_options, sm_arch,
                                   timeout_sec=context.config.compiler_timeout_sec)
    except TileCompilerError as e:
        if context.config.enable_crash_dump:
            _compiler_crash_dump(func_ir, bytecode_generator, e.message,
                                 e.compiler_flags, e.compiler_version)
        raise e
```
@irasin Yes, `bytecode_buf` contains Tile IR bytecode.