iree
[RFC] Export multiple executables as part of an auto-tune group?
Request description
Triton relies heavily on auto-tuning to select the best kernel at run time. The tunable parameters are typically tile/block sizes, which also affect the grid size.
Currently it's possible to compile Triton IR to a single custom dispatch, but there is no clear way to do auto-tuning today.
Proposal: let's add an explicit operation for auto-tunable groups of exports.
Example:
hal.executable.source private @executable attributes {
  objects = #hal.executable.objects<{
    #nvptx_sm_80_target = [
      #hal.executable.object<{
        path = "path/to/a/ptx/with/all/triton/kernels"
      }>
    ]
  }>
} {
  // All variants must have the same layout
  hal.executable.autotuning.export @matmul
      variants([@matmul_tile_32x32, @matmul_tile_64x64])

  hal.executable.export public @matmul_tile_32x32 ordinal(0)
      layout(...) attributes { workgroup_size = [64 : index, 1 : index, 1 : index] } {
  ^bb0(%device: !hal.device, %workload: index):
    hal.return %grid_for_32x32_tiling
  }

  hal.executable.export public @matmul_tile_64x64 ordinal(1)
      layout(...) attributes { workgroup_size = [64 : index, 1 : index, 1 : index] } {
  ^bb0(%device: !hal.device, %workload: index):
    hal.return %grid_for_64x64_tiling
  }
} // hal.executable.source
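To illustrate why each variant needs its own workgroup-count region, here is a rough Python sketch of how the grid could depend on the tile size. The function name, the (x, y, z) ordering, and the 1024 x 512 problem size are assumptions for illustration only; the real computation would live in the export regions above.

```python
def grid_size(m: int, n: int, tile_m: int, tile_n: int) -> tuple:
    """Workgroup count needed to cover an M x N output with tile_m x tile_n tiles.

    Hypothetical helper: uses ceiling division so partial tiles at the
    edges still get a workgroup.
    """
    tiles_x = -(-n // tile_n)  # ceil(n / tile_n) with integer arithmetic
    tiles_y = -(-m // tile_m)  # ceil(m / tile_m)
    return (tiles_x, tiles_y, 1)


# Assumed 1024 x 512 output: the same workload maps to different grids.
grid_32 = grid_size(1024, 512, 32, 32)  # (16, 32, 1)
grid_64 = grid_size(1024, 512, 64, 64)  # (8, 16, 1)
```

Since the grid differs per variant, the auto-tuning group cannot share one workgroup-count computation across its exports, even though the layout is shared.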
At run time we'll have two options:
- hal.executable.autotuning.export will start with round-robin kernel selection to collect statistics; once the numbers are stable it will always run the best kernel.
- Add an @__autotune function (similar to globals initialization) that will do auto-tuning with fake data, so that we get reproducible runs of the "main" computation.
  - Automatically do auto-tuning at module initialization time?
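The first option can be sketched on the host side roughly as follows. This is a minimal Python model, not IREE runtime code: the class name, the fixed samples-per-variant threshold (standing in for "once the numbers are stable"), and the use of a host clock are all assumptions. A real implementation would need device-side timing or synchronization around each dispatch.

```python
import statistics
import time
from typing import Callable, Dict, List, Optional


class AutotuneGroup:
    """Round-robin over variants until each has enough samples, then pin the fastest.

    Hypothetical model of an auto-tuning export group; not an IREE API.
    """

    def __init__(self, variants: Dict[str, Callable[..., object]],
                 samples_per_variant: int = 5,
                 clock: Callable[[], float] = time.perf_counter):
        self._variants = variants
        self._names: List[str] = list(variants)
        self._samples_needed = samples_per_variant
        self._clock = clock
        self._timings: Dict[str, List[float]] = {n: [] for n in self._names}
        self._next = 0
        self.best: Optional[str] = None  # set once tuning has finished

    def __call__(self, *args):
        # After tuning, always dispatch the selected variant.
        if self.best is not None:
            return self._variants[self.best](*args)

        # Tuning phase: pick the next variant in round-robin order and time it.
        name = self._names[self._next]
        self._next = (self._next + 1) % len(self._names)
        start = self._clock()
        result = self._variants[name](*args)
        self._timings[name].append(self._clock() - start)

        # Assumed stability criterion: a fixed number of samples per variant,
        # compared by median to reduce the effect of outliers.
        if all(len(t) >= self._samples_needed for t in self._timings.values()):
            self.best = min(self._names,
                            key=lambda n: statistics.median(self._timings[n]))
        return result
```

Note that every call during the tuning phase still produces a real result, so the "main" computation is correct throughout, but its timing is not reproducible until a variant is pinned. That is the trade-off the @__autotune option avoids by tuning up front on fake data.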
Questions:
- Should all exported variants be a part of a single executable (PTX)?
What component(s) does this issue relate to?
Compiler, Runtime