
[RFC] Export multiple executables as a part of auto-tune group?

Open ezhulenev opened this issue 1 year ago • 5 comments

Request description

Triton relies heavily on run-time auto-tuning to select the best kernel. The tunable parameters are typically tile/block sizes, which also affect the grid size.
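To make the dependence of the grid on the tile size concrete, here is a minimal sketch in Python. It assumes a 2D matmul output of shape M x N covered by square tiles with one workgroup per tile. The function name `grid_for_tiling` is hypothetical and is not part of IREE or Triton.

```python
def grid_for_tiling(m: int, n: int, tile: int) -> tuple:
    """Workgroup count along each output dimension (ceiling division).

    Assumption for illustration: one workgroup per square output tile.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    # -(-a // b) is ceiling division for positive integers.
    return (-(-m // tile), -(-n // tile))


# The same 1024x1024 problem needs a different grid for each tile size,
# so each tuned variant has to carry its own workgroup-count computation.
grid_32 = grid_for_tiling(1024, 1024, 32)
grid_64 = grid_for_tiling(1024, 1024, 64)
```

This is why each export in an auto-tune group would need its own workgroup-count region rather than sharing one.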

Currently it's possible to compile Triton IR to a single custom dispatch, but there is no clear way to do auto-tuning.

Proposal: let's add an explicit operation for auto-tunable groups of exports.

Example:

  hal.executable.source private @executable attributes {
    objects = #hal.executable.objects<{
      #nvptx_sm_80_target = [
        #hal.executable.object<{
          path = "path/to/a/ptx/with/all/triton/kernels"
        }>
      ]
    }>
  } {

    // All variants must have the same layout
    hal.executable.autotuning.export @matmul
      variants([@matmul_tile_32x32, @matmul_tile_64x64])

    hal.executable.export public @matmul_tile_32x32 ordinal(0)
        layout(...) attributes { workgroup_size = [64 : index, 1 : index, 1 : index] } {
    ^bb0(%device: !hal.device, %workload: index):
      hal.return %grid_for_32x32_tiling 
    }

    hal.executable.export public @matmul_tile_64x64 ordinal(1)
        layout(...) attributes { workgroup_size = [64 : index, 1 : index, 1 : index] } {
    ^bb0(%device: !hal.device, %workload: index):
      hal.return %grid_for_64x64_tiling 
    }

  }  // hal.executable.source

At run time we have a few options:

  1. hal.executable.autotuning.export will start with round-robin kernel selection to collect statistics; once the numbers are stable, it will always run the best kernel
  2. Add an @__autotune function (similar to globals initialization) that will do auto-tuning with fake data, so that we get reproducible runs of the "main" computation
  3. Automatically do auto-tuning at module initialization time?
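The first option can be sketched roughly as follows, in Python for brevity. This is an illustration of the selection policy only, not a proposed IREE runtime API: the class `AutotuneGroup` and its methods are hypothetical, and "numbers are stable" is approximated by a fixed number of samples per variant, which a real implementation would likely replace with a variance-based check.

```python
from collections import defaultdict
from statistics import mean


class AutotuneGroup:
    """Hypothetical runtime-side selector for a group of kernel variants."""

    def __init__(self, variants, samples_per_variant=3):
        self.variants = list(variants)
        # Assumption: a fixed sample count stands in for "numbers are stable".
        self.samples_per_variant = samples_per_variant
        self.timings = defaultdict(list)
        self.calls = 0
        self.best = None  # set once enough statistics are collected

    def select(self):
        """Return the variant to run for the next dispatch."""
        if self.best is not None:
            return self.best
        # Round-robin over the variants while still collecting statistics.
        variant = self.variants[self.calls % len(self.variants)]
        self.calls += 1
        return variant

    def record(self, variant, elapsed):
        """Record a measured run time and lock in a winner when ready."""
        if self.best is not None:
            return
        self.timings[variant].append(elapsed)
        ready = all(
            len(self.timings[v]) >= self.samples_per_variant
            for v in self.variants
        )
        if ready:
            self.best = min(self.variants, key=lambda v: mean(self.timings[v]))
```

A caller would invoke `select()` before each dispatch and `record()` with the measured time afterwards. Note that with this policy the early dispatches run different kernels, which is the reproducibility concern that option 2 is meant to address.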

Questions:

  1. Should all exported variants be a part of a single executable (PTX)?

What component(s) does this issue relate to?

Compiler, Runtime

Additional context

No response

ezhulenev, May 30 '23 20:05