Matmul dispatch failing to compile due to shared memory over-allocation
What happened?
A matmul dispatch is over-allocating shared memory on the MI300X (gfx942) target and fails to compile with the following error:
./vae_decomp_f32_dps/compiled_vae_decode$async_dispatch_18.mlir:9:6: error: 'func.func' op uses 86016 bytes of shared memory; exceeded the limit of 65536 bytes
  func.func @decode$async_dispatch_18_matmul_transpose_b_16384x512x512_f32() {
  ^
./vae_decomp_f32_dps/compiled_vae_decode$async_dispatch_18.mlir:2:2: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"rocm", "rocm-hsaco-fb", {iree.gpu.target = #iree_gpu.target<arch = "gfx942", features = "", wgp = <compute = fp64|fp32|fp16|int64|int32|int16|int8, storage = b64|b32|b16|b8, subgroup = shuffle|arithmetic, dot = dp4xi8toi32, mma = [<MFMA_F16_16x16x16_F32>, <MFMA_F16_32x32x8_F32>], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536>>, ukernels = "none", waves_per_eu = 2 : i64}>
  hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {iree.gpu.target = #iree_gpu.target<arch = "gfx942", features = "", wgp = <compute = fp64|fp32|fp16|int64|int32|int16|int8, storage = b64|b32|b16|b8, subgroup = shuffle|arithmetic, dot = dp4xi8toi32, mma = [<MFMA_F16_16x16x16_F32>, <MFMA_F16_32x32x8_F32>], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536>>, ukernels = "none", waves_per_eu = 2 : i64}>) {
  ^
failed to translate executables
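For context on the numbers: the reported 86016 bytes is 84 KiB, i.e. 21504 f32 elements, while the gfx942 target attribute caps shared memory at max_workgroup_memory_bytes = 65536 (64 KiB per workgroup). Below is a minimal sketch of the kind of LDS-budget arithmetic involved, assuming the usual scheme of promoting the LHS and RHS workgroup tiles of the matmul to shared memory; the tile sizes and padding in the sketch are hypothetical placeholders, not the configuration the compiler actually selected for this dispatch.

```python
# Rough LDS-budget check for an f32 matmul dispatch, assuming both the
# LHS (m_tile x k_tile) and RHS (n_tile x k_tile) workgroup tiles are
# staged in shared memory. Tile sizes and padding below are hypothetical.

GFX942_LDS_LIMIT = 65536   # max_workgroup_memory_bytes from the target attr
ELEM_BYTES = 4             # f32

def lds_bytes(m_tile: int, n_tile: int, k_tile: int, pad: int = 0) -> int:
    """Bytes of shared memory needed to stage LHS and RHS tiles (optionally K-padded)."""
    return (m_tile * (k_tile + pad) + n_tile * (k_tile + pad)) * ELEM_BYTES

# Hypothetical configuration for the 16384x512x512 matmul_transpose_b:
usage = lds_bytes(m_tile=128, n_tile=128, k_tile=64, pad=4)
print(f"estimated LDS usage: {usage} bytes (limit {GFX942_LDS_LIMIT})")
print("over budget" if usage > GFX942_LDS_LIMIT else "fits")
```

Whatever tile sizes the pipeline picked here, the promoted operands end up over the 64 KiB per-workgroup budget, which is what the verifier is rejecting.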
Steps to reproduce your issue
compile command:
iree-compile \
  --iree-input-type=torch \
  --iree-vm-bytecode-module-output-format=flatbuffer-binary \
  --iree-hal-target-backends=rocm \
  --mlir-print-debuginfo \
  --mlir-print-op-on-diagnostic=false \
  --iree-hal-target-backends=rocm \
  --iree-rocm-target-chip=gfx942 \
  --iree-vm-bytecode-module-output-format=flatbuffer-binary \
  --iree-rocm-waves-per-eu=2 \
  --iree-flow-enable-aggressive-fusion \
  --iree-codegen-llvmgpu-use-vector-distribution=false \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-const-eval=false \
  --iree-opt-outer-dim-concat=true \
  --iree-vm-target-truncate-unsupported-floats \
  --iree-llvmgpu-enable-prefetch=true \
  --iree-opt-data-tiling=false \
  --iree-codegen-gpu-native-math-precision=true \
  --iree-rocm-waves-per-eu=2 \
  --iree-flow-inline-constants-max-byte-length=1 \
  --iree-preprocessing-pass-pipeline="builtin.module(iree-preprocessing-transpose-convolution-pipeline, util.func(iree-preprocessing-pad-to-intrinsics))" \
  vae_decomp_f32_dps/configured_compiled_vae_decode$async_dispatch_18.mlir \
  -o vae.vmfb
input dispatch IR: https://gist.github.com/IanNod/283d68f9aea0dcb50e94d2b2820bbb21
What component(s) does this issue relate to?
Compiler
Version information
c1e542d6370473244a8fa9178615cb8a6041b489
Additional context
No response