Matmul dispatch failing to compile due to shared memory over-allocation
What happened?
A matmul dispatch is over-allocating shared memory on the MI300X (gfx942) target and fails to compile with the following error:
./vae_decomp_f32_dps/compiled_vae_decode$async_dispatch_18.mlir:9:6: error: 'func.func' op uses 86016 bytes of shared memory; exceeded the limit of 65536 bytes
  func.func @decode$async_dispatch_18_matmul_transpose_b_16384x512x512_f32() {
  ^
./vae_decomp_f32_dps/compiled_vae_decode$async_dispatch_18.mlir:2:2: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"rocm", "rocm-hsaco-fb", {iree.gpu.target = #iree_gpu.target<arch = "gfx942", features = "", wgp = <compute = fp64|fp32|fp16|int64|int32|int16|int8, storage = b64|b32|b16|b8, subgroup = shuffle|arithmetic, dot = dp4xi8toi32, mma = [<MFMA_F16_16x16x16_F32>, <MFMA_F16_32x32x8_F32>], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536>>, ukernels = "none", waves_per_eu = 2 : i64}>
  hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {iree.gpu.target = #iree_gpu.target<arch = "gfx942", features = "", wgp = <compute = fp64|fp32|fp16|int64|int32|int16|int8, storage = b64|b32|b16|b8, subgroup = shuffle|arithmetic, dot = dp4xi8toi32, mma = [<MFMA_F16_16x16x16_F32>, <MFMA_F16_32x32x8_F32>], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536>>, ukernels = "none", waves_per_eu = 2 : i64}>) {
  ^
failed to translate executables
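For context on the numbers: the reported 86016 bytes is 84 KiB, i.e. 21504 f32 elements, while the gfx942 target attribute caps shared memory at max_workgroup_memory_bytes = 65536 (64 KiB per workgroup). Below is a minimal sketch of the kind of LDS-budget arithmetic involved, assuming the usual scheme of promoting the LHS and RHS workgroup tiles of the matmul to shared memory; the tile sizes and padding in the sketch are hypothetical placeholders, not the configuration the compiler actually selected for this dispatch.

```python
# Rough LDS-budget check for an f32 matmul dispatch, assuming both the
# LHS (m_tile x k_tile) and RHS (n_tile x k_tile) workgroup tiles are
# staged in shared memory. Tile sizes and padding below are hypothetical.

GFX942_LDS_LIMIT = 65536   # max_workgroup_memory_bytes from the target attr
ELEM_BYTES = 4             # f32

def lds_bytes(m_tile: int, n_tile: int, k_tile: int, pad: int = 0) -> int:
    """Bytes of shared memory needed to stage LHS and RHS tiles (optionally K-padded)."""
    return (m_tile * (k_tile + pad) + n_tile * (k_tile + pad)) * ELEM_BYTES

# Hypothetical configuration for the 16384x512x512 matmul_transpose_b:
usage = lds_bytes(m_tile=128, n_tile=128, k_tile=64, pad=4)
print(f"estimated LDS usage: {usage} bytes (limit {GFX942_LDS_LIMIT})")
print("over budget" if usage > GFX942_LDS_LIMIT else "fits")
```

Whatever tile sizes the pipeline picked here, the promoted operands end up over the 64 KiB per-workgroup budget, which is what the verifier is rejecting.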
Steps to reproduce your issue
compile command:
iree-compile \
  --iree-input-type=torch \
  --iree-vm-bytecode-module-output-format=flatbuffer-binary \
  --iree-hal-target-backends=rocm \
  --mlir-print-debuginfo \
  --mlir-print-op-on-diagnostic=false \
  --iree-hal-target-backends=rocm \
  --iree-rocm-target-chip=gfx942 \
  --iree-vm-bytecode-module-output-format=flatbuffer-binary \
  --iree-rocm-waves-per-eu=2 \
  --iree-flow-enable-aggressive-fusion \
  --iree-codegen-llvmgpu-use-vector-distribution=false \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-const-eval=false \
  --iree-opt-outer-dim-concat=true \
  --iree-vm-target-truncate-unsupported-floats \
  --iree-llvmgpu-enable-prefetch=true \
  --iree-opt-data-tiling=false \
  --iree-codegen-gpu-native-math-precision=true \
  --iree-rocm-waves-per-eu=2 \
  --iree-flow-inline-constants-max-byte-length=1 \
  --iree-preprocessing-pass-pipeline="builtin.module(iree-preprocessing-transpose-convolution-pipeline, util.func(iree-preprocessing-pad-to-intrinsics))" \
  vae_decomp_f32_dps/configured_compiled_vae_decode$async_dispatch_18.mlir \
  -o vae.vmfb
input dispatch IR: https://gist.github.com/IanNod/283d68f9aea0dcb50e94d2b2820bbb21
What component(s) does this issue relate to?
Compiler
Version information
c1e542d6370473244a8fa9178615cb8a6041b489
Additional context
No response