
topk: failed to bufferize op

Alex-Vasile opened this issue 6 months ago · 2 comments

What happened?

I have attached two almost identical MLIRs that contain a TOPK op. One passes and one fails. The passing case performs a sigmoid op on the inputs to topk, whereas the failing one does not.

failing.mlir:43:25: error: failed to bufferize op
    %values, %indices = torch.aten.topk %12, %int2, %int-1_7, %true, %true_8 : !torch.vtensor<[?,4],f32>, !torch.int, !torch.int, !torch.bool, !torch.bool -> !torch.vtensor<[?,2],f32>, !torch.vtensor<[?,2],si64>
                        ^
failing.mlir:43:25: note: see current operation: %20 = "bufferization.alloc_tensor"(%17) <{operandSegmentSizes = array<i32: 0, 1, 0>}> : (tensor<?x4xf32>) -> tensor<?x4xf32>
failing.mlir:43:25: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"rocm", "rocm-hsaco-fb", {abi = "hip", iree.gpu.target = #iree_gpu.target<arch = "gfx942", features = "", wgp = <compute =  fp64|fp32|fp16|int64|int32|int16|int8, storage =  b64|b32|b16|b8, subgroup =  shuffle|arithmetic, dot =  dp4xi8toi32, mma = [<MFMA_F32_16x16x16_BF16>, <MFMA_F32_32x32x8_BF16>, <MFMA_F32_16x16x32_F8E5M2FNUZ>, <MFMA_F32_16x16x32_F8E5M2FNUZ_F8E4M3FNUZ>, <MFMA_F32_16x16x32_F8E4M3FNUZ>, <MFMA_F32_16x16x32_F8E4M3FNUZ_F8E5M2FNUZ>, <MFMA_F32_32x32x16_F8E5M2FNUZ>, <MFMA_F32_32x32x16_F8E5M2FNUZ_F8E4M3FNUZ>, <MFMA_F32_32x32x16_F8E4M3FNUZ>, <MFMA_F32_32x32x16_F8E4M3FNUZ_F8E5M2FNUZ>, <MFMA_I32_16x16x32_I8>, <MFMA_I32_32x32x16_I8>, <MFMA_F64_16x16x4_F64>, <MFMA_F32_16x16x4_F32>, <MFMA_F32_16x16x16_F16>, <MFMA_F32_32x32x8_F16>], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536, max_workgroup_counts = [2147483647, 2147483647, 2147483647], max_load_instruction_bits = 128, simds_per_wgp = 4, vgpr_space_bits = 16384>>, iree_codegen.default_tuning_spec = #rocm.builtin.tuning_module<"iree_default_tuning_spec_gfx942.mlir">, ukernels = "none"}>
    %values, %indices = torch.aten.topk %12, %int2, %int-1_7, %true, %true_8 : !torch.vtensor<[?,4],f32>, !torch.int, !torch.int, !torch.bool, !torch.bool -> !torch.vtensor<[?,2],f32>, !torch.vtensor<[?,2],si64>
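
For context, here is a minimal sketch of the failing pattern, reconstructed from the types shown in the error above; the attached failing.mlir is the authoritative reproducer, and the function name and argument here are made up. Per the description, the passing variant differs only by feeding the topk input through torch.aten.sigmoid first.

func.func @topk_repro(%arg0: !torch.vtensor<[?,4],f32>) -> (!torch.vtensor<[?,2],f32>, !torch.vtensor<[?,2],si64>) {
  %int2 = torch.constant.int 2
  %int-1 = torch.constant.int -1
  %true = torch.constant.bool true
  // k = 2, dim = -1, largest = true, sorted = true
  %values, %indices = torch.aten.topk %arg0, %int2, %int-1, %true, %true : !torch.vtensor<[?,4],f32>, !torch.int, !torch.int, !torch.bool, !torch.bool -> !torch.vtensor<[?,2],f32>, !torch.vtensor<[?,2],si64>
  return %values, %indices : !torch.vtensor<[?,2],f32>, !torch.vtensor<[?,2],si64>
}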

Steps to reproduce your issue

Compile each MLIR using the following:

iree-compile case.mlir -o=model.vmfb --iree-hal-target-device=hip --iree-hip-target=gfx942 --iree-opt-level=O3 --iree-hal-indirect-command-buffers=true --iree-stream-resource-memory-model=discrete --iree-hal-memoization=true

What component(s) does this issue relate to?

Compiler

Version information

iree-3.5.0rc20250605

Additional context

No response

Alex-Vasile · Jun 05 '25

The top_k becomes sort ops; do we really not convert it to a linalg_ext.top_k op?

The difference is that in the passing.mlir version, two readwrite buffers are passed to and used by the sort op. The failing.mlir version passes three buffers: readwrite, readonly, and writeonly. Since the sort op uses the readonly buffer in outs, we will still fail even if we fix the bufferization issue. (Verified on the CPU backend.)

I think the proper fix is to trace back through dispatch creation and figure out why it uses a readonly tensor in the sort op.
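
In sketch form (binding setup elided, SSA names illustrative), the problematic pattern in the dispatch dumped below is roughly:

// The f32 values come from the ReadOnly binding, but iree_linalg_ext.sort
// mutates its outs operands in place, so bufferization has to materialize a
// copy for them (the bufferization.alloc_tensor in the error above).
%in  = iree_tensor_ext.dispatch.tensor.load %readonly_binding, ... -> tensor<?x4xf32>
%idx = iree_tensor_ext.dispatch.tensor.load %readwrite_binding, ... -> tensor<?x4xi64>
%sorted:2 = iree_linalg_ext.sort dimension(1) outs(%in, %idx : tensor<?x4xf32>, tensor<?x4xi64>) { ... }
// The sorted values are then stored to a separate WriteOnly binding instead of
// back into the buffer they were sorted in.
iree_tensor_ext.dispatch.tensor.store %sorted#0, %writeonly_binding, ...
iree_tensor_ext.dispatch.tensor.store %sorted#1, %readwrite_binding, ...

In the passing version, both outs operands are loaded from and stored back to readwrite bindings, so the sort can run fully in place and no allocation is needed.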

Log: https://gist.github.com/hanhanW/dbe50c12996ea071f57961609444ee95

hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {abi = "hip", iree.gpu.target = #iree_gpu.target<arch = "gfx942", features = "", wgp = <compute =  fp64|fp32|fp16|int64|int32|int16|int8, storage =  b64|b32|b16|b8, subgroup =  shuffle|arithmetic, dot =  dp4xi8toi32, mma = [<MFMA_F32_16x16x16_BF16>, <MFMA_F32_32x32x8_BF16>, <MFMA_F32_16x16x32_F8E5M2FNUZ>, <MFMA_F32_16x16x32_F8E5M2FNUZ_F8E4M3FNUZ>, <MFMA_F32_16x16x32_F8E4M3FNUZ>, <MFMA_F32_16x16x32_F8E4M3FNUZ_F8E5M2FNUZ>, <MFMA_F32_32x32x16_F8E5M2FNUZ>, <MFMA_F32_32x32x16_F8E5M2FNUZ_F8E4M3FNUZ>, <MFMA_F32_32x32x16_F8E4M3FNUZ>, <MFMA_F32_32x32x16_F8E4M3FNUZ_F8E5M2FNUZ>, <MFMA_I32_16x16x32_I8>, <MFMA_I32_32x32x16_I8>, <MFMA_F64_16x16x4_F64>, <MFMA_F32_16x16x4_F32>, <MFMA_F32_16x16x16_F16>, <MFMA_F32_32x32x8_F16>], subgroup_size_choices = [64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536, max_workgroup_counts = [2147483647, 2147483647, 2147483647], max_load_instruction_bits = 128, simds_per_wgp = 4, vgpr_space_bits = 16384>>, iree_codegen.default_tuning_spec = #rocm.builtin.tuning_module<"iree_default_tuning_spec_gfx942.mlir">, ukernels = "none"}>) {
  hal.executable.export public @prefill_bs3$async_dispatch_2_sort_Dx4xf32 ordinal(0) layout(#hal.pipeline.layout<constants = 3, bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) count(%arg0: !hal.device, %arg1: index, %arg2: index) -> (index, index, index) {
    %x, %y, %z = iree_tensor_ext.dispatch.workgroup_count_from_slice %arg1, %arg2
    hal.return %x, %y, %z : index, index, index
  }
  builtin.module {
    func.func @prefill_bs3$async_dispatch_2_sort_Dx4xf32() {
      %c0 = arith.constant 0 : index
      %0 = hal.interface.constant.load layout(<constants = 3, bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) ordinal(0) : i32
      %1 = hal.interface.constant.load layout(<constants = 3, bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) ordinal(1) : i32
      %2 = hal.interface.constant.load layout(<constants = 3, bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) ordinal(2) : i32
      %3 = arith.index_castui %0 : i32 to index
      %4 = arith.index_castui %1 : i32 to index
      %5 = arith.index_castui %2 : i32 to index
      %6 = arith.index_castui %2 : i32 to index
      %7:4 = util.assume.int 
          %3<umin = 12288, umax = 86016, udiv = 12288>, 
          %4<umin = 15360, umax = 107520>, 
          %5<umin = 96, umax = 672, udiv = 96>, 
          %6<umin = 96, umax = 672, udiv = 96>
        : index, index, index, index
      %8 = iree_tensor_ext.dispatch.workload.ordinal %7#2, 0 : index
      %9 = iree_tensor_ext.dispatch.workload.ordinal %7#3, 1 : index
      %10 = hal.interface.binding.subspan layout(<constants = 3, bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !iree_tensor_ext.dispatch.tensor<readonly:tensor<?x32xf32>>{%8}
      %11 = hal.interface.binding.subspan layout(<constants = 3, bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%7#0) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<readwrite:tensor<?x4xi64>>{%9}
      %12 = hal.interface.binding.subspan layout(<constants = 3, bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(2) alignment(64) offset(%7#1) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<writeonly:tensor<?x4xf32>>{%9}
      %13 = iree_tensor_ext.dispatch.tensor.load %11, offsets = [0, 0], sizes = [%9, 4], strides = [1, 1] : !iree_tensor_ext.dispatch.tensor<readwrite:tensor<?x4xi64>>{%9} -> tensor<?x4xi64>
      %14 = iree_tensor_ext.dispatch.tensor.load %10, offsets = [0, 0], sizes = [%9, 4], strides = [1, 1] : !iree_tensor_ext.dispatch.tensor<readonly:tensor<?x32xf32>>{%8} -> tensor<?x4xf32>
      %15:2 = iree_linalg_ext.sort dimension(1) outs(%14, %13 : tensor<?x4xf32>, tensor<?x4xi64>) {
      ^bb0(%arg0: f32, %arg1: f32, %arg2: i64, %arg3: i64):
        %16 = arith.cmpf oge, %arg0, %arg1 : f32
        iree_linalg_ext.yield %16 : i1
      } -> tensor<?x4xf32>, tensor<?x4xi64>
      iree_tensor_ext.dispatch.tensor.store %15#0, %12, offsets = [0, 0], sizes = [%9, 4], strides = [1, 1] : tensor<?x4xf32> -> !iree_tensor_ext.dispatch.tensor<writeonly:tensor<?x4xf32>>{%9}
      iree_tensor_ext.dispatch.tensor.store %15#1, %11, offsets = [0, 0], sizes = [%9, 4], strides = [1, 1] : tensor<?x4xi64> -> !iree_tensor_ext.dispatch.tensor<readwrite:tensor<?x4xi64>>{%9}
      return
    }
  }
}

hanhanW · Jun 19 '25

Here is the IR before and after ConvertDispatchRegionsToWorkgroupsPass; the readonly tensor is introduced by the pass.

https://gist.github.com/hanhanW/ed9aa991eb72527b42b58f1d9e268190
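
For reference, a per-pass dump like the one in the gist can usually be reproduced with the generic MLIR IR-printing flags that iree-compile inherits; a sketch (the filtered --mlir-print-ir-before=<pass> spelling for this particular pass is not shown because I have not checked its exact CLI name):

iree-compile failing.mlir -o=model.vmfb --iree-hal-target-device=hip --iree-hip-target=gfx942 --iree-opt-level=O3 --iree-hal-indirect-command-buffers=true --iree-stream-resource-memory-model=discrete --iree-hal-memoization=true --mlir-disable-threading --mlir-print-ir-before-all --mlir-print-ir-after-all 2> pass-dump.txt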

hanhanW · Jun 19 '25