
[DispatchCreation] Failed to hoist out of dispatch in DT-fusion


What happened?

Compilation fails with error: 'iree_encoding.set_encoding' op failed to hoist the op out of dispatch on the DT-fusion path when the producer of the set_encoding op is cloned inside the dispatch region. This is particularly likely on the DT-fusion pipeline, since the set_encoding op does not yet exist at the time of dispatch formation and producer cloning. The extract_slice -> matmul chain therefore becomes extract_slice -> set_encoding -> matmul inside the dispatch, and hoisting the set_encoding out of the dispatch fails.

Steps to reproduce your issue

Here is an artificial reproducer for the issue, although the pattern can also arise naturally from certain decompositions or from handwritten layers. One example is multi-head attention, which performs a three-fold extraction (for Q, K and V) instead of the two-fold one in the reproducer below.

func.func @foo(%arg1 : tensor<8x8xf32>, %arg2 : tensor<8x8xf32>) -> (tensor<8x8xf32>, tensor<8x8xf32>){
    %arg0 = arith.constant dense_resource<__auto.constant_16_8_torch.float32> : tensor<16x8xf32>
    %extracted_slice = tensor.extract_slice %arg0[0, 0] [8, 8] [1, 1] : tensor<16x8xf32> to tensor<8x8xf32>
    %extracted_slice_1 = tensor.extract_slice %arg0[8, 0] [8, 8] [1, 1] : tensor<16x8xf32> to tensor<8x8xf32>
    %c0 = arith.constant 0.0 : f32
    
    %init = tensor.empty() : tensor<8x8xf32>
    %fill = linalg.fill ins(%c0 : f32) outs(%init : tensor<8x8xf32>) -> tensor<8x8xf32>
    %res = linalg.matmul_transpose_b ins(%extracted_slice, %arg1 : tensor<8x8xf32>, tensor<8x8xf32>) outs(%fill : tensor<8x8xf32>) -> tensor<8x8xf32>
    
    %init_1 = tensor.empty() : tensor<8x8xf32>
    %fill_1 = linalg.fill ins(%c0 : f32) outs(%init_1 : tensor<8x8xf32>) -> tensor<8x8xf32>
    %res_1 = linalg.matmul_transpose_b ins(%extracted_slice_1, %arg2 : tensor<8x8xf32>, tensor<8x8xf32>) outs(%fill_1 : tensor<8x8xf32>) -> tensor<8x8xf32>

    return %res, %res_1 : tensor<8x8xf32>, tensor<8x8xf32>
}

To compile:


iree-compile \
    --iree-input-type=auto \
    --iree-hal-target-backends=llvm-cpu \
    --iree-llvmcpu-target-cpu-features=host \
    --iree-opt-data-tiling=false \
    --iree-dispatch-creation-experimental-data-tiling=true \
    --mlir-print-ir-after-all \
    --mlir-print-ir-after-change \
    --mlir-elide-elementsattrs-if-larger=10 \
    --mlir-elide-resource-strings-if-larger=10 \
    ~/reproducer.mlir -o ~/reproducer.vmfb 2> ~/reproducer-logs.mlir

This results in the following snippet where the set_encoding op cannot be hoisted out of the dispatch because its producer is also inside the same dispatch region:

...
%2 = flow.dispatch.region -> (tensor<8x8xf32>) {
    %cst_0 = arith.constant 0.000000e+00 : f32
    %extracted_slice = tensor.extract_slice %cst[0, 0] [8, 8] [1, 1] : tensor<16x8xf32> to tensor<8x8xf32>
    %6 = iree_encoding.set_encoding %extracted_slice : tensor<8x8xf32> -> tensor<8x8xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [8, 8, 8]>>
    %7 = iree_encoding.set_encoding %0 : tensor<8x8xf32> -> tensor<8x8xf32, #iree_encoding.encoding<operand_index = 1 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [8, 8, 8]>>
    %8 = tensor.empty() : tensor<8x8xf32, #iree_encoding.encoding<operand_index = 2 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [8, 8, 8]>>
    %9 = linalg.fill ins(%cst_0 : f32) outs(%8 : tensor<8x8xf32, #iree_encoding.encoding<operand_index = 2 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [8, 8, 8]>>) -> tensor<8x8xf32, #iree_encoding.encoding<operand_index = 2 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [8, 8, 8]>>
    %10 = linalg.matmul_transpose_b ins(%6, %7 : tensor<8x8xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [8, 8, 8]>>, tensor<8x8xf32, #iree_encoding.encoding<operand_index = 1 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [8, 8, 8]>>) outs(%9 : tensor<8x8xf32, #iree_encoding.encoding<operand_index = 2 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [8, 8, 8]>>) -> tensor<8x8xf32, #iree_encoding.encoding<operand_index = 2 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [8, 8, 8]>>
    %11 = iree_encoding.unset_encoding %10 : tensor<8x8xf32, #iree_encoding.encoding<operand_index = 2 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [8, 8, 8]>> -> tensor<8x8xf32>
    flow.return %11 : tensor<8x8xf32>
  }
...

What component(s) does this issue relate to?

Compiler

Version information

No response

Additional context

AFAICS, there is a TODO in the HoistEncodingOps pass noting that we should be able to hoist entire slices/chains of IR out of the dispatch, which we could do here. These would then (possibly with additional changes in the Stream MaterializeEncodings pass) get their own dispatches later.

On the other hand, I think the case where these are actually not hoisted out of the dispatch should also be investigated. I understand the problem with fusing pack operations as producers for some tile sizes (see here), but I believe the tile sizes we choose should be divisible by the inner tile sizes; this also came up during a discussion about the consumer-fusion counterpart of the pack operation.

So I can think of two options for this specific case:

  • Since the extract_slice ops (or, in other cases, some other ops) get cloned into the same region as the matmul by the CloneProducersIntoDispatchRegionsPass, we could preemptively block the cloning of operations that would lead to unhoistable sequences like the one above. I'm not entirely sure which assumptions one would have to make here to guarantee that the set_encoding op actually materializes into something and does not become a no-op.
  • As the TODO above suggests, add a mechanism to hoist the sequence of IR leading to the set_encoding op out of the dispatch (see the sketch below), and possibly adjust the materialization pass to handle these sequences as well. I'm not sure how well this would work if we have, e.g., two separate sets of encodings for the two (or more) extracted slices that may require padding.
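For illustration, here is a hand-written sketch (not compiler output) of what the second option could produce for the dispatch above: the extract_slice -> set_encoding chain is hoisted out, and the dispatch only captures the already-encoded tensor. The #lhs_encoding alias and the value names are made up for readability; the encoding attribute itself is the one printed in the snippet above.

#lhs_encoding = #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iteration_sizes = [8, 8, 8]>
...
// Hoisted out of the dispatch together with its producer:
%extracted_slice = tensor.extract_slice %cst[0, 0] [8, 8] [1, 1] : tensor<16x8xf32> to tensor<8x8xf32>
%lhs_encoded = iree_encoding.set_encoding %extracted_slice : tensor<8x8xf32> -> tensor<8x8xf32, #lhs_encoding>
%2 = flow.dispatch.region -> (tensor<8x8xf32>) {
  // %lhs_encoded is captured from above; the rest of the dispatch stays as it is today.
  ...
  flow.return %11 : tensor<8x8xf32>
}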

In the longer run, I think it might also be sensible to revisit lowering configs/codegen for the pack -> mmt4d -> unpack case instead of bailing on compilation. I'm sure there are problems here that I'm overlooking, so any clarification/pointers would be greatly appreciated :)
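To make the codegen side concrete, the dispatch would roughly have to handle a sequence like the following after materialization. This is only a sketch: the tile sizes (m0 = 8, n0 = 8, k0 = 1) are hypothetical, and the value names refer to the reproducer above.

// Pack the LHS (MxK) and RHS (NxK) of the matmul_transpose_b with the hypothetical tiles.
%lhs_dest = tensor.empty() : tensor<1x8x8x1xf32>
%lhs_packed = linalg.pack %extracted_slice inner_dims_pos = [0, 1] inner_tiles = [8, 1] into %lhs_dest : tensor<8x8xf32> -> tensor<1x8x8x1xf32>
%rhs_dest = tensor.empty() : tensor<1x8x8x1xf32>
%rhs_packed = linalg.pack %arg1 inner_dims_pos = [0, 1] inner_tiles = [8, 1] into %rhs_dest : tensor<8x8xf32> -> tensor<1x8x8x1xf32>
// The accumulator is created directly in the packed layout.
%acc_empty = tensor.empty() : tensor<1x1x8x8xf32>
%acc_fill = linalg.fill ins(%c0 : f32) outs(%acc_empty : tensor<1x1x8x8xf32>) -> tensor<1x1x8x8xf32>
%mm = linalg.mmt4d ins(%lhs_packed, %rhs_packed : tensor<1x8x8x1xf32>, tensor<1x8x8x1xf32>) outs(%acc_fill : tensor<1x1x8x8xf32>) -> tensor<1x1x8x8xf32>
%unpacked = linalg.unpack %mm inner_dims_pos = [0, 1] inner_tiles = [8, 8] into %init : tensor<1x1x8x8xf32> -> tensor<8x8xf32>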

Also, I'm aware that the correct solution here might simply be "don't end up with such a sequence of IR; have two separate buffers/inputs instead of extracting from the same buffer", but I'd also like to ask whether such patterns might occur elsewhere, perhaps more frequently and naturally.

egebeysel, Jun 16 '25

This may be a case where we don't hoist the set_encoding ops out; otherwise, it increases the memory footprint. I don't know what happens if we address the TODO. I haven't addressed it because we want to revisit the hoisting/const-eval mechanism, but that is not happening at the moment.

There are cases where we are not able to hoist the set_encoding op, e.g., if the source is a gather op. I think there are two missing features, and they are independent:

  1. Encoding propagation.
  2. Do not hoist the set_encoding op if its source is in the dispatch region.

We will need to handle (2) for the cases where hoisting can't happen. I'd go with (2) and try to fix the codegen issues if I have cycles.

hanhanW, Jun 18 '25

Just leaving this here so it's easier to follow: https://github.com/iree-org/iree/pull/21181. I believe you're handling the case where we would try to hoist the slice of IR out of the dispatch region there.

egebeysel, Jun 27 '25

> Just leaving this here so it's easier to follow: #21181. I believe you're handling the case where we would try to hoist the slice of IR out of the dispatch region there.

Yes, my plan is to hoist two cases:

  1. extract_slice ops that can be converted to flow ops.
  2. Kick in the const analysis and hoist the whole chain, if the entire slice of IR is hoistable.

I'll keep the rest of the encoding ops in their dispatch, and see how it goes.
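For (1), a minimal sketch of the idea, assuming the hoisted extract_slice is rewritten into a flow.tensor.slice on the constant (the value names are made up; offsets and sizes match the reproducer above):

%c0_idx = arith.constant 0 : index
%c8_idx = arith.constant 8 : index
// Slice the 16x8 constant at the flow level instead of cloning the
// tensor.extract_slice into the dispatch region.
%slice = flow.tensor.slice %cst[%c0_idx, %c0_idx for %c8_idx, %c8_idx] : tensor<16x8xf32> -> tensor<8x8xf32>
// The set_encoding can then be hoisted to operate on %slice, so the dispatch
// only captures the already-encoded tensor.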

hanhanW, Jun 27 '25