Generic ops are not fused and get pulled into the same dispatch
This is IR from MobileNetV3. The two element-wise linalg.generic ops in the dispatch below can be fused, but they are not. We'd like to fuse them into a single generic op, right?
func @main_dispatch_91() {
  %cst = arith.constant 0.000000e+00 : f32
  %cst_0 = arith.constant 3.000000e+00 : f32
  %cst_1 = arith.constant 6.000000e+00 : f32
  %cst_2 = arith.constant 0.166666672 : f32
  %c0 = arith.constant 0 : index
  %c3468864 = arith.constant 3468864 : index
  %c274048 = arith.constant 274048 : index
  %c18816 = arith.constant 18816 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<readonly:49x96xf32>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) offset(%c3468864) alignment(64) : !flow.dispatch.tensor<readonly:96x576xf32>
  %2 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) offset(%c274048) alignment(64) : !flow.dispatch.tensor<readonly:576xf32>
  %3 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) offset(%c18816) alignment(64) : !flow.dispatch.tensor<writeonly:49x576xf32>
  %4 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [49, 96], strides = [1, 1] : !flow.dispatch.tensor<readonly:49x96xf32> -> tensor<49x96xf32>
  %5 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [96, 576], strides = [1, 1] : !flow.dispatch.tensor<readonly:96x576xf32> -> tensor<96x576xf32>
  %6 = flow.dispatch.tensor.load %2, offsets = [0], sizes = [576], strides = [1] : !flow.dispatch.tensor<readonly:576xf32> -> tensor<576xf32>
  %7 = linalg.init_tensor [49, 576] : tensor<49x576xf32>
  %8 = linalg.fill ins(%cst : f32) outs(%7 : tensor<49x576xf32>) -> tensor<49x576xf32>
  %9 = linalg.matmul ins(%4, %5 : tensor<49x96xf32>, tensor<96x576xf32>) outs(%8 : tensor<49x576xf32>) -> tensor<49x576xf32>
  %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%6, %9 : tensor<576xf32>, tensor<49x576xf32>) outs(%7 : tensor<49x576xf32>) {
  ^bb0(%arg0: f32, %arg1: f32, %arg2: f32):
    %12 = arith.addf %arg0, %arg1 : f32
    linalg.yield %12 : f32
  } -> tensor<49x576xf32>
  %11 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%10 : tensor<49x576xf32>) outs(%7 : tensor<49x576xf32>) {
  ^bb0(%arg0: f32, %arg1: f32):
    %12 = arith.addf %arg0, %cst_0 : f32
    %13 = arith.cmpf olt, %12, %cst : f32
    %14 = arith.select %13, %cst, %12 : f32
    %15 = arith.cmpf olt, %cst_1, %12 : f32
    %16 = arith.select %15, %cst_1, %14 : f32
    %17 = arith.mulf %arg0, %16 : f32
    %18 = arith.mulf %17, %cst_2 : f32
    linalg.yield %18 : f32
  } -> tensor<49x576xf32>
  flow.dispatch.tensor.store %11, %3, offsets = [0, 0], sizes = [49, 576], strides = [1, 1] : tensor<49x576xf32> -> !flow.dispatch.tensor<writeonly:49x576xf32>
  return
}
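For illustration, here is a rough sketch of the single fused element-wise op we would expect (hand-written, not compiler output; SSA value names are made up). The bias add and the hard-swish computation collapse into one linalg.generic that consumes the bias and the matmul result directly:

%fused = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%6, %9 : tensor<576xf32>, tensor<49x576xf32>) outs(%7 : tensor<49x576xf32>) {
^bb0(%bias: f32, %acc: f32, %out: f32):
  // Bias add (body of the first generic op).
  %x = arith.addf %bias, %acc : f32
  // hard-swish: x * clamp(x + 3, 0, 6) * (1/6) (body of the second generic op).
  %shift = arith.addf %x, %cst_0 : f32
  %is_neg = arith.cmpf olt, %shift, %cst : f32
  %relu = arith.select %is_neg, %cst, %shift : f32
  %gt_six = arith.cmpf olt, %cst_1, %shift : f32
  %clamped = arith.select %gt_six, %cst_1, %relu : f32
  %mul = arith.mulf %x, %clamped : f32
  %res = arith.mulf %mul, %cst_2 : f32
  linalg.yield %res : f32
} -> tensor<49x576xf32>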
This probably doesn't happen at top of tree after the latest fusion changes, though with my PR to fuse multi-use producers I do see it happening. Ideally, yes, we'd like to fuse them, but I think the backend should also be able to handle these (at least the CPU backend). I'll look into this more deeply after I get back.
Yes, the backend is able to handle it. This is not working with the padding prototype, though. Padding works in top-down order: in the CodegenDriver approach (and the current Sandbox approach), we have to specify an anchor op name to pad operations in top-down order. In this case we don't want to have two generic ops, because CodegenDriver doesn't know which op to pad first. This is very fragile. I hope we can fix it after migrating to the transform dialect. I have a fallback solution for this case, so nothing is blocking me at the moment.
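To sketch why the transform dialect should help: ops can be targeted by matching rather than by their position in the region, so having one or two generic ops no longer matters for picking the pad anchor. This is only a rough illustration using today's upstream transform.structured.* ops; the exact op names, attributes, and the CodegenDriver/Sandbox integration are assumptions, not the current interface.

transform.sequence failures(propagate) {
^bb0(%module: !transform.any_op):
  // Select the op to pad by name instead of relying on top-down op order;
  // the element-wise generic ops are simply not matched.
  %matmul = transform.structured.match ops{["linalg.matmul"]} in %module
      : (!transform.any_op) -> !transform.any_op
  // Pad only the matched matmul.
  %padded, %pad, %copy = transform.structured.pad %matmul {
      padding_values = [0.0 : f32, 0.0 : f32, 0.0 : f32],
      padding_dimensions = [0, 1, 2]
  } : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op)
}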
I had found the same issue, but I thought the same as Mahesh: the reason is that we have a single dispatch region, and the operations in the region get flattened out in the dispatch.
@hanhanW @okkwon Hey guys, this is marked as P1 but is fairly dated (3 months old) - are we planning to work on this, or should we bump it to P2?
Bumping it to P2. I don't know if this is still an issue.
Closing now. I don't think this is an issue anymore (or it is done deliberately and is expected behavior).