Generic ops are not fused and get pulled into the same dispatch
This is IR from MobileNetV3. The two element-wise linalg.generic ops in the dispatch below can be fused, but they are not. We'd like to fuse them into a single generic op, right?
func @main_dispatch_91() {
  %cst = arith.constant 0.000000e+00 : f32
  %cst_0 = arith.constant 3.000000e+00 : f32
  %cst_1 = arith.constant 6.000000e+00 : f32
  %cst_2 = arith.constant 0.166666672 : f32
  %c0 = arith.constant 0 : index
  %c3468864 = arith.constant 3468864 : index
  %c274048 = arith.constant 274048 : index
  %c18816 = arith.constant 18816 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<readonly:49x96xf32>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) offset(%c3468864) alignment(64) : !flow.dispatch.tensor<readonly:96x576xf32>
  %2 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) offset(%c274048) alignment(64) : !flow.dispatch.tensor<readonly:576xf32>
  %3 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) offset(%c18816) alignment(64) : !flow.dispatch.tensor<writeonly:49x576xf32>
  %4 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [49, 96], strides = [1, 1] : !flow.dispatch.tensor<readonly:49x96xf32> -> tensor<49x96xf32>
  %5 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [96, 576], strides = [1, 1] : !flow.dispatch.tensor<readonly:96x576xf32> -> tensor<96x576xf32>
  %6 = flow.dispatch.tensor.load %2, offsets = [0], sizes = [576], strides = [1] : !flow.dispatch.tensor<readonly:576xf32> -> tensor<576xf32>
  %7 = linalg.init_tensor [49, 576] : tensor<49x576xf32>
  %8 = linalg.fill ins(%cst : f32) outs(%7 : tensor<49x576xf32>) -> tensor<49x576xf32>
  %9 = linalg.matmul ins(%4, %5 : tensor<49x96xf32>, tensor<96x576xf32>) outs(%8 : tensor<49x576xf32>) -> tensor<49x576xf32>
  %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%6, %9 : tensor<576xf32>, tensor<49x576xf32>) outs(%7 : tensor<49x576xf32>) {
  ^bb0(%arg0: f32, %arg1: f32, %arg2: f32):
    %12 = arith.addf %arg0, %arg1 : f32
    linalg.yield %12 : f32
  } -> tensor<49x576xf32>
  %11 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%10 : tensor<49x576xf32>) outs(%7 : tensor<49x576xf32>) {
  ^bb0(%arg0: f32, %arg1: f32):
    %12 = arith.addf %arg0, %cst_0 : f32
    %13 = arith.cmpf olt, %12, %cst : f32
    %14 = arith.select %13, %cst, %12 : f32
    %15 = arith.cmpf olt, %cst_1, %12 : f32
    %16 = arith.select %15, %cst_1, %14 : f32
    %17 = arith.mulf %arg0, %16 : f32
    %18 = arith.mulf %17, %cst_2 : f32
    linalg.yield %18 : f32
  } -> tensor<49x576xf32>
  flow.dispatch.tensor.store %11, %3, offsets = [0, 0], sizes = [49, 576], strides = [1, 1] : tensor<49x576xf32> -> !flow.dispatch.tensor<writeonly:49x576xf32>
  return
}
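For illustration, here is a rough sketch of the single fused element-wise op we would expect (hand-written, not compiler output; SSA value names are made up). The bias add and the hard-swish computation collapse into one linalg.generic that consumes the bias and the matmul result directly:

%fused = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%6, %9 : tensor<576xf32>, tensor<49x576xf32>) outs(%7 : tensor<49x576xf32>) {
^bb0(%bias: f32, %acc: f32, %out: f32):
  // Bias add (body of the first generic op).
  %x = arith.addf %bias, %acc : f32
  // hard-swish: x * clamp(x + 3, 0, 6) * (1/6) (body of the second generic op).
  %shift = arith.addf %x, %cst_0 : f32
  %is_neg = arith.cmpf olt, %shift, %cst : f32
  %relu = arith.select %is_neg, %cst, %shift : f32
  %gt_six = arith.cmpf olt, %cst_1, %shift : f32
  %clamped = arith.select %gt_six, %cst_1, %relu : f32
  %mul = arith.mulf %x, %clamped : f32
  %res = arith.mulf %mul, %cst_2 : f32
  linalg.yield %res : f32
} -> tensor<49x576xf32>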
This probably doesn't happen at top of tree after the latest fusion changes, though with my PR to fuse multi-use producers I do see it happening. Ideally, yes, we'd like to fuse them, but I think the backend should also be able to handle these (at least the CPU backend). I'll look into this more deeply after I get back.
Yes, the backend is able to handle it. This is not working with the padding prototype, though. Padding works in top-down order: in the CodegenDriver approach (and the current Sandbox approach), we have to specify an anchor op name to pad operations in top-down order. In this case we don't want to have two generic ops, because CodegenDriver doesn't know which op to pad first. This is very fragile. I hope we can fix it after migrating to the transform dialect. I have a fallback solution for this case, so nothing is blocking me at the moment.
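To sketch why the transform dialect should help: ops can be targeted by matching rather than by their position in the region, so having one or two generic ops no longer matters for picking the pad anchor. This is only a rough illustration using today's upstream transform.structured.* ops; the exact op names, attributes, and the CodegenDriver/Sandbox integration are assumptions, not the current interface.

transform.sequence failures(propagate) {
^bb0(%module: !transform.any_op):
  // Select the op to pad by name instead of relying on top-down op order;
  // the element-wise generic ops are simply not matched.
  %matmul = transform.structured.match ops{["linalg.matmul"]} in %module
      : (!transform.any_op) -> !transform.any_op
  // Pad only the matched matmul.
  %padded, %pad, %copy = transform.structured.pad %matmul {
      padding_values = [0.0 : f32, 0.0 : f32, 0.0 : f32],
      padding_dimensions = [0, 1, 2]
  } : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op)
}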
I had found the same issue, but I thought the same as Mahesh: the reason is that we have a single dispatch region, and the operations in the region get flattened out in the dispatch.
@hanhanW @okkwon Hey guys, this is marked as P1 but is fairly dated (3 months old) - are we planning to work on this, or should we bump it to P2?
Bumping it to P2. I don't know if this is still an issue.
Closing now. I don't think this is an issue anymore (or it is done deliberately and is expected behavior).