[CPU] `transpose -> pack` folding pattern inhibits fusion
What happened?
Due to a folding pattern, namely `FoldConsumerPackWithProducerLinalgTransposeOp`, that is applied as part of encoding materialization (`MaterializeDeviceEncodingPass`), tiling-level fusion doesn't happen and compilation fails at vector-size legality verification.
Steps to reproduce your issue
With the following reproducer:
```mlir
func.func @foo(%arg0 : tensor<512x256xf32>, %arg1 : tensor<256x512xf32>) -> (tensor<256x256xf32>, tensor<256x512xf32>, tensor<256x512xf32>) {
  %0 = tensor.empty() : tensor<256x512xf32>
  %1 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1, d0)>], iterator_types = ["parallel", "parallel"]} ins(%arg0 : tensor<512x256xf32>) outs(%0 : tensor<256x512xf32>) {
  ^bb0(%in: f32, %out: f32):
    linalg.yield %in : f32
  } -> tensor<256x512xf32>
  %empty = tensor.empty() : tensor<256x256xf32>
  %cst = arith.constant 0.0 : f32
  %fill = linalg.fill ins(%cst : f32) outs(%empty : tensor<256x256xf32>) -> tensor<256x256xf32>
  %2 = linalg.matmul_transpose_b ins(%1, %arg1 : tensor<256x512xf32>, tensor<256x512xf32>) outs(%fill : tensor<256x256xf32>) -> tensor<256x256xf32>
  %empty1 = tensor.empty() : tensor<256x512xf32>
  %3 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%1 : tensor<256x512xf32>) outs(%empty1 : tensor<256x512xf32>) {
  ^bb0(%in: f32, %out: f32):
    %res = arith.addf %in, %in : f32
    linalg.yield %res : f32
  } -> tensor<256x512xf32>
  return %2, %1, %3 : tensor<256x256xf32>, tensor<256x512xf32>, tensor<256x512xf32>
}
```
and the command:
```shell
iree-compile --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu=generic --iree-opt-data-tiling=false --iree-dispatch-creation-experimental-data-tiling --iree-llvmcpu-enable-ukernels=none ~/reproducer.mlir --mlir-print-ir-after-all --mlir-print-ir-after-change 2> ~/reproducer-logs.mlir
```
What component(s) does this issue relate to?
Compiler
Version information
Additional context
We have a dispatch with a `transpose -> pack` chain where both results are stored to global buffers. `FoldConsumerPackWithProducerLinalgTransposeOp` folds that transpose-pack chain into a single pack, because pack also has transpose semantics. However, since this is a multi-result dispatch that also needs the intermediate result of the transpose, you actually end up with something like:
```
    producer
    /      \
transpose   pack
    |        |
  store    store
```
As a result, since the generic is no longer the producer of the pack op, the transpose ends up not getting fused during tiling and compilation fails. Dump:
```mlir
// -----// IR Dump Before MaterializeDeviceEncodingPass (iree-codegen-materialize-device-encoding) //----- //
func.func @foo_dispatch_0_transpose_512x256_f32() {
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !iree_tensor_ext.dispatch.tensor<readonly:tensor<512x256xf32>>
%1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<writeonly:tensor<256x512xf32>>
%2 = hal.interface.binding.subspan layout(#pipeline_layout) binding(2) alignment(64) offset(%c0) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<writeonly:tensor<256x512xf32, #encoding>>
%3 = iree_tensor_ext.dispatch.tensor.load %0, offsets = [0, 0], sizes = [512, 256], strides = [1, 1] : !iree_tensor_ext.dispatch.tensor<readonly:tensor<512x256xf32>> -> tensor<512x256xf32>
%4 = tensor.empty() : tensor<256x512xf32>
%5 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "parallel"]} ins(%3 : tensor<512x256xf32>) outs(%4 : tensor<256x512xf32>) {
^bb0(%in: f32, %out: f32):
linalg.yield %in : f32
} -> tensor<256x512xf32>
%6 = iree_encoding.set_encoding %5 : tensor<256x512xf32> -> tensor<256x512xf32, #encoding2>
iree_tensor_ext.dispatch.tensor.store %5, %1, offsets = [0, 0], sizes = [256, 512], strides = [1, 1] : tensor<256x512xf32> -> !iree_tensor_ext.dispatch.tensor<writeonly:tensor<256x512xf32>>
iree_tensor_ext.dispatch.tensor.store %6, %2, offsets = [0, 0], sizes = [256, 512], strides = [1, 1] : tensor<256x512xf32, #encoding2> -> !iree_tensor_ext.dispatch.tensor<writeonly:tensor<256x512xf32, #encoding>>
return
}
// -----// IR Dump After MaterializeDeviceEncodingPass (iree-codegen-materialize-device-encoding) //----- //
func.func @foo_dispatch_0_transpose_512x256_f32() {
%cst = arith.constant 0.000000e+00 : f32
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !iree_tensor_ext.dispatch.tensor<readonly:tensor<512x256xf32>>
%1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<writeonly:tensor<256x512xf32>>
%2 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(2) alignment(64) offset(%c0) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<writeonly:tensor<32x512x8x1xf32>>
%3 = iree_tensor_ext.dispatch.tensor.load %0, offsets = [0, 0], sizes = [512, 256], strides = [1, 1] : !iree_tensor_ext.dispatch.tensor<readonly:tensor<512x256xf32>> -> tensor<512x256xf32>
%4 = tensor.empty() : tensor<256x512xf32>
%5 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1, d0)>], iterator_types = ["parallel", "parallel"]} ins(%3 : tensor<512x256xf32>) outs(%4 : tensor<256x512xf32>) {
^bb0(%in: f32, %out: f32):
linalg.yield %in : f32
} -> tensor<256x512xf32>
%6 = tensor.empty() : tensor<32x512x8x1xf32>
%pack = linalg.pack %3 padding_value(%cst : f32) outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [8, 1] into %6 : tensor<512x256xf32> -> tensor<32x512x8x1xf32>
iree_tensor_ext.dispatch.tensor.store %5, %1, offsets = [0, 0], sizes = [256, 512], strides = [1, 1] : tensor<256x512xf32> -> !iree_tensor_ext.dispatch.tensor<writeonly:tensor<256x512xf32>>
iree_tensor_ext.dispatch.tensor.store %pack, %2, offsets = [0, 0, 0, 0], sizes = [32, 512, 8, 1], strides = [1, 1, 1, 1] : tensor<32x512x8x1xf32> -> !iree_tensor_ext.dispatch.tensor<writeonly:tensor<32x512x8x1xf32>>
return
}
// -----// IR Dump After TileAndDistributeToWorkgroupsUsingForallOpPass (iree-codegen-tile-and-distribute-to-workgroups-using-forall-op) //----- //
func.func @foo_dispatch_0_transpose_512x256_f32() attributes {translation_info = #iree_codegen.translation_info<pipeline = CPUDoubleTilingExpert>} {
%cst = arith.constant 0.000000e+00 : f32
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !iree_tensor_ext.dispatch.tensor<readonly:tensor<512x256xf32>>
%1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<writeonly:tensor<256x512xf32>>
%2 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(2) alignment(64) offset(%c0) flags(Indirect) : !iree_tensor_ext.dispatch.tensor<writeonly:tensor<32x512x8x1xf32>>
%3 = iree_tensor_ext.dispatch.tensor.load %0, offsets = [0, 0], sizes = [512, 256], strides = [1, 1] : !iree_tensor_ext.dispatch.tensor<readonly:tensor<512x256xf32>> -> tensor<512x256xf32>
%4 = tensor.empty() : tensor<256x512xf32>
%5 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1, d0)>], iterator_types = ["parallel", "parallel"]} ins(%3 : tensor<512x256xf32>) outs(%4 : tensor<256x512xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [1, 8], [0, 0], [0, 0]]>} {
^bb0(%in: f32, %out: f32):
linalg.yield %in : f32
} -> tensor<256x512xf32>
%6 = tensor.empty() : tensor<32x512x8x1xf32>
%7 = scf.forall (%arg0, %arg1) = (0, 0) to (32, 512) step (8, 64) shared_outs(%arg2 = %6) -> (tensor<32x512x8x1xf32>) {
%8 = affine.min affine_map<(d0) -> (-d0 + 512, 64)>(%arg1)
%9 = affine.apply affine_map<(d0) -> (d0 * 8)>(%arg0)
%10 = affine.min affine_map<(d0) -> (d0 * -8 + 256, 64)>(%arg0)
%extracted_slice = tensor.extract_slice %3[%arg1, %9] [%8, %10] [1, 1] : tensor<512x256xf32> to tensor<?x?xf32>
%extracted_slice_0 = tensor.extract_slice %arg2[%arg0, %arg1, 0, 0] [8, 64, 8, 1] [1, 1, 1, 1] : tensor<32x512x8x1xf32> to tensor<8x64x8x1xf32>
%pack = linalg.pack %extracted_slice padding_value(%cst : f32) outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [8, 1] into %extracted_slice_0 {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[8, 64], [1, 1], [0, 0], [0, 0]]>} : tensor<?x?xf32> -> tensor<8x64x8x1xf32>
scf.forall.in_parallel {
tensor.parallel_insert_slice %pack into %arg2[%arg0, %arg1, 0, 0] [8, 64, 8, 1] [1, 1, 1, 1] : tensor<8x64x8x1xf32> into tensor<32x512x8x1xf32>
}
} {mapping = [#iree_codegen.workgroup_mapping<y>, #iree_codegen.workgroup_mapping<x>]}
iree_tensor_ext.dispatch.tensor.store %5, %1, offsets = [0, 0], sizes = [256, 512], strides = [1, 1] : tensor<256x512xf32> -> !iree_tensor_ext.dispatch.tensor<writeonly:tensor<256x512xf32>>
iree_tensor_ext.dispatch.tensor.store %7, %2, offsets = [0, 0, 0, 0], sizes = [32, 512, 8, 1], strides = [1, 1, 1, 1] : tensor<32x512x8x1xf32> -> !iree_tensor_ext.dispatch.tensor<writeonly:tensor<32x512x8x1xf32>>
return
}
```
Now I believe there are multiple ways to tackle this, and I would like some feedback from you folks on which one is the most sensible, or whether you have other recommendations:

- One can adjust the upstream `FoldConsumerPackWithProducerLinalgTransposeOp` pattern in MLIR to only apply when the producer transpose is going to be folded away, i.e. has no other consumer (a sketch of this check follows the list). But this can obviously inhibit further folding patterns, e.g. if the transpose only has 2 pack consumers, one can essentially apply the pattern twice and then get rid of the transpose, which wouldn't happen anymore.
- We can add the "is the transpose going to be folded away or not" check on the IREE side where we use the pattern, namely `MaterializeDeviceEncoding`. If the check fails, we don't run the pattern.
- One can try to support fusion in such cases, but I don't really see this being the best option, or at least it's a long shot considering the current state of tiling, lowering-config propagation, etc.
- One can revisit the dispatch-creation patterns so the pack op isn't cloned into the dispatch to begin with. I guess this one comes with the penalty of (re-)loading the input (or the result of the transpose) twice, but that's still better than failing to compile.
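To make the first option concrete, here is a minimal sketch (not the actual upstream pattern code; the struct name and elided body are illustrative) of what the extra guard could look like. It only shows the plain `linalg.transpose` producer case, whereas the real pattern also covers transpose-like `linalg.generic` producers as in the reproducer:

```cpp
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Sketch of option 1: bail out of the transpose->pack fold when the producer
// transpose has users other than the pack op, since the transpose has to stay
// alive in that case and the pack would lose its fusable producer.
struct FoldPackWithSingleUseTransposeSketch
    : public OpRewritePattern<linalg::PackOp> {
  using OpRewritePattern<linalg::PackOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(linalg::PackOp packOp,
                                PatternRewriter &rewriter) const override {
    auto transposeOp = packOp.getSource().getDefiningOp<linalg::TransposeOp>();
    if (!transposeOp)
      return failure();

    // The proposed guard: only fold when the pack is the sole user, i.e. the
    // transpose is guaranteed to become dead after the rewrite.
    if (!transposeOp->hasOneUse())
      return rewriter.notifyMatchFailure(
          packOp, "producer transpose has additional users");

    // ... existing folding logic: compose the transpose permutation into the
    // pack's outer_dims_perm/inner_dims_pos and retarget the pack's source ...
    return success();
  }
};
```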
P.S.: I observed the same error on several actual PyTorch models; I could also name a few if that's going to help :)
@hanhanW mind taking a look at this if/when you have time? I'll gladly take over the implementation part if I could get a second opinion on which direction to go :)
There is a potential memory footprint issue when doing data-tiling. In your use case, it may increase the total memory usage if we end up with such a dispatch. It is okay if memory footprint is not a concern. It may be tricky to revisit the idea at the model level.
In the local dispatch scope, I'd add a `controlFn` that allows the folding only when the transpose op has a single use.
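A minimal sketch of that direction, under stated assumptions: the callback type and the way it would be threaded into the fold-pattern population done around `MaterializeDeviceEncodingPass` are hypothetical here, not the actual MLIR/IREE API:

```cpp
#include <functional>

#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Hypothetical controlFn type: the fold patterns would consult it with the
// pack op's source operand before performing the rewrite.
using PackFoldControlFn = std::function<bool(OpOperand *packSource)>;

// IREE-side policy sketch: only allow the transpose->pack fold when the
// producing op (the transpose) has a single use, so the fold can never strand
// a multi-use transpose next to the folded pack.
static bool allowTransposePackFold(OpOperand *packSource) {
  Operation *producer = packSource->get().getDefiningOp();
  return producer && producer->hasOneUse();
}
```

The appeal of this route is that the upstream pattern stays as aggressive as before for other users, while the single-use policy lives on the IREE side, next to the code that cares about fusion.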
(Sorry, I was going to think more about the memory footprint issue and see if I can make a better suggestion. However, I was pulled into other issues and integrate duties. I still don't have a good suggestion today.)
> P.S.: I observed the same error on several actual PyTorch models; I could also name a few if that's going to help :)
Can you name a few of them? I think I'll use llama and sdxl to think about the memory footprint problem. Having more examples may help me explore the ideas.
We hit a memory footprint issue in llama for the padding encoding strategy, so today we only enable the encoding on the LHS for this approach. We want to enable the encodings on the RHS for sure. However, we observe the issue of using too much memory when we do hoisting. It is fixable, but it takes some time. https://github.com/iree-org/iree/issues/20439
Absolutely no problem, thanks for the reply :)
> Can you name a few of them? I think I'll use llama and sdxl to think about the memory footprint problem. Having more examples may help me explore the ideas.
`openai-community/gpt2` would be an example of that. I use the `aot.export` path from turbine. EDIT: `stabilityai/stable-code-3b` would be another example.
> There is a potential memory footprint issue when doing data-tiling. In your use case, it may increase the total memory usage if we end up with such a dispatch.
Do you mean because the transpose op possibly has multiple encodings that each get hoisted/cloned into initializers and therefore potentially get duplicated?
Speaking of memory footprint, btw: I believe there was an issue and/or a difference between the dt-fusion pipeline and the current default (CPU) pipeline, in terms of the weight encodings being constant-folded in the latter but hoisted into initializers in the former. In that regard, is it normal that the peak memory consumption I get from benchmarking the DT-fusion pipeline (on CPU) is almost the same as the model size (or at least in the same order of magnitude), whereas it is usually much less with the current default DT pipeline?
> Do you mean because the transpose op possibly has multiple encodings that each get hoisted/cloned into initializers and therefore potentially get duplicated?
No. What I meant is that the transpose could have at least two users. One is the matmul, and the other may be another dispatch, like an element-wise/reduction/extract_slice/etc.
```mermaid
flowchart LR
  transpose --> matmul
  transpose --> reduction
```
Now we introduce data-tiling, so there is a `set_encoding` op in between. If we fuse the `set_encoding` op with the transpose op, IREE needs to allocate an additional buffer to hold the result of the encoded transpose dispatch and pass it to the matmul.
```mermaid
flowchart LR
  transpose --> set_encoding
  set_encoding --> matmul
  transpose --> reduction
```
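(For scale, and just as an illustration using the shapes from the dump above: the plain transpose result `tensor<256x512xf32>` is 256×512×4 B = 512 KiB, and the encoded copy `tensor<32x512x8x1xf32>` is another 32×512×8×4 B = 512 KiB that has to stay live alongside it; for real weight shapes the duplicated buffers grow accordingly.)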
This is a very involved issue, because sometimes it can be fixed on the model side. E.g., the reduction may be a reshape or a slice or something else, and then reordering some ops may help.
Oh I see, okay, thanks! Let me know if I can do anything in that regard as well, but I think for this specific case I'll go in the controlFn direction!
> Speaking of memory footprint, btw: I believe there was an issue and/or a difference between the dt-fusion pipeline and the current default (CPU) pipeline, in terms of the weight encodings being constant-folded in the latter but hoisted into initializers in the former. In that regard, is it normal that the peak memory consumption I get from benchmarking the DT-fusion pipeline (on CPU) is almost the same as the model size (or at least in the same order of magnitude), whereas it is usually much less with the current default DT pipeline?
I'm not sure how you benchmark the peak memory consumption, but I may know what the issue is. First, I think we should set the baseline to compilation without data-tiling, and say that we use X memory in total, and there are Y bytes for weights.
Scenario 1: The weights are not embedded:
In this case, you can't do const-evaluation because the weights are only available at runtime. What we can do here is hoist them to initializers during compilation and evaluate them once in the init stage. Then we start the real execution/inference.
The current default DT pipeline could have less memory consumption compared to the dt-fusion pipeline, because we have materialized encodings into pack/unpack ops and some of them can be folded away, like reshapes. The peak memory consumption could be 2Y during initialization, and it should be in the same order as the baseline. We don't run encoding propagation for the DT-fusion pipeline at the moment, so it can have more memory allocation, especially when encoding ops are not fused. This also results in 2Y during initialization, and the total memory usage is on the order of O(X).
Scenario 2: The weights are embedded:
In this case, the current default DT pipeline can do const-evaluation. The 2Y memory usage is hidden in the compilation time. Thus, you'll likely see O(X) memory usage during execution. Note: it's O(X + 2Y) in scenario 1, where the weights are not provided as irpa files.
However, IREE does not support const-evaluation after dispatch creation -- this needs to be fixed, but it is not trivial. So the DT-fusion pipeline can only hoist them to initializers and follow the scenario 1 flow, and the memory usage is O(X + 2Y) during execution.
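As a rough illustration of the difference (made-up numbers, not measurements): with X = 8 GB total for the baseline and Y = 3 GB of weights, the const-evaluated path stays around O(X) = 8 GB at execution time, while the hoist-to-initializers path peaks around X + 2Y = 8 + 2·3 = 14 GB, because both the original and the encoded copies of the weights are live until initialization finishes.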
Hopefully, this answers your question. :)
Yes! That answers my question, I was talking about scenario 2 :) Thanks!
Btw, I have the fix ready for this one; I will submit it today or tomorrow to upstream LLVM and then here :) You can assign the issue to me if that's necessary @hanhanW
It is not necessary, but I think that it is good to assign it to you because you are working on it.