Improve support for `tensor.extract_slice` in layout/packing system
See #2026 for previous discussion
So I tried the following MLIR to test the layout system's support for tensor.extract_slice and tensor.insert_slice:
module {
  func.func @test(%arg0: !secret.secret<tensor<16xi8>>) -> !secret.secret<tensor<11x16xi8>> {
    %0 = tensor.empty() : tensor<11x16xi8>
    %1 = secret.generic(%arg0: !secret.secret<tensor<16xi8>>) {
    ^body(%input0: tensor<16xi8>):
      %extracted_slice = tensor.extract_slice %input0[0] [4] [1] : tensor<16xi8> to tensor<4xi8>
      %inserted_slice = tensor.insert_slice %extracted_slice into %0[0, 0] [1, 4] [1, 1] : tensor<4xi8> into tensor<11x16xi8>
      secret.yield %inserted_slice : tensor<11x16xi8>
    } -> !secret.secret<tensor<11x16xi8>>
    return %1 : !secret.secret<tensor<11x16xi8>>
  }
}
I get errors with both the old layout-propagation pass and the new-layout-propagation pass. The error from layout-propagation is, unsurprisingly, very similar to the one Alex described in #2026. The error from new-layout-propagation is:
layout-prop.mlir:7:25: error: 'tensor_ext.convert_layout' op requires tensor rank to match the layout's domain size, but found rank 2 and domain size 1
%inserted_slice = tensor.insert_slice %extracted_slice into %0[0, 0] [1, 4] [1, 1] : tensor<4xi8> into tensor<11x16xi8>
The MLIR that results from running new-layout-propagation on the input above looks like this:
#new_layout = #tensor_ext.new_layout<domainSize=1, localSize=4, relation="(d0, d1, d2, d3, d4, d5, d6) : (d0 - d3 * 1024 - d4 == 0, -d1 + d3 == 0, d2 - d5 * 16 - d6 == 0, d4 - d6 == 0, d3 * -1024 - d4 + 15 >= 0, d3 * 1024 + d4 >= 0, d1 >= 0, d2 >= 0, -d2 + 1023 >= 0, -d4 + 1023 >= 0, d4 >= 0, -d2 + d5 * 16 + 15 >= 0, d2 - d5 * 16 >= 0)">
#new_layout1 = #tensor_ext.new_layout<domainSize=2, localSize=4, relation="(d0, d1, d2, d3, d4, d5, d6, d7) : (d0 * 16 + d1 - d4 * 1024 - d5 == 0, -d2 + d4 == 0, d3 - d6 * 256 - d7 == 0, d5 - d7 == 0, d1 - d4 * 1024 - d5 + 160 >= 0, -d1 + d4 * 1024 + d5 >= 0, -d1 + 15 >= 0, d1 >= 0, d2 >= 0, d3 >= 0, -d3 + 1023 >= 0, -d5 + 1023 >= 0, d5 >= 0, -d3 + d6 * 256 + 255 >= 0, d3 - d6 * 256 >= 0)">
"builtin.module"() ({
"func.func"() <{arg_attrs = [{tensor_ext.layout = #new_layout}], function_type = (!secret.secret<tensor<16xi8>>) -> !secret.secret<tensor<11x16xi8>>, res_attrs = [{tensor_ext.layout = #new_layout}], sym_name = "test"}> ({
^bb0(%arg0: !secret.secret<tensor<16xi8>>):
%0 = "tensor.empty"() : () -> tensor<11x16xi8>
%1 = "secret.generic"(%arg0) ({
^bb0(%arg1: tensor<16xi8>):
%2 = "tensor.extract_slice"(%arg1) <{operandSegmentSizes = array<i32: 1, 0, 0, 0>, static_offsets = array<i64: 0>, static_sizes = array<i64: 4>, static_strides = array<i64: 1>}> {tensor_ext.layout = #new_layout} : (tensor<16xi8>) -> tensor<4xi8>
%3 = "tensor_ext.assign_layout"(%0) <{layout = #new_layout1}> {tensor_ext.layout = #new_layout1} : (tensor<11x16xi8>) -> tensor<11x16xi8>
%4 = "tensor_ext.convert_layout"(%3) <{from_layout = #new_layout1, to_layout = #new_layout}> {tensor_ext.layout = #new_layout} : (tensor<11x16xi8>) -> tensor<11x16xi8>
%5 = "tensor.insert_slice"(%2, %4) <{operandSegmentSizes = array<i32: 1, 1, 0, 0, 0>, static_offsets = array<i64: 0, 0>, static_sizes = array<i64: 1, 4>, static_strides = array<i64: 1, 1>}> {tensor_ext.layout = #new_layout} : (tensor<4xi8>, tensor<11x16xi8>) -> tensor<11x16xi8>
"secret.yield"(%5) : (tensor<11x16xi8>) -> ()
}) {__argattrs = [{tensor_ext.layout = #new_layout}], __resattrs = [{tensor_ext.layout = #new_layout}]} : (!secret.secret<tensor<16xi8>>) -> !secret.secret<tensor<11x16xi8>>
"func.return"(%1) : (!secret.secret<tensor<11x16xi8>>) -> ()
}) : () -> ()
}) : () -> ()
Looking at that output, I think the immediate problem is %4: the convert_layout materialized for the insert_slice destination converts the rank-2 tensor<11x16xi8> to #new_layout, whose domainSize is 1, which is exactly the rank/domain-size mismatch the error message complains about.
I think I kind of understood how the layouts from the old layout-propagation pass worked, e.g.
#layout = #tensor_ext.layout<map = (d0) -> (d0 mod 1024), alignment = #alignment>
#layout1 = #tensor_ext.layout<map = (d0, d1) -> ((d0 * 16 + d1) mod 1024), alignment = #alignment1>
But how do the new layouts work, with the 7+ variables in the relation?
It might be worth waiting, as Asra and I are actively working on revamping this system. The new layout format expresses the same layout as before, but it displays the formulas in a "flattened" form right now. We're exploring options for improving the textual form.
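For what it's worth, here is my reading of #new_layout above (an interpretation, so treat it as a sketch rather than documentation). With domainSize = 1, d0 is the data index into the tensor<16xi8>, the next two variables (d1, d2) are the (ciphertext, slot) pair, and the localSize = 4 variables d3..d6 are existentials introduced when mod/floordiv expressions are flattened into affine equalities. Substituting the equalities d3 = d1, d4 = d6, d0 = 1024*d3 + d4, and d2 = 16*d5 + d6, together with the bounds 0 <= 1024*d3 + d4 <= 15 and 0 <= d2 - 16*d5 <= 15, forces d1 = 0 and d0 = d2 mod 16: the 16 entries are replicated cyclically across the 1024 slots of a single ciphertext, which lines up with the old (d0) -> (d0 mod 1024) form (modulo whatever the #alignment attribute specifies). #new_layout1 reads the same way with two domain variables; the 11x16 tensor is flattened row-major as 16*d0 + d1 and matched against slot mod 256.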
Hey, I'm currently exploring this. For extract_slice / insert_slice ops that are simply rank-reducing or rank-expanding, I'm adding patterns that canonicalize them to expand/collapse shape ops, which in the new layout system are just new reassociations of the domain vars (pretty trivial).
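For concreteness, a hand-written sketch of the purely rank-reducing case (this is not output from the actual patterns, and %t is a made-up value):
// A rank-reducing extract_slice that only drops a unit dimension...
%slice = tensor.extract_slice %t[0, 0] [1, 4] [1, 1] : tensor<1x4xi8> to tensor<4xi8>
// ...moves the same data as a collapse_shape with reassociation [[0, 1]],
// which on the layout side is just a reassociation of the domain vars.
%collapsed = tensor.collapse_shape %t [[0, 1]] : tensor<1x4xi8> into tensor<4xi8>
The rank-expanding insert_slice is the mirror image (an expand_shape), at least when the slice covers the whole destination.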
But now I'm also hitting this for more general cases. For example, if we have a 2-D convolution over a number of channels/filters,
%1 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%arg0, %cst_3 : tensor<1x1x32x32xf32>, tensor<6x1x10x10xf32>) outs(%broadcasted : tensor<1x6x23x23xf32>) -> tensor<1x6x23x23xf32>
Then we can expand this into a loop over 2-D convolutions, but we then need to insert the results into the output tensor<1x6x23x23xf32>. Naively, one would probably encode each 2-D input channel into a separate ciphertext and keep the output channels in individual ciphertexts as well (so that the output here would be something like tensor<6x!ct>).
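To make that concrete, here is roughly the unrolled form I have in mind. This is a hand-written sketch rather than pass output: %init and the index constants are placeholders, and %arg0 / %cst_3 are the operands of the conv above.
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c6 = arith.constant 6 : index
// Loop over the 6 output channels (there is only one input channel here).
%out = scf.for %f = %c0 to %c6 step %c1 iter_args(%acc = %init) -> (tensor<1x6x23x23xf32>) {
  // Rank-reducing slices of the image, the per-channel filter, and the output.
  %img = tensor.extract_slice %arg0[0, 0, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<1x1x32x32xf32> to tensor<32x32xf32>
  %ker = tensor.extract_slice %cst_3[%f, 0, 0, 0] [1, 1, 10, 10] [1, 1, 1, 1] : tensor<6x1x10x10xf32> to tensor<10x10xf32>
  %dst = tensor.extract_slice %acc[0, %f, 0, 0] [1, 1, 23, 23] [1, 1, 1, 1] : tensor<1x6x23x23xf32> to tensor<23x23xf32>
  // Plain 2-D convolution for this output channel.
  %res = linalg.conv_2d ins(%img, %ker : tensor<32x32xf32>, tensor<10x10xf32>) outs(%dst : tensor<23x23xf32>) -> tensor<23x23xf32>
  // This rank-expanding insert_slice is the op the layout system needs to handle.
  %next = tensor.insert_slice %res into %acc[0, %f, 0, 0] [1, 1, 23, 23] [1, 1, 1, 1] : tensor<23x23xf32> into tensor<1x6x23x23xf32>
  scf.yield %next : tensor<1x6x23x23xf32>
}
With the naive per-channel packing, each 2-D slice would be its own ciphertext, and the final insert_slice is where the packed 4-D output has to be reconciled with the per-channel layout.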
If you run that convolution example as a proper IR through --torch-linalg-to-ckks, you get this:
/google/src/cloud/asraa/conv-heir/google3/third_party/heir/out.debug:17:10: note: see current operation: %22 = "tensor.insert_slice"(%19, %21) <{operandSegmentSizes = array<i32: 1, 1, 0, 0, 0>, static_offsets = array<i64: 0, 0, 0, 0>, static_sizes = array<i64: 1, 1, 23, 23>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<23x23xf32>, tensor<1x6x23x23xf32>) -> tensor<1x6x23x23xf32>
Just an update before I start working on the support: IIUC, Asra is finishing the last steps of this to support our next ML demo (a conv network).