[compile][cpu]:error: 'memref.alloca' op expected no unbounded stack allocations
What happened?
For the given IR:
module {
func.func @"torch-jit-export"(%arg0: !torch.vtensor<[?,384],si64>, %arg1: !torch.vtensor<[?,384],si64>, %arg2: !torch.vtensor<[?,384],si64>, %arg3:!torch.vtensor<[?,16,384,64],f32>, %arg4:!torch.vtensor<[?,16,384,384],f32>, %arg5:!torch.vtensor<[1024,1024],f32>, %arg6:!torch.vtensor<[1024],f32>, %arg7:!torch.vtensor<[1],si64>,%arg8:!torch.vtensor<[3],si64>) -> !torch.vtensor<[?,384,1024],f32> attributes {torch.onnx_meta.ir_version = 4 : si64, torch.onnx_meta.opset_version = 21 : si64, torch.onnx_meta.producer_name = "pytorch", torch.onnx_meta.producer_version = "1.3"} {
%512 = torch.operator "onnx.MatMul"(%arg4, %arg3) : (!torch.vtensor<[?,16,384,384],f32>, !torch.vtensor<[?,16,384,64],f32>) -> !torch.vtensor<[?,16,384,64],f32>
%513 = torch.operator "onnx.Transpose"(%512) {torch.onnx.perm = [0 : si64, 2 : si64, 1 : si64, 3 : si64]} : (!torch.vtensor<[?,16,384,64],f32>) -> !torch.vtensor<[?,384,16,64],f32>
%514 = torch.operator "onnx.Shape"(%513) : (!torch.vtensor<[?,384,16,64],f32>) -> !torch.vtensor<[4],si64>
%515 = torch.operator "onnx.Constant"() {torch.onnx.value = dense_resource<__21> : tensor<si64>} : () -> !torch.vtensor<[],si64>
%516 = torch.operator "onnx.Gather"(%514, %515) {torch.onnx.axis = 0 : si64} : (!torch.vtensor<[4],si64>, !torch.vtensor<[],si64>) -> !torch.vtensor<[],si64>
%517 = torch.operator "onnx.Shape"(%513) : (!torch.vtensor<[?,384,16,64],f32>) -> !torch.vtensor<[4],si64>
%518 = torch.operator "onnx.Constant"() {torch.onnx.value = dense_resource<__22> : tensor<si64>} : () -> !torch.vtensor<[],si64>
%519 = torch.operator "onnx.Gather"(%517, %518) {torch.onnx.axis = 0 : si64} : (!torch.vtensor<[4],si64>, !torch.vtensor<[],si64>) -> !torch.vtensor<[],si64>
%520 = torch.operator "onnx.Constant"() {torch.onnx.value = dense_resource<__23> : tensor<si64>} : () -> !torch.vtensor<[],si64>
%522 = torch.operator "onnx.Unsqueeze"(%516, %arg7) : (!torch.vtensor<[],si64>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[1],si64>
%524 = torch.operator "onnx.Unsqueeze"(%519, %arg7) : (!torch.vtensor<[],si64>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[1],si64>
%526 = torch.operator "onnx.Unsqueeze"(%520, %arg7) : (!torch.vtensor<[],si64>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[1],si64>
%527 = torch.operator "onnx.Concat"(%522, %524, %526) {torch.onnx.axis = 0 : si64} : (!torch.vtensor<[1],si64>, !torch.vtensor<[1],si64>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[3],si64>
%528 = torch.operator "onnx.Reshape"(%513, %527) : (!torch.vtensor<[?,384,16,64],f32>, !torch.vtensor<[3],si64>) -> !torch.vtensor<[?,384,1024],f32>
%530 = torch.operator "onnx.MatMul"(%528, %arg5) : (!torch.vtensor<[?,384,1024],f32>, !torch.vtensor<[1024,1024],f32>) -> !torch.vtensor<[?,384,1024],f32>
%531 = torch.operator "onnx.Add"(%530, %arg6) : (!torch.vtensor<[?,384,1024],f32>, !torch.vtensor<[1024],f32>) -> !torch.vtensor<[?,384,1024],f32>
return %531: !torch.vtensor<[?,384,1024],f32>
}
}
{-#
dialect_resources: {
builtin: {
__21: "0x080000000000000000000000",
__22: "0x080000000100000000000000",
__23: "0x080000000004000000000000"
}
}
#-}
I am getting the following error:
model.torch_onnx.mlir:3:12: error: 'memref.alloca' op expected no unbounded stack allocations
%512 = torch.operator "onnx.MatMul"(%arg4, %arg3) : (!torch.vtensor<[?,16,384,384],f32>, !torch.vtensor<[?,16,384,64],f32>) -> !torch.vtensor<[?,16,384,64],f32>
IR after failure:
// -----// IR Dump After LLVMCPUCheckIRBeforeLLVMConversionPass Failed (iree-llvmcpu-check-ir-before-llvm-conversion) //----- //
func.func @"torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack"() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
%cst = arith.constant dense<0.000000e+00> : vector<8xf32>
%c7 = arith.constant 7 : index
%c6 = arith.constant 6 : index
%c5 = arith.constant 5 : index
%c3 = arith.constant 3 : index
%c32_i64 = arith.constant 32 : i64
%c48 = arith.constant 48 : index
%c16 = arith.constant 16 : index
%c0 = arith.constant 0 : index
%c2 = arith.constant 2 : index
%c32 = arith.constant 32 : index
%c64 = arith.constant 64 : index
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
%alloca = memref.alloca() {alignment = 64 : i64} : memref<1x1x8x4xf32>
%0 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(0) : i32
%1 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(1) : i32
%2 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(2) : i32
%3 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(3) : i32
%4 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(4) : i32
%5 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(5) : i32
%6 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(6) : i32
%7 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(7) : i32
%8 = arith.extui %0 : i32 to i64
%9 = arith.extui %1 : i32 to i64
%10 = arith.shli %9, %c32_i64 : i64
%11 = arith.ori %8, %10 : i64
%12 = arith.index_castui %11 : i64 to index
%13 = arith.extui %2 : i32 to i64
%14 = arith.extui %3 : i32 to i64
%15 = arith.shli %14, %c32_i64 : i64
%16 = arith.ori %13, %15 : i64
%17 = arith.index_castui %16 : i64 to index
%18 = arith.extui %4 : i32 to i64
%19 = arith.extui %5 : i32 to i64
%20 = arith.shli %19, %c32_i64 : i64
%21 = arith.ori %18, %20 : i64
%22 = arith.index_castui %21 : i64 to index
%23 = arith.extui %6 : i32 to i64
%24 = arith.extui %7 : i32 to i64
%25 = arith.shli %24, %c32_i64 : i64
%26 = arith.ori %23, %25 : i64
%27 = arith.index_castui %26 : i64 to index
%28 = hal.interface.binding.subspan layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) set(0) binding(0) alignment(64) offset(%12) flags("ReadOnly|Indirect") : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>{%27}
memref.assume_alignment %28, 1 : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
%29 = hal.interface.binding.subspan layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) set(0) binding(1) alignment(64) offset(%17) flags(Indirect) : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>{%27}
memref.assume_alignment %29, 1 : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%30 = affine.apply affine_map<()[s0] -> (s0 floordiv 16)>()[%22]
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
%31 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_id_y]
%32 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_count_y]
%33 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_id_x]
%34 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_count_x]
%alloca_0 = memref.alloca(%30) {alignment = 64 : i64} : memref<?x2x32x64xf32>
scf.for %arg0 = %31 to %c48 step %32 {
scf.for %arg1 = %33 to %c16 step %34 {
%subview = memref.subview %29[0, %arg0, %arg1, 0, 0, 0] [%30, 4, 2, 64, 8, 1] [1, 1, 1, 1, 1, 1] : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>> to memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%subview_1 = memref.subview %28[0, %arg1, %arg0, 0, 0, 0] [%30, 2, 4, 16, 8, 4] [1, 1, 1, 1, 1, 1] : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>> to memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
scf.for %arg2 = %c0 to %30 step %c1 {
scf.for %arg3 = %c0 to %c2 step %c1 {
scf.for %arg4 = %c0 to %c32 step %c1 {
%35 = affine.apply affine_map<(d0) -> (d0 floordiv 8)>(%arg4)
%36 = affine.apply affine_map<(d0) -> (d0 mod 8)>(%arg4)
scf.for %arg5 = %c0 to %c64 step %c1 {
%37 = affine.apply affine_map<(d0) -> (d0 floordiv 4)>(%arg5)
%38 = affine.apply affine_map<(d0) -> (d0 mod 4)>(%arg5)
%39 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c0, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%40 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c1, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%41 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c2, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%42 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c3, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%43 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c4, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%44 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c5, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%45 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c6, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%46 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c7, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%subview_2 = memref.subview %alloca[0, 0, 0, 0] [1, 1, 8, 4] [1, 1, 1, 1] : memref<1x1x8x4xf32> to memref<8x4xf32>
vector.store %39, %subview_2[%c0, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %40, %subview_2[%c1, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %41, %subview_2[%c2, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %42, %subview_2[%c3, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %43, %subview_2[%c4, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %44, %subview_2[%c5, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %45, %subview_2[%c6, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %46, %subview_2[%c7, %c0] : memref<8x4xf32>, vector<4xf32>
%subview_3 = memref.subview %alloca[0, 0, %36, %38] [1, 1, 1, 1] [1, 1, 1, 1] : memref<1x1x8x4xf32> to memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
%subview_4 = memref.subview %alloca_0[%arg2, %arg3, %arg4, %arg5] [1, 1, 1, 1] [1, 1, 1, 1] : memref<?x2x32x64xf32> to memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
%47 = memref.load %subview_3[%c0, %c0, %c0, %c0] : memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
memref.store %47, %subview_4[%c0, %c0, %c0, %c0] : memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
}
}
}
}
scf.for %arg2 = %c0 to %30 step %c1 {
scf.for %arg3 = %c0 to %c4 step %c1 {
scf.for %arg4 = %c0 to %c2 step %c1 {
scf.for %arg5 = %c0 to %c64 step %c1 {
%35 = affine.apply affine_map<(d0) -> (d0 * 8)>(%arg3)
%36 = memref.load %alloca_0[%arg2, %arg4, %35, %arg5] : memref<?x2x32x64xf32>
%37 = vector.broadcast %36 : f32 to vector<1xf32>
%38 = affine.apply affine_map<(d0) -> (d0 * 8 + 1)>(%arg3)
%39 = memref.load %alloca_0[%arg2, %arg4, %38, %arg5] : memref<?x2x32x64xf32>
%40 = vector.broadcast %39 : f32 to vector<1xf32>
%41 = affine.apply affine_map<(d0) -> (d0 * 8 + 2)>(%arg3)
%42 = memref.load %alloca_0[%arg2, %arg4, %41, %arg5] : memref<?x2x32x64xf32>
%43 = vector.broadcast %42 : f32 to vector<1xf32>
%44 = affine.apply affine_map<(d0) -> (d0 * 8 + 3)>(%arg3)
%45 = memref.load %alloca_0[%arg2, %arg4, %44, %arg5] : memref<?x2x32x64xf32>
%46 = vector.broadcast %45 : f32 to vector<1xf32>
%47 = affine.apply affine_map<(d0) -> (d0 * 8 + 4)>(%arg3)
%48 = memref.load %alloca_0[%arg2, %arg4, %47, %arg5] : memref<?x2x32x64xf32>
%49 = vector.broadcast %48 : f32 to vector<1xf32>
%50 = affine.apply affine_map<(d0) -> (d0 * 8 + 5)>(%arg3)
%51 = memref.load %alloca_0[%arg2, %arg4, %50, %arg5] : memref<?x2x32x64xf32>
%52 = vector.broadcast %51 : f32 to vector<1xf32>
%53 = affine.apply affine_map<(d0) -> (d0 * 8 + 6)>(%arg3)
%54 = memref.load %alloca_0[%arg2, %arg4, %53, %arg5] : memref<?x2x32x64xf32>
%55 = vector.broadcast %54 : f32 to vector<1xf32>
%56 = affine.apply affine_map<(d0) -> (d0 * 8 + 7)>(%arg3)
%57 = memref.load %alloca_0[%arg2, %arg4, %56, %arg5] : memref<?x2x32x64xf32>
%58 = vector.broadcast %57 : f32 to vector<1xf32>
%59 = vector.insert_strided_slice %37, %cst {offsets = [0], strides = [1]} : vector<1xf32> into vector<8xf32>
%60 = vector.insert_strided_slice %40, %59 {offsets = [1], strides = [1]} : vector<1xf32> into vector<8xf32>
%61 = vector.insert_strided_slice %43, %60 {offsets = [2], strides = [1]} : vector<1xf32> into vector<8xf32>
%62 = vector.insert_strided_slice %46, %61 {offsets = [3], strides = [1]} : vector<1xf32> into vector<8xf32>
%63 = vector.insert_strided_slice %49, %62 {offsets = [4], strides = [1]} : vector<1xf32> into vector<8xf32>
%64 = vector.insert_strided_slice %52, %63 {offsets = [5], strides = [1]} : vector<1xf32> into vector<8xf32>
%65 = vector.insert_strided_slice %55, %64 {offsets = [6], strides = [1]} : vector<1xf32> into vector<8xf32>
%66 = vector.insert_strided_slice %58, %65 {offsets = [7], strides = [1]} : vector<1xf32> into vector<8xf32>
%subview_2 = memref.subview %subview[0, 0, 0, 0, 0, 0] [%30, 4, 2, 64, 8, 1] [1, 1, 1, 1, 1, 1] : memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>> to memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>
vector.store %66, %subview_2[%arg2, %arg3, %arg4, %arg5, %c0] : memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>, vector<8xf32>
}
}
}
}
}
}
return
}
// -----// IR Dump After TranslateTargetExecutableVariantsPass Failed (iree-hal-translate-target-executable-variants) //----- //
hal.executable.variant public @embedded_elf_x86_64 target(<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>) {
hal.executable.export public @"torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack" ordinal(0) layout(#hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>]} {
^bb0(%arg0: !hal.device, %arg1: index, %arg2: index, %arg3: index):
%c8 = arith.constant 8 : index
%c12 = arith.constant 12 : index
%c1 = arith.constant 1 : index
hal.return %c8, %c12, %c1 : index, index, index
}
builtin.module {
func.func @"torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack"() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
%cst = arith.constant dense<0.000000e+00> : vector<8xf32>
%c7 = arith.constant 7 : index
%c6 = arith.constant 6 : index
%c5 = arith.constant 5 : index
%c3 = arith.constant 3 : index
%c32_i64 = arith.constant 32 : i64
%c48 = arith.constant 48 : index
%c16 = arith.constant 16 : index
%c0 = arith.constant 0 : index
%c2 = arith.constant 2 : index
%c32 = arith.constant 32 : index
%c64 = arith.constant 64 : index
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
%alloca = memref.alloca() {alignment = 64 : i64} : memref<1x1x8x4xf32>
%0 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(0) : i32
%1 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(1) : i32
%2 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(2) : i32
%3 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(3) : i32
%4 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(4) : i32
%5 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(5) : i32
%6 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(6) : i32
%7 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(7) : i32
%8 = arith.extui %0 : i32 to i64
%9 = arith.extui %1 : i32 to i64
%10 = arith.shli %9, %c32_i64 : i64
%11 = arith.ori %8, %10 : i64
%12 = arith.index_castui %11 : i64 to index
%13 = arith.extui %2 : i32 to i64
%14 = arith.extui %3 : i32 to i64
%15 = arith.shli %14, %c32_i64 : i64
%16 = arith.ori %13, %15 : i64
%17 = arith.index_castui %16 : i64 to index
%18 = arith.extui %4 : i32 to i64
%19 = arith.extui %5 : i32 to i64
%20 = arith.shli %19, %c32_i64 : i64
%21 = arith.ori %18, %20 : i64
%22 = arith.index_castui %21 : i64 to index
%23 = arith.extui %6 : i32 to i64
%24 = arith.extui %7 : i32 to i64
%25 = arith.shli %24, %c32_i64 : i64
%26 = arith.ori %23, %25 : i64
%27 = arith.index_castui %26 : i64 to index
%28 = hal.interface.binding.subspan layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) set(0) binding(0) alignment(64) offset(%12) flags("ReadOnly|Indirect") : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>{%27}
memref.assume_alignment %28, 1 : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
%29 = hal.interface.binding.subspan layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) set(0) binding(1) alignment(64) offset(%17) flags(Indirect) : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>{%27}
memref.assume_alignment %29, 1 : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%30 = affine.apply affine_map<()[s0] -> (s0 floordiv 16)>()[%22]
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
%31 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_id_y]
%32 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_count_y]
%33 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_id_x]
%34 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_count_x]
%alloca_0 = memref.alloca(%30) {alignment = 64 : i64} : memref<?x2x32x64xf32>
scf.for %arg0 = %31 to %c48 step %32 {
scf.for %arg1 = %33 to %c16 step %34 {
%subview = memref.subview %29[0, %arg0, %arg1, 0, 0, 0] [%30, 4, 2, 64, 8, 1] [1, 1, 1, 1, 1, 1] : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>> to memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%subview_1 = memref.subview %28[0, %arg1, %arg0, 0, 0, 0] [%30, 2, 4, 16, 8, 4] [1, 1, 1, 1, 1, 1] : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>> to memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
scf.for %arg2 = %c0 to %30 step %c1 {
scf.for %arg3 = %c0 to %c2 step %c1 {
scf.for %arg4 = %c0 to %c32 step %c1 {
%35 = affine.apply affine_map<(d0) -> (d0 floordiv 8)>(%arg4)
%36 = affine.apply affine_map<(d0) -> (d0 mod 8)>(%arg4)
scf.for %arg5 = %c0 to %c64 step %c1 {
%37 = affine.apply affine_map<(d0) -> (d0 floordiv 4)>(%arg5)
%38 = affine.apply affine_map<(d0) -> (d0 mod 4)>(%arg5)
%39 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c0, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%40 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c1, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%41 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c2, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%42 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c3, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%43 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c4, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%44 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c5, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%45 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c6, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%46 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c7, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%subview_2 = memref.subview %alloca[0, 0, 0, 0] [1, 1, 8, 4] [1, 1, 1, 1] : memref<1x1x8x4xf32> to memref<8x4xf32>
vector.store %39, %subview_2[%c0, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %40, %subview_2[%c1, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %41, %subview_2[%c2, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %42, %subview_2[%c3, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %43, %subview_2[%c4, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %44, %subview_2[%c5, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %45, %subview_2[%c6, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %46, %subview_2[%c7, %c0] : memref<8x4xf32>, vector<4xf32>
%subview_3 = memref.subview %alloca[0, 0, %36, %38] [1, 1, 1, 1] [1, 1, 1, 1] : memref<1x1x8x4xf32> to memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
%subview_4 = memref.subview %alloca_0[%arg2, %arg3, %arg4, %arg5] [1, 1, 1, 1] [1, 1, 1, 1] : memref<?x2x32x64xf32> to memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
%47 = memref.load %subview_3[%c0, %c0, %c0, %c0] : memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
memref.store %47, %subview_4[%c0, %c0, %c0, %c0] : memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
}
}
}
}
scf.for %arg2 = %c0 to %30 step %c1 {
scf.for %arg3 = %c0 to %c4 step %c1 {
scf.for %arg4 = %c0 to %c2 step %c1 {
scf.for %arg5 = %c0 to %c64 step %c1 {
%35 = affine.apply affine_map<(d0) -> (d0 * 8)>(%arg3)
%36 = memref.load %alloca_0[%arg2, %arg4, %35, %arg5] : memref<?x2x32x64xf32>
%37 = vector.broadcast %36 : f32 to vector<1xf32>
%38 = affine.apply affine_map<(d0) -> (d0 * 8 + 1)>(%arg3)
%39 = memref.load %alloca_0[%arg2, %arg4, %38, %arg5] : memref<?x2x32x64xf32>
%40 = vector.broadcast %39 : f32 to vector<1xf32>
%41 = affine.apply affine_map<(d0) -> (d0 * 8 + 2)>(%arg3)
%42 = memref.load %alloca_0[%arg2, %arg4, %41, %arg5] : memref<?x2x32x64xf32>
%43 = vector.broadcast %42 : f32 to vector<1xf32>
%44 = affine.apply affine_map<(d0) -> (d0 * 8 + 3)>(%arg3)
%45 = memref.load %alloca_0[%arg2, %arg4, %44, %arg5] : memref<?x2x32x64xf32>
%46 = vector.broadcast %45 : f32 to vector<1xf32>
%47 = affine.apply affine_map<(d0) -> (d0 * 8 + 4)>(%arg3)
%48 = memref.load %alloca_0[%arg2, %arg4, %47, %arg5] : memref<?x2x32x64xf32>
%49 = vector.broadcast %48 : f32 to vector<1xf32>
%50 = affine.apply affine_map<(d0) -> (d0 * 8 + 5)>(%arg3)
%51 = memref.load %alloca_0[%arg2, %arg4, %50, %arg5] : memref<?x2x32x64xf32>
%52 = vector.broadcast %51 : f32 to vector<1xf32>
%53 = affine.apply affine_map<(d0) -> (d0 * 8 + 6)>(%arg3)
%54 = memref.load %alloca_0[%arg2, %arg4, %53, %arg5] : memref<?x2x32x64xf32>
%55 = vector.broadcast %54 : f32 to vector<1xf32>
%56 = affine.apply affine_map<(d0) -> (d0 * 8 + 7)>(%arg3)
%57 = memref.load %alloca_0[%arg2, %arg4, %56, %arg5] : memref<?x2x32x64xf32>
%58 = vector.broadcast %57 : f32 to vector<1xf32>
%59 = vector.insert_strided_slice %37, %cst {offsets = [0], strides = [1]} : vector<1xf32> into vector<8xf32>
%60 = vector.insert_strided_slice %40, %59 {offsets = [1], strides = [1]} : vector<1xf32> into vector<8xf32>
%61 = vector.insert_strided_slice %43, %60 {offsets = [2], strides = [1]} : vector<1xf32> into vector<8xf32>
%62 = vector.insert_strided_slice %46, %61 {offsets = [3], strides = [1]} : vector<1xf32> into vector<8xf32>
%63 = vector.insert_strided_slice %49, %62 {offsets = [4], strides = [1]} : vector<1xf32> into vector<8xf32>
%64 = vector.insert_strided_slice %52, %63 {offsets = [5], strides = [1]} : vector<1xf32> into vector<8xf32>
%65 = vector.insert_strided_slice %55, %64 {offsets = [6], strides = [1]} : vector<1xf32> into vector<8xf32>
%66 = vector.insert_strided_slice %58, %65 {offsets = [7], strides = [1]} : vector<1xf32> into vector<8xf32>
%subview_2 = memref.subview %subview[0, 0, 0, 0, 0, 0] [%30, 4, 2, 64, 8, 1] [1, 1, 1, 1, 1, 1] : memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>> to memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>
vector.store %66, %subview_2[%arg2, %arg3, %arg4, %arg5, %c0] : memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>, vector<8xf32>
}
}
}
}
}
}
return
}
}
}
failed to translate executables
// -----// IR Dump After TranslateExecutablesPass Failed (iree-hal-translate-executables) //----- //
hal.executable private @"torch-jit-export$async_dispatch_3" {
hal.executable.variant public @embedded_elf_x86_64 target(<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>) {
hal.executable.export public @"torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack" ordinal(0) layout(#hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>]} {
^bb0(%arg0: !hal.device, %arg1: index, %arg2: index, %arg3: index):
%c8 = arith.constant 8 : index
%c12 = arith.constant 12 : index
%c1 = arith.constant 1 : index
hal.return %c8, %c12, %c1 : index, index, index
}
builtin.module {
func.func @"torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack"() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
%cst = arith.constant dense<0.000000e+00> : vector<8xf32>
%c7 = arith.constant 7 : index
%c6 = arith.constant 6 : index
%c5 = arith.constant 5 : index
%c3 = arith.constant 3 : index
%c32_i64 = arith.constant 32 : i64
%c48 = arith.constant 48 : index
%c16 = arith.constant 16 : index
%c0 = arith.constant 0 : index
%c2 = arith.constant 2 : index
%c32 = arith.constant 32 : index
%c64 = arith.constant 64 : index
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
%alloca = memref.alloca() {alignment = 64 : i64} : memref<1x1x8x4xf32>
%0 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(0) : i32
%1 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(1) : i32
%2 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(2) : i32
%3 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(3) : i32
%4 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(4) : i32
%5 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(5) : i32
%6 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(6) : i32
%7 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(7) : i32
%8 = arith.extui %0 : i32 to i64
%9 = arith.extui %1 : i32 to i64
%10 = arith.shli %9, %c32_i64 : i64
%11 = arith.ori %8, %10 : i64
%12 = arith.index_castui %11 : i64 to index
%13 = arith.extui %2 : i32 to i64
%14 = arith.extui %3 : i32 to i64
%15 = arith.shli %14, %c32_i64 : i64
%16 = arith.ori %13, %15 : i64
%17 = arith.index_castui %16 : i64 to index
%18 = arith.extui %4 : i32 to i64
%19 = arith.extui %5 : i32 to i64
%20 = arith.shli %19, %c32_i64 : i64
%21 = arith.ori %18, %20 : i64
%22 = arith.index_castui %21 : i64 to index
%23 = arith.extui %6 : i32 to i64
%24 = arith.extui %7 : i32 to i64
%25 = arith.shli %24, %c32_i64 : i64
%26 = arith.ori %23, %25 : i64
%27 = arith.index_castui %26 : i64 to index
%28 = hal.interface.binding.subspan layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) set(0) binding(0) alignment(64) offset(%12) flags("ReadOnly|Indirect") : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>{%27}
memref.assume_alignment %28, 1 : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
%29 = hal.interface.binding.subspan layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) set(0) binding(1) alignment(64) offset(%17) flags(Indirect) : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>{%27}
memref.assume_alignment %29, 1 : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%30 = affine.apply affine_map<()[s0] -> (s0 floordiv 16)>()[%22]
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
%31 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_id_y]
%32 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_count_y]
%33 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_id_x]
%34 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_count_x]
%alloca_0 = memref.alloca(%30) {alignment = 64 : i64} : memref<?x2x32x64xf32>
scf.for %arg0 = %31 to %c48 step %32 {
scf.for %arg1 = %33 to %c16 step %34 {
%subview = memref.subview %29[0, %arg0, %arg1, 0, 0, 0] [%30, 4, 2, 64, 8, 1] [1, 1, 1, 1, 1, 1] : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>> to memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%subview_1 = memref.subview %28[0, %arg1, %arg0, 0, 0, 0] [%30, 2, 4, 16, 8, 4] [1, 1, 1, 1, 1, 1] : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>> to memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
scf.for %arg2 = %c0 to %30 step %c1 {
scf.for %arg3 = %c0 to %c2 step %c1 {
scf.for %arg4 = %c0 to %c32 step %c1 {
%35 = affine.apply affine_map<(d0) -> (d0 floordiv 8)>(%arg4)
%36 = affine.apply affine_map<(d0) -> (d0 mod 8)>(%arg4)
scf.for %arg5 = %c0 to %c64 step %c1 {
%37 = affine.apply affine_map<(d0) -> (d0 floordiv 4)>(%arg5)
%38 = affine.apply affine_map<(d0) -> (d0 mod 4)>(%arg5)
%39 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c0, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%40 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c1, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%41 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c2, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%42 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c3, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%43 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c4, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%44 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c5, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%45 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c6, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%46 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c7, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%subview_2 = memref.subview %alloca[0, 0, 0, 0] [1, 1, 8, 4] [1, 1, 1, 1] : memref<1x1x8x4xf32> to memref<8x4xf32>
vector.store %39, %subview_2[%c0, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %40, %subview_2[%c1, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %41, %subview_2[%c2, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %42, %subview_2[%c3, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %43, %subview_2[%c4, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %44, %subview_2[%c5, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %45, %subview_2[%c6, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %46, %subview_2[%c7, %c0] : memref<8x4xf32>, vector<4xf32>
%subview_3 = memref.subview %alloca[0, 0, %36, %38] [1, 1, 1, 1] [1, 1, 1, 1] : memref<1x1x8x4xf32> to memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
%subview_4 = memref.subview %alloca_0[%arg2, %arg3, %arg4, %arg5] [1, 1, 1, 1] [1, 1, 1, 1] : memref<?x2x32x64xf32> to memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
%47 = memref.load %subview_3[%c0, %c0, %c0, %c0] : memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
memref.store %47, %subview_4[%c0, %c0, %c0, %c0] : memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
}
}
}
}
scf.for %arg2 = %c0 to %30 step %c1 {
scf.for %arg3 = %c0 to %c4 step %c1 {
scf.for %arg4 = %c0 to %c2 step %c1 {
scf.for %arg5 = %c0 to %c64 step %c1 {
%35 = affine.apply affine_map<(d0) -> (d0 * 8)>(%arg3)
%36 = memref.load %alloca_0[%arg2, %arg4, %35, %arg5] : memref<?x2x32x64xf32>
%37 = vector.broadcast %36 : f32 to vector<1xf32>
%38 = affine.apply affine_map<(d0) -> (d0 * 8 + 1)>(%arg3)
%39 = memref.load %alloca_0[%arg2, %arg4, %38, %arg5] : memref<?x2x32x64xf32>
%40 = vector.broadcast %39 : f32 to vector<1xf32>
%41 = affine.apply affine_map<(d0) -> (d0 * 8 + 2)>(%arg3)
%42 = memref.load %alloca_0[%arg2, %arg4, %41, %arg5] : memref<?x2x32x64xf32>
%43 = vector.broadcast %42 : f32 to vector<1xf32>
%44 = affine.apply affine_map<(d0) -> (d0 * 8 + 3)>(%arg3)
%45 = memref.load %alloca_0[%arg2, %arg4, %44, %arg5] : memref<?x2x32x64xf32>
%46 = vector.broadcast %45 : f32 to vector<1xf32>
%47 = affine.apply affine_map<(d0) -> (d0 * 8 + 4)>(%arg3)
%48 = memref.load %alloca_0[%arg2, %arg4, %47, %arg5] : memref<?x2x32x64xf32>
%49 = vector.broadcast %48 : f32 to vector<1xf32>
%50 = affine.apply affine_map<(d0) -> (d0 * 8 + 5)>(%arg3)
%51 = memref.load %alloca_0[%arg2, %arg4, %50, %arg5] : memref<?x2x32x64xf32>
%52 = vector.broadcast %51 : f32 to vector<1xf32>
%53 = affine.apply affine_map<(d0) -> (d0 * 8 + 6)>(%arg3)
%54 = memref.load %alloca_0[%arg2, %arg4, %53, %arg5] : memref<?x2x32x64xf32>
%55 = vector.broadcast %54 : f32 to vector<1xf32>
%56 = affine.apply affine_map<(d0) -> (d0 * 8 + 7)>(%arg3)
%57 = memref.load %alloca_0[%arg2, %arg4, %56, %arg5] : memref<?x2x32x64xf32>
%58 = vector.broadcast %57 : f32 to vector<1xf32>
%59 = vector.insert_strided_slice %37, %cst {offsets = [0], strides = [1]} : vector<1xf32> into vector<8xf32>
%60 = vector.insert_strided_slice %40, %59 {offsets = [1], strides = [1]} : vector<1xf32> into vector<8xf32>
%61 = vector.insert_strided_slice %43, %60 {offsets = [2], strides = [1]} : vector<1xf32> into vector<8xf32>
%62 = vector.insert_strided_slice %46, %61 {offsets = [3], strides = [1]} : vector<1xf32> into vector<8xf32>
%63 = vector.insert_strided_slice %49, %62 {offsets = [4], strides = [1]} : vector<1xf32> into vector<8xf32>
%64 = vector.insert_strided_slice %52, %63 {offsets = [5], strides = [1]} : vector<1xf32> into vector<8xf32>
%65 = vector.insert_strided_slice %55, %64 {offsets = [6], strides = [1]} : vector<1xf32> into vector<8xf32>
%66 = vector.insert_strided_slice %58, %65 {offsets = [7], strides = [1]} : vector<1xf32> into vector<8xf32>
%subview_2 = memref.subview %subview[0, 0, 0, 0, 0, 0] [%30, 4, 2, 64, 8, 1] [1, 1, 1, 1, 1, 1] : memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>> to memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>
vector.store %66, %subview_2[%arg2, %arg3, %arg4, %arg5, %c0] : memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>, vector<8xf32>
}
}
}
}
}
}
return
}
}
}
}
model.torch_onnx.mlir:3:12: error: 'memref.alloca' op expected no unbounded stack allocations
%512 = torch.operator "onnx.MatMul"(%arg4, %arg3) : (!torch.vtensor<[?,16,384,384],f32>, !torch.vtensor<[?,16,384,64],f32>) -> !torch.vtensor<[?,16,384,64],f32>
^
model.torch_onnx.mlir:3:12: note: see current operation: %54 = "memref.alloca"(%45) <{alignment = 64 : i64, operandSegmentSizes = array<i32: 1, 0>}> : (index) -> memref<?x2x32x64xf32>
model.torch_onnx.mlir:17:12: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
%530 = torch.operator "onnx.MatMul"(%528, %arg5) : (!torch.vtensor<[?,384,1024],f32>, !torch.vtensor<[1024,1024],f32>) -> !torch.vtensor<[?,384,1024],f32>
^
model.torch_onnx.mlir:17:12: note: see current operation:
"hal.executable.variant"() ({
"hal.executable.export"() ({
^bb0(%arg10: !hal.device, %arg11: index, %arg12: index, %arg13: index):
%106 = "arith.constant"() <{value = 8 : index}> : () -> index
%107 = "arith.constant"() <{value = 12 : index}> : () -> index
%108 = "arith.constant"() <{value = 1 : index}> : () -> index
"hal.return"(%106, %107, %108) : (index, index, index) -> ()
}) {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>], layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 0 : index, sym_name = "torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack"} : () -> ()
"builtin.module"() ({
"func.func"() <{function_type = () -> (), sym_name = "torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack"}> ({
%0 = "arith.constant"() <{value = dense<0.000000e+00> : vector<8xf32>}> : () -> vector<8xf32>
%1 = "arith.constant"() <{value = 7 : index}> : () -> index
%2 = "arith.constant"() <{value = 6 : index}> : () -> index
%3 = "arith.constant"() <{value = 5 : index}> : () -> index
%4 = "arith.constant"() <{value = 3 : index}> : () -> index
%5 = "arith.constant"() <{value = 32 : i64}> : () -> i64
%6 = "arith.constant"() <{value = 48 : index}> : () -> index
%7 = "arith.constant"() <{value = 16 : index}> : () -> index
%8 = "arith.constant"() <{value = 0 : index}> : () -> index
%9 = "arith.constant"() <{value = 2 : index}> : () -> index
%10 = "arith.constant"() <{value = 32 : index}> : () -> index
%11 = "arith.constant"() <{value = 64 : index}> : () -> index
%12 = "arith.constant"() <{value = 1 : index}> : () -> index
%13 = "arith.constant"() <{value = 4 : index}> : () -> index
%14 = "memref.alloca"() <{alignment = 64 : i64, operandSegmentSizes = array<i32: 0, 0>}> : () -> memref<1x1x8x4xf32>
%15 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 0 : index} : () -> i32
%16 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 1 : index} : () -> i32
%17 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 2 : index} : () -> i32
%18 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 3 : index} : () -> i32
%19 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 4 : index} : () -> i32
%20 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 5 : index} : () -> i32
%21 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 6 : index} : () -> i32
%22 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 7 : index} : () -> i32
%23 = "arith.extui"(%15) : (i32) -> i64
%24 = "arith.extui"(%16) : (i32) -> i64
%25 = "arith.shli"(%24, %5) <{overflowFlags = #arith.overflow<none>}> : (i64, i64) -> i64
%26 = "arith.ori"(%23, %25) : (i64, i64) -> i64
%27 = "arith.index_castui"(%26) : (i64) -> index
%28 = "arith.extui"(%17) : (i32) -> i64
%29 = "arith.extui"(%18) : (i32) -> i64
%30 = "arith.shli"(%29, %5) <{overflowFlags = #arith.overflow<none>}> : (i64, i64) -> i64
%31 = "arith.ori"(%28, %30) : (i64, i64) -> i64
%32 = "arith.index_castui"(%31) : (i64) -> index
%33 = "arith.extui"(%19) : (i32) -> i64
%34 = "arith.extui"(%20) : (i32) -> i64
%35 = "arith.shli"(%34, %5) <{overflowFlags = #arith.overflow<none>}> : (i64, i64) -> i64
%36 = "arith.ori"(%33, %35) : (i64, i64) -> i64
%37 = "arith.index_castui"(%36) : (i64) -> index
%38 = "arith.extui"(%21) : (i32) -> i64
%39 = "arith.extui"(%22) : (i32) -> i64
%40 = "arith.shli"(%39, %5) <{overflowFlags = #arith.overflow<none>}> : (i64, i64) -> i64
%41 = "arith.ori"(%38, %40) : (i64, i64) -> i64
%42 = "arith.index_castui"(%41) : (i64) -> index
%43 = "hal.interface.binding.subspan"(%27, %42) {alignment = 64 : index, binding = 0 : index, descriptor_flags = 3 : i32, layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, operandSegmentSizes = array<i32: 1, 1>, set = 0 : index} : (index, index) -> memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
"memref.assume_alignment"(%43) <{alignment = 1 : i32}> : (memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>) -> ()
%44 = "hal.interface.binding.subspan"(%32, %42) {alignment = 64 : index, binding = 1 : index, descriptor_flags = 2 : i32, layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, operandSegmentSizes = array<i32: 1, 1>, set = 0 : index} : (index, index) -> memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
"memref.assume_alignment"(%44) <{alignment = 1 : i32}> : (memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>) -> ()
%45 = "affine.apply"(%37) <{map = affine_map<()[s0] -> (s0 floordiv 16)>}> : (index) -> index
%46 = "hal.interface.workgroup.id"() {dimension = 0 : index} : () -> index
%47 = "hal.interface.workgroup.count"() {dimension = 0 : index} : () -> index
%48 = "hal.interface.workgroup.id"() {dimension = 1 : index} : () -> index
%49 = "hal.interface.workgroup.count"() {dimension = 1 : index} : () -> index
%50 = "affine.apply"(%48) <{map = affine_map<()[s0] -> (s0 * 4)>}> : (index) -> index
%51 = "affine.apply"(%49) <{map = affine_map<()[s0] -> (s0 * 4)>}> : (index) -> index
%52 = "affine.apply"(%46) <{map = affine_map<()[s0] -> (s0 * 2)>}> : (index) -> index
%53 = "affine.apply"(%47) <{map = affine_map<()[s0] -> (s0 * 2)>}> : (index) -> index
%54 = "memref.alloca"(%45) <{alignment = 64 : i64, operandSegmentSizes = array<i32: 1, 0>}> : (index) -> memref<?x2x32x64xf32>
"scf.for"(%50, %6, %51) ({
^bb0(%arg0: index):
"scf.for"(%52, %7, %53) ({
^bb0(%arg1: index):
%55 = "memref.subview"(%44, %arg0, %arg1, %45) <{operandSegmentSizes = array<i32: 1, 2, 1, 0>, static_offsets = array<i64: 0, -9223372036854775808, -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: -9223372036854775808, 4, 2, 64, 8, 1>, static_strides = array<i64: 1, 1, 1, 1, 1, 1>}> : (memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>, index, index, index) -> memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%56 = "memref.subview"(%43, %arg1, %arg0, %45) <{operandSegmentSizes = array<i32: 1, 2, 1, 0>, static_offsets = array<i64: 0, -9223372036854775808, -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: -9223372036854775808, 2, 4, 16, 8, 4>, static_strides = array<i64: 1, 1, 1, 1, 1, 1>}> : (memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index) -> memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
"scf.for"(%8, %45, %12) ({
^bb0(%arg6: index):
"scf.for"(%8, %9, %12) ({
^bb0(%arg7: index):
"scf.for"(%8, %10, %12) ({
^bb0(%arg8: index):
%90 = "affine.apply"(%arg8) <{map = affine_map<(d0) -> (d0 floordiv 8)>}> : (index) -> index
%91 = "affine.apply"(%arg8) <{map = affine_map<(d0) -> (d0 mod 8)>}> : (index) -> index
"scf.for"(%8, %11, %12) ({
^bb0(%arg9: index):
%92 = "affine.apply"(%arg9) <{map = affine_map<(d0) -> (d0 floordiv 4)>}> : (index) -> index
%93 = "affine.apply"(%arg9) <{map = affine_map<(d0) -> (d0 mod 4)>}> : (index) -> index
%94 = "vector.load"(%56, %arg6, %arg7, %90, %92, %8, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%95 = "vector.load"(%56, %arg6, %arg7, %90, %92, %12, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%96 = "vector.load"(%56, %arg6, %arg7, %90, %92, %9, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%97 = "vector.load"(%56, %arg6, %arg7, %90, %92, %4, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%98 = "vector.load"(%56, %arg6, %arg7, %90, %92, %13, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%99 = "vector.load"(%56, %arg6, %arg7, %90, %92, %3, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%100 = "vector.load"(%56, %arg6, %arg7, %90, %92, %2, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%101 = "vector.load"(%56, %arg6, %arg7, %90, %92, %1, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%102 = "memref.subview"(%14) <{operandSegmentSizes = array<i32: 1, 0, 0, 0>, static_offsets = array<i64: 0, 0, 0, 0>, static_sizes = array<i64: 1, 1, 8, 4>, static_strides = array<i64: 1, 1, 1, 1>}> : (memref<1x1x8x4xf32>) -> memref<8x4xf32>
"vector.store"(%94, %102, %8, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%95, %102, %12, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%96, %102, %9, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%97, %102, %4, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%98, %102, %13, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%99, %102, %3, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%100, %102, %2, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%101, %102, %1, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
%103 = "memref.subview"(%14, %91, %93) <{operandSegmentSizes = array<i32: 1, 2, 0, 0>, static_offsets = array<i64: 0, 0, -9223372036854775808, -9223372036854775808>, static_sizes = array<i64: 1, 1, 1, 1>, static_strides = array<i64: 1, 1, 1, 1>}> : (memref<1x1x8x4xf32>, index, index) -> memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
%104 = "memref.subview"(%54, %arg6, %arg7, %arg8, %arg9) <{operandSegmentSizes = array<i32: 1, 4, 0, 0>, static_offsets = array<i64: -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808>, static_sizes = array<i64: 1, 1, 1, 1>, static_strides = array<i64: 1, 1, 1, 1>}> : (memref<?x2x32x64xf32>, index, index, index, index) -> memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
%105 = "memref.load"(%103, %8, %8, %8, %8) <{nontemporal = false}> : (memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>, index, index, index, index) -> f32
"memref.store"(%105, %104, %8, %8, %8, %8) <{nontemporal = false}> : (f32, memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>, index, index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.for"(%8, %45, %12) ({
^bb0(%arg2: index):
"scf.for"(%8, %13, %12) ({
^bb0(%arg3: index):
"scf.for"(%8, %9, %12) ({
^bb0(%arg4: index):
"scf.for"(%8, %11, %12) ({
^bb0(%arg5: index):
%57 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8)>}> : (index) -> index
%58 = "memref.load"(%54, %arg2, %arg4, %57, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%59 = "vector.broadcast"(%58) : (f32) -> vector<1xf32>
%60 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 1)>}> : (index) -> index
%61 = "memref.load"(%54, %arg2, %arg4, %60, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%62 = "vector.broadcast"(%61) : (f32) -> vector<1xf32>
%63 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 2)>}> : (index) -> index
%64 = "memref.load"(%54, %arg2, %arg4, %63, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%65 = "vector.broadcast"(%64) : (f32) -> vector<1xf32>
%66 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 3)>}> : (index) -> index
%67 = "memref.load"(%54, %arg2, %arg4, %66, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%68 = "vector.broadcast"(%67) : (f32) -> vector<1xf32>
%69 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 4)>}> : (index) -> index
%70 = "memref.load"(%54, %arg2, %arg4, %69, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%71 = "vector.broadcast"(%70) : (f32) -> vector<1xf32>
%72 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 5)>}> : (index) -> index
%73 = "memref.load"(%54, %arg2, %arg4, %72, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%74 = "vector.broadcast"(%73) : (f32) -> vector<1xf32>
%75 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 6)>}> : (index) -> index
%76 = "memref.load"(%54, %arg2, %arg4, %75, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%77 = "vector.broadcast"(%76) : (f32) -> vector<1xf32>
%78 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 7)>}> : (index) -> index
%79 = "memref.load"(%54, %arg2, %arg4, %78, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%80 = "vector.broadcast"(%79) : (f32) -> vector<1xf32>
%81 = "vector.insert_strided_slice"(%59, %0) <{offsets = [0], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%82 = "vector.insert_strided_slice"(%62, %81) <{offsets = [1], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%83 = "vector.insert_strided_slice"(%65, %82) <{offsets = [2], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%84 = "vector.insert_strided_slice"(%68, %83) <{offsets = [3], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%85 = "vector.insert_strided_slice"(%71, %84) <{offsets = [4], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%86 = "vector.insert_strided_slice"(%74, %85) <{offsets = [5], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%87 = "vector.insert_strided_slice"(%77, %86) <{offsets = [6], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%88 = "vector.insert_strided_slice"(%80, %87) <{offsets = [7], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%89 = "memref.subview"(%55, %45) <{operandSegmentSizes = array<i32: 1, 0, 1, 0>, static_offsets = array<i64: 0, 0, 0, 0, 0, 0>, static_sizes = array<i64: -9223372036854775808, 4, 2, 64, 8, 1>, static_strides = array<i64: 1, 1, 1, 1, 1, 1>}> : (memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>, index) -> memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>
"vector.store"(%88, %89, %arg2, %arg3, %arg4, %arg5, %8) <{nontemporal = false}> : (vector<8xf32>, memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>, index, index, index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"func.return"() : () -> ()
}) {translation_info = #iree_codegen.translation_info<CPUDataTiling>} : () -> ()
}) : () -> ()
"hal.executable.variant_end"() : () -> ()
}) {sym_name = "embedded_elf_x86_64", target = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>} : () -> ()
Steps to reproduce your issue
Command to reproduce the issue:
iree-compile model.torch_onnx.mlir --iree-hal-target-backends=llvm-cpu --iree-input-demote-i64-to-i32
IREE version: IREE compiler version 20240819.990 @ aeda14995f16ed1302db616adf0c03acf80f27ee LLVM version 20.0.0git
What component(s) does this issue relate to?
Compiler
Version information
No response
Additional context
No response
@lialan if you are looking at this, please attach the IR dump generated with --mlir-print-ir-after-all. I can then redirect appropriately.
Problematic memref.alloca:
%54 = "memref.alloca"(%45) <{alignment = 64 : i64, operandSegmentSizes = array<i32: 1, 0>}> : (index) -> memref<?x2x32x64xf32>
It contains a dynamic shape. The dynamic size %45 was defined as:
%45 = "affine.apply"(%37) <{map = affine_map<()[s0] -> (s0 floordiv 16)>}> : (index) -> index
Again, the value cannot be simplified to a static bound, so the check for bounded stack allocations fails.
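For reference, an abbreviated excerpt of that chain from the dump above (elided lines marked with ...): the size ultimately derives from push constants loaded via hal.interface.constant.load, so it has no compile-time upper bound.
%19 = "hal.interface.constant.load"() {..., ordinal = 4 : index} : () -> i32
...
%37 = "arith.index_castui"(%36) : (i64) -> index
...
%45 = "affine.apply"(%37) <{map = affine_map<()[s0] -> (s0 floordiv 16)>}> : (index) -> index
%54 = "memref.alloca"(%45) <{alignment = 64 : i64, operandSegmentSizes = array<i32: 1, 0>}> : (index) -> memref<?x2x32x64xf32>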
@MaheshRavishankar the IR is attached in @pdhirajkumarprasad's comment; just grep for the keywords on this page.
That's not enough to see what is going on. I see the alloca is dynamic, but I need to see the IR dump after all passes to really understand. For all such bug reports it will be easier for me to redirect if I can just get the dump with --mlir-print-ir-after-all --mlir-print-ir-before-all --mlir-disable-threading --mlir-elide-elementsattrs-if-larger=4.
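For example, combining the reproduce command above with those flags (redirecting stderr to capture the dump; ir_dump.txt is just an example output name) would look roughly like:
iree-compile model.torch_onnx.mlir --iree-hal-target-backends=llvm-cpu --iree-input-demote-i64-to-i32 --mlir-print-ir-after-all --mlir-print-ir-before-all --mlir-disable-threading --mlir-elide-elementsattrs-if-larger=4 2> ir_dump.txt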
@MaheshRavishankar see attached: 18297.mlir.txt
Seems like we are missing a folder for unpack -> pack:
%34 = affine.apply affine_map<()[s0] -> (s0 floordiv 16)>()[%32]
%35 = tensor.empty(%34) : tensor<?x16x384x64xf32>
%unpack = tensor.unpack %33 outer_dims_perm = [0, 1, 2, 3] inner_dims_pos = [2, 3] inner_tiles = [8, 4] into %35 {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 4, 2, 64], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]>} : tensor<?x16x48x16x8x4xf32> -> tensor<?x16x384x64xf32>
%36 = tensor.empty(%34) : tensor<?x48x16x64x8x1xf32>
%pack = tensor.pack %unpack outer_dims_perm = [0, 2, 1, 3] inner_dims_pos = [2, 3] inner_tiles = [8, 1] into %36 {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 4, 2, 64], [1, 1, 1, 1]]>} : tensor<?x16x384x64xf32> -> tensor<?x48x16x64x8x1xf32>
This is just a transpose AFAICS.
@pashu123 could you fix this?
Confirmed the issue is gone, along with #18296, on the latest main branch.
Moving this out of the CPU project board.
@pashu123 we still need to fuse pack+unpack, per @MaheshRavishankar.
Is this failing again on main? The last check was that this isn't failing on main.
I verified this is working on main. We can close this and create a new issue for the pack+unpack folding.
Verified, the issue doesn't exist in the latest build.