[compile][cpu]:error: 'memref.alloca' op expected no unbounded stack allocations
What happened?
For the given IR:
module {
func.func @"torch-jit-export"(%arg0: !torch.vtensor<[?,384],si64>, %arg1: !torch.vtensor<[?,384],si64>, %arg2: !torch.vtensor<[?,384],si64>, %arg3:!torch.vtensor<[?,16,384,64],f32>, %arg4:!torch.vtensor<[?,16,384,384],f32>, %arg5:!torch.vtensor<[1024,1024],f32>, %arg6:!torch.vtensor<[1024],f32>, %arg7:!torch.vtensor<[1],si64>,%arg8:!torch.vtensor<[3],si64>) -> !torch.vtensor<[?,384,1024],f32> attributes {torch.onnx_meta.ir_version = 4 : si64, torch.onnx_meta.opset_version = 21 : si64, torch.onnx_meta.producer_name = "pytorch", torch.onnx_meta.producer_version = "1.3"} {
%512 = torch.operator "onnx.MatMul"(%arg4, %arg3) : (!torch.vtensor<[?,16,384,384],f32>, !torch.vtensor<[?,16,384,64],f32>) -> !torch.vtensor<[?,16,384,64],f32>
%513 = torch.operator "onnx.Transpose"(%512) {torch.onnx.perm = [0 : si64, 2 : si64, 1 : si64, 3 : si64]} : (!torch.vtensor<[?,16,384,64],f32>) -> !torch.vtensor<[?,384,16,64],f32>
%514 = torch.operator "onnx.Shape"(%513) : (!torch.vtensor<[?,384,16,64],f32>) -> !torch.vtensor<[4],si64>
%515 = torch.operator "onnx.Constant"() {torch.onnx.value = dense_resource<__21> : tensor<si64>} : () -> !torch.vtensor<[],si64>
%516 = torch.operator "onnx.Gather"(%514, %515) {torch.onnx.axis = 0 : si64} : (!torch.vtensor<[4],si64>, !torch.vtensor<[],si64>) -> !torch.vtensor<[],si64>
%517 = torch.operator "onnx.Shape"(%513) : (!torch.vtensor<[?,384,16,64],f32>) -> !torch.vtensor<[4],si64>
%518 = torch.operator "onnx.Constant"() {torch.onnx.value = dense_resource<__22> : tensor<si64>} : () -> !torch.vtensor<[],si64>
%519 = torch.operator "onnx.Gather"(%517, %518) {torch.onnx.axis = 0 : si64} : (!torch.vtensor<[4],si64>, !torch.vtensor<[],si64>) -> !torch.vtensor<[],si64>
%520 = torch.operator "onnx.Constant"() {torch.onnx.value = dense_resource<__23> : tensor<si64>} : () -> !torch.vtensor<[],si64>
%522 = torch.operator "onnx.Unsqueeze"(%516, %arg7) : (!torch.vtensor<[],si64>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[1],si64>
%524 = torch.operator "onnx.Unsqueeze"(%519, %arg7) : (!torch.vtensor<[],si64>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[1],si64>
%526 = torch.operator "onnx.Unsqueeze"(%520, %arg7) : (!torch.vtensor<[],si64>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[1],si64>
%527 = torch.operator "onnx.Concat"(%522, %524, %526) {torch.onnx.axis = 0 : si64} : (!torch.vtensor<[1],si64>, !torch.vtensor<[1],si64>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[3],si64>
%528 = torch.operator "onnx.Reshape"(%513, %527) : (!torch.vtensor<[?,384,16,64],f32>, !torch.vtensor<[3],si64>) -> !torch.vtensor<[?,384,1024],f32>
%530 = torch.operator "onnx.MatMul"(%528, %arg5) : (!torch.vtensor<[?,384,1024],f32>, !torch.vtensor<[1024,1024],f32>) -> !torch.vtensor<[?,384,1024],f32>
%531 = torch.operator "onnx.Add"(%530, %arg6) : (!torch.vtensor<[?,384,1024],f32>, !torch.vtensor<[1024],f32>) -> !torch.vtensor<[?,384,1024],f32>
return %531: !torch.vtensor<[?,384,1024],f32>
}
}
{-#
dialect_resources: {
builtin: {
__21: "0x080000000000000000000000",
__22: "0x080000000100000000000000",
__23: "0x080000000004000000000000"
}
}
#-}
I am getting the following error:
model.torch_onnx.mlir:3:12: error: 'memref.alloca' op expected no unbounded stack allocations
%512 = torch.operator "onnx.MatMul"(%arg4, %arg3) : (!torch.vtensor<[?,16,384,384],f32>, !torch.vtensor<[?,16,384,64],f32>) -> !torch.vtensor<[?,16,384,64],f32>
IR after failure:
// -----// IR Dump After LLVMCPUCheckIRBeforeLLVMConversionPass Failed (iree-llvmcpu-check-ir-before-llvm-conversion) //----- //
func.func @"torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack"() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
%cst = arith.constant dense<0.000000e+00> : vector<8xf32>
%c7 = arith.constant 7 : index
%c6 = arith.constant 6 : index
%c5 = arith.constant 5 : index
%c3 = arith.constant 3 : index
%c32_i64 = arith.constant 32 : i64
%c48 = arith.constant 48 : index
%c16 = arith.constant 16 : index
%c0 = arith.constant 0 : index
%c2 = arith.constant 2 : index
%c32 = arith.constant 32 : index
%c64 = arith.constant 64 : index
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
%alloca = memref.alloca() {alignment = 64 : i64} : memref<1x1x8x4xf32>
%0 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(0) : i32
%1 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(1) : i32
%2 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(2) : i32
%3 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(3) : i32
%4 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(4) : i32
%5 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(5) : i32
%6 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(6) : i32
%7 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(7) : i32
%8 = arith.extui %0 : i32 to i64
%9 = arith.extui %1 : i32 to i64
%10 = arith.shli %9, %c32_i64 : i64
%11 = arith.ori %8, %10 : i64
%12 = arith.index_castui %11 : i64 to index
%13 = arith.extui %2 : i32 to i64
%14 = arith.extui %3 : i32 to i64
%15 = arith.shli %14, %c32_i64 : i64
%16 = arith.ori %13, %15 : i64
%17 = arith.index_castui %16 : i64 to index
%18 = arith.extui %4 : i32 to i64
%19 = arith.extui %5 : i32 to i64
%20 = arith.shli %19, %c32_i64 : i64
%21 = arith.ori %18, %20 : i64
%22 = arith.index_castui %21 : i64 to index
%23 = arith.extui %6 : i32 to i64
%24 = arith.extui %7 : i32 to i64
%25 = arith.shli %24, %c32_i64 : i64
%26 = arith.ori %23, %25 : i64
%27 = arith.index_castui %26 : i64 to index
%28 = hal.interface.binding.subspan layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) set(0) binding(0) alignment(64) offset(%12) flags("ReadOnly|Indirect") : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>{%27}
memref.assume_alignment %28, 1 : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
%29 = hal.interface.binding.subspan layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) set(0) binding(1) alignment(64) offset(%17) flags(Indirect) : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>{%27}
memref.assume_alignment %29, 1 : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%30 = affine.apply affine_map<()[s0] -> (s0 floordiv 16)>()[%22]
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
%31 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_id_y]
%32 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_count_y]
%33 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_id_x]
%34 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_count_x]
%alloca_0 = memref.alloca(%30) {alignment = 64 : i64} : memref<?x2x32x64xf32>
scf.for %arg0 = %31 to %c48 step %32 {
scf.for %arg1 = %33 to %c16 step %34 {
%subview = memref.subview %29[0, %arg0, %arg1, 0, 0, 0] [%30, 4, 2, 64, 8, 1] [1, 1, 1, 1, 1, 1] : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>> to memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%subview_1 = memref.subview %28[0, %arg1, %arg0, 0, 0, 0] [%30, 2, 4, 16, 8, 4] [1, 1, 1, 1, 1, 1] : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>> to memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
scf.for %arg2 = %c0 to %30 step %c1 {
scf.for %arg3 = %c0 to %c2 step %c1 {
scf.for %arg4 = %c0 to %c32 step %c1 {
%35 = affine.apply affine_map<(d0) -> (d0 floordiv 8)>(%arg4)
%36 = affine.apply affine_map<(d0) -> (d0 mod 8)>(%arg4)
scf.for %arg5 = %c0 to %c64 step %c1 {
%37 = affine.apply affine_map<(d0) -> (d0 floordiv 4)>(%arg5)
%38 = affine.apply affine_map<(d0) -> (d0 mod 4)>(%arg5)
%39 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c0, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%40 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c1, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%41 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c2, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%42 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c3, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%43 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c4, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%44 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c5, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%45 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c6, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%46 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c7, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%subview_2 = memref.subview %alloca[0, 0, 0, 0] [1, 1, 8, 4] [1, 1, 1, 1] : memref<1x1x8x4xf32> to memref<8x4xf32>
vector.store %39, %subview_2[%c0, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %40, %subview_2[%c1, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %41, %subview_2[%c2, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %42, %subview_2[%c3, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %43, %subview_2[%c4, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %44, %subview_2[%c5, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %45, %subview_2[%c6, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %46, %subview_2[%c7, %c0] : memref<8x4xf32>, vector<4xf32>
%subview_3 = memref.subview %alloca[0, 0, %36, %38] [1, 1, 1, 1] [1, 1, 1, 1] : memref<1x1x8x4xf32> to memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
%subview_4 = memref.subview %alloca_0[%arg2, %arg3, %arg4, %arg5] [1, 1, 1, 1] [1, 1, 1, 1] : memref<?x2x32x64xf32> to memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
%47 = memref.load %subview_3[%c0, %c0, %c0, %c0] : memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
memref.store %47, %subview_4[%c0, %c0, %c0, %c0] : memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
}
}
}
}
scf.for %arg2 = %c0 to %30 step %c1 {
scf.for %arg3 = %c0 to %c4 step %c1 {
scf.for %arg4 = %c0 to %c2 step %c1 {
scf.for %arg5 = %c0 to %c64 step %c1 {
%35 = affine.apply affine_map<(d0) -> (d0 * 8)>(%arg3)
%36 = memref.load %alloca_0[%arg2, %arg4, %35, %arg5] : memref<?x2x32x64xf32>
%37 = vector.broadcast %36 : f32 to vector<1xf32>
%38 = affine.apply affine_map<(d0) -> (d0 * 8 + 1)>(%arg3)
%39 = memref.load %alloca_0[%arg2, %arg4, %38, %arg5] : memref<?x2x32x64xf32>
%40 = vector.broadcast %39 : f32 to vector<1xf32>
%41 = affine.apply affine_map<(d0) -> (d0 * 8 + 2)>(%arg3)
%42 = memref.load %alloca_0[%arg2, %arg4, %41, %arg5] : memref<?x2x32x64xf32>
%43 = vector.broadcast %42 : f32 to vector<1xf32>
%44 = affine.apply affine_map<(d0) -> (d0 * 8 + 3)>(%arg3)
%45 = memref.load %alloca_0[%arg2, %arg4, %44, %arg5] : memref<?x2x32x64xf32>
%46 = vector.broadcast %45 : f32 to vector<1xf32>
%47 = affine.apply affine_map<(d0) -> (d0 * 8 + 4)>(%arg3)
%48 = memref.load %alloca_0[%arg2, %arg4, %47, %arg5] : memref<?x2x32x64xf32>
%49 = vector.broadcast %48 : f32 to vector<1xf32>
%50 = affine.apply affine_map<(d0) -> (d0 * 8 + 5)>(%arg3)
%51 = memref.load %alloca_0[%arg2, %arg4, %50, %arg5] : memref<?x2x32x64xf32>
%52 = vector.broadcast %51 : f32 to vector<1xf32>
%53 = affine.apply affine_map<(d0) -> (d0 * 8 + 6)>(%arg3)
%54 = memref.load %alloca_0[%arg2, %arg4, %53, %arg5] : memref<?x2x32x64xf32>
%55 = vector.broadcast %54 : f32 to vector<1xf32>
%56 = affine.apply affine_map<(d0) -> (d0 * 8 + 7)>(%arg3)
%57 = memref.load %alloca_0[%arg2, %arg4, %56, %arg5] : memref<?x2x32x64xf32>
%58 = vector.broadcast %57 : f32 to vector<1xf32>
%59 = vector.insert_strided_slice %37, %cst {offsets = [0], strides = [1]} : vector<1xf32> into vector<8xf32>
%60 = vector.insert_strided_slice %40, %59 {offsets = [1], strides = [1]} : vector<1xf32> into vector<8xf32>
%61 = vector.insert_strided_slice %43, %60 {offsets = [2], strides = [1]} : vector<1xf32> into vector<8xf32>
%62 = vector.insert_strided_slice %46, %61 {offsets = [3], strides = [1]} : vector<1xf32> into vector<8xf32>
%63 = vector.insert_strided_slice %49, %62 {offsets = [4], strides = [1]} : vector<1xf32> into vector<8xf32>
%64 = vector.insert_strided_slice %52, %63 {offsets = [5], strides = [1]} : vector<1xf32> into vector<8xf32>
%65 = vector.insert_strided_slice %55, %64 {offsets = [6], strides = [1]} : vector<1xf32> into vector<8xf32>
%66 = vector.insert_strided_slice %58, %65 {offsets = [7], strides = [1]} : vector<1xf32> into vector<8xf32>
%subview_2 = memref.subview %subview[0, 0, 0, 0, 0, 0] [%30, 4, 2, 64, 8, 1] [1, 1, 1, 1, 1, 1] : memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>> to memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>
vector.store %66, %subview_2[%arg2, %arg3, %arg4, %arg5, %c0] : memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>, vector<8xf32>
}
}
}
}
}
}
return
}
// -----// IR Dump After TranslateTargetExecutableVariantsPass Failed (iree-hal-translate-target-executable-variants) //----- //
hal.executable.variant public @embedded_elf_x86_64 target(<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>) {
hal.executable.export public @"torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack" ordinal(0) layout(#hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>]} {
^bb0(%arg0: !hal.device, %arg1: index, %arg2: index, %arg3: index):
%c8 = arith.constant 8 : index
%c12 = arith.constant 12 : index
%c1 = arith.constant 1 : index
hal.return %c8, %c12, %c1 : index, index, index
}
builtin.module {
func.func @"torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack"() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
%cst = arith.constant dense<0.000000e+00> : vector<8xf32>
%c7 = arith.constant 7 : index
%c6 = arith.constant 6 : index
%c5 = arith.constant 5 : index
%c3 = arith.constant 3 : index
%c32_i64 = arith.constant 32 : i64
%c48 = arith.constant 48 : index
%c16 = arith.constant 16 : index
%c0 = arith.constant 0 : index
%c2 = arith.constant 2 : index
%c32 = arith.constant 32 : index
%c64 = arith.constant 64 : index
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
%alloca = memref.alloca() {alignment = 64 : i64} : memref<1x1x8x4xf32>
%0 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(0) : i32
%1 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(1) : i32
%2 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(2) : i32
%3 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(3) : i32
%4 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(4) : i32
%5 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(5) : i32
%6 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(6) : i32
%7 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(7) : i32
%8 = arith.extui %0 : i32 to i64
%9 = arith.extui %1 : i32 to i64
%10 = arith.shli %9, %c32_i64 : i64
%11 = arith.ori %8, %10 : i64
%12 = arith.index_castui %11 : i64 to index
%13 = arith.extui %2 : i32 to i64
%14 = arith.extui %3 : i32 to i64
%15 = arith.shli %14, %c32_i64 : i64
%16 = arith.ori %13, %15 : i64
%17 = arith.index_castui %16 : i64 to index
%18 = arith.extui %4 : i32 to i64
%19 = arith.extui %5 : i32 to i64
%20 = arith.shli %19, %c32_i64 : i64
%21 = arith.ori %18, %20 : i64
%22 = arith.index_castui %21 : i64 to index
%23 = arith.extui %6 : i32 to i64
%24 = arith.extui %7 : i32 to i64
%25 = arith.shli %24, %c32_i64 : i64
%26 = arith.ori %23, %25 : i64
%27 = arith.index_castui %26 : i64 to index
%28 = hal.interface.binding.subspan layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) set(0) binding(0) alignment(64) offset(%12) flags("ReadOnly|Indirect") : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>{%27}
memref.assume_alignment %28, 1 : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
%29 = hal.interface.binding.subspan layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) set(0) binding(1) alignment(64) offset(%17) flags(Indirect) : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>{%27}
memref.assume_alignment %29, 1 : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%30 = affine.apply affine_map<()[s0] -> (s0 floordiv 16)>()[%22]
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
%31 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_id_y]
%32 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_count_y]
%33 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_id_x]
%34 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_count_x]
%alloca_0 = memref.alloca(%30) {alignment = 64 : i64} : memref<?x2x32x64xf32>
scf.for %arg0 = %31 to %c48 step %32 {
scf.for %arg1 = %33 to %c16 step %34 {
%subview = memref.subview %29[0, %arg0, %arg1, 0, 0, 0] [%30, 4, 2, 64, 8, 1] [1, 1, 1, 1, 1, 1] : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>> to memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%subview_1 = memref.subview %28[0, %arg1, %arg0, 0, 0, 0] [%30, 2, 4, 16, 8, 4] [1, 1, 1, 1, 1, 1] : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>> to memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
scf.for %arg2 = %c0 to %30 step %c1 {
scf.for %arg3 = %c0 to %c2 step %c1 {
scf.for %arg4 = %c0 to %c32 step %c1 {
%35 = affine.apply affine_map<(d0) -> (d0 floordiv 8)>(%arg4)
%36 = affine.apply affine_map<(d0) -> (d0 mod 8)>(%arg4)
scf.for %arg5 = %c0 to %c64 step %c1 {
%37 = affine.apply affine_map<(d0) -> (d0 floordiv 4)>(%arg5)
%38 = affine.apply affine_map<(d0) -> (d0 mod 4)>(%arg5)
%39 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c0, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%40 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c1, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%41 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c2, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%42 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c3, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%43 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c4, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%44 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c5, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%45 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c6, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%46 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c7, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%subview_2 = memref.subview %alloca[0, 0, 0, 0] [1, 1, 8, 4] [1, 1, 1, 1] : memref<1x1x8x4xf32> to memref<8x4xf32>
vector.store %39, %subview_2[%c0, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %40, %subview_2[%c1, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %41, %subview_2[%c2, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %42, %subview_2[%c3, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %43, %subview_2[%c4, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %44, %subview_2[%c5, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %45, %subview_2[%c6, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %46, %subview_2[%c7, %c0] : memref<8x4xf32>, vector<4xf32>
%subview_3 = memref.subview %alloca[0, 0, %36, %38] [1, 1, 1, 1] [1, 1, 1, 1] : memref<1x1x8x4xf32> to memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
%subview_4 = memref.subview %alloca_0[%arg2, %arg3, %arg4, %arg5] [1, 1, 1, 1] [1, 1, 1, 1] : memref<?x2x32x64xf32> to memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
%47 = memref.load %subview_3[%c0, %c0, %c0, %c0] : memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
memref.store %47, %subview_4[%c0, %c0, %c0, %c0] : memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
}
}
}
}
scf.for %arg2 = %c0 to %30 step %c1 {
scf.for %arg3 = %c0 to %c4 step %c1 {
scf.for %arg4 = %c0 to %c2 step %c1 {
scf.for %arg5 = %c0 to %c64 step %c1 {
%35 = affine.apply affine_map<(d0) -> (d0 * 8)>(%arg3)
%36 = memref.load %alloca_0[%arg2, %arg4, %35, %arg5] : memref<?x2x32x64xf32>
%37 = vector.broadcast %36 : f32 to vector<1xf32>
%38 = affine.apply affine_map<(d0) -> (d0 * 8 + 1)>(%arg3)
%39 = memref.load %alloca_0[%arg2, %arg4, %38, %arg5] : memref<?x2x32x64xf32>
%40 = vector.broadcast %39 : f32 to vector<1xf32>
%41 = affine.apply affine_map<(d0) -> (d0 * 8 + 2)>(%arg3)
%42 = memref.load %alloca_0[%arg2, %arg4, %41, %arg5] : memref<?x2x32x64xf32>
%43 = vector.broadcast %42 : f32 to vector<1xf32>
%44 = affine.apply affine_map<(d0) -> (d0 * 8 + 3)>(%arg3)
%45 = memref.load %alloca_0[%arg2, %arg4, %44, %arg5] : memref<?x2x32x64xf32>
%46 = vector.broadcast %45 : f32 to vector<1xf32>
%47 = affine.apply affine_map<(d0) -> (d0 * 8 + 4)>(%arg3)
%48 = memref.load %alloca_0[%arg2, %arg4, %47, %arg5] : memref<?x2x32x64xf32>
%49 = vector.broadcast %48 : f32 to vector<1xf32>
%50 = affine.apply affine_map<(d0) -> (d0 * 8 + 5)>(%arg3)
%51 = memref.load %alloca_0[%arg2, %arg4, %50, %arg5] : memref<?x2x32x64xf32>
%52 = vector.broadcast %51 : f32 to vector<1xf32>
%53 = affine.apply affine_map<(d0) -> (d0 * 8 + 6)>(%arg3)
%54 = memref.load %alloca_0[%arg2, %arg4, %53, %arg5] : memref<?x2x32x64xf32>
%55 = vector.broadcast %54 : f32 to vector<1xf32>
%56 = affine.apply affine_map<(d0) -> (d0 * 8 + 7)>(%arg3)
%57 = memref.load %alloca_0[%arg2, %arg4, %56, %arg5] : memref<?x2x32x64xf32>
%58 = vector.broadcast %57 : f32 to vector<1xf32>
%59 = vector.insert_strided_slice %37, %cst {offsets = [0], strides = [1]} : vector<1xf32> into vector<8xf32>
%60 = vector.insert_strided_slice %40, %59 {offsets = [1], strides = [1]} : vector<1xf32> into vector<8xf32>
%61 = vector.insert_strided_slice %43, %60 {offsets = [2], strides = [1]} : vector<1xf32> into vector<8xf32>
%62 = vector.insert_strided_slice %46, %61 {offsets = [3], strides = [1]} : vector<1xf32> into vector<8xf32>
%63 = vector.insert_strided_slice %49, %62 {offsets = [4], strides = [1]} : vector<1xf32> into vector<8xf32>
%64 = vector.insert_strided_slice %52, %63 {offsets = [5], strides = [1]} : vector<1xf32> into vector<8xf32>
%65 = vector.insert_strided_slice %55, %64 {offsets = [6], strides = [1]} : vector<1xf32> into vector<8xf32>
%66 = vector.insert_strided_slice %58, %65 {offsets = [7], strides = [1]} : vector<1xf32> into vector<8xf32>
%subview_2 = memref.subview %subview[0, 0, 0, 0, 0, 0] [%30, 4, 2, 64, 8, 1] [1, 1, 1, 1, 1, 1] : memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>> to memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>
vector.store %66, %subview_2[%arg2, %arg3, %arg4, %arg5, %c0] : memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>, vector<8xf32>
}
}
}
}
}
}
return
}
}
}
failed to translate executables
// -----// IR Dump After TranslateExecutablesPass Failed (iree-hal-translate-executables) //----- //
hal.executable private @"torch-jit-export$async_dispatch_3" {
hal.executable.variant public @embedded_elf_x86_64 target(<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>) {
hal.executable.export public @"torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack" ordinal(0) layout(#hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>]} {
^bb0(%arg0: !hal.device, %arg1: index, %arg2: index, %arg3: index):
%c8 = arith.constant 8 : index
%c12 = arith.constant 12 : index
%c1 = arith.constant 1 : index
hal.return %c8, %c12, %c1 : index, index, index
}
builtin.module {
func.func @"torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack"() attributes {translation_info = #iree_codegen.translation_info<CPUDataTiling>} {
%cst = arith.constant dense<0.000000e+00> : vector<8xf32>
%c7 = arith.constant 7 : index
%c6 = arith.constant 6 : index
%c5 = arith.constant 5 : index
%c3 = arith.constant 3 : index
%c32_i64 = arith.constant 32 : i64
%c48 = arith.constant 48 : index
%c16 = arith.constant 16 : index
%c0 = arith.constant 0 : index
%c2 = arith.constant 2 : index
%c32 = arith.constant 32 : index
%c64 = arith.constant 64 : index
%c1 = arith.constant 1 : index
%c4 = arith.constant 4 : index
%alloca = memref.alloca() {alignment = 64 : i64} : memref<1x1x8x4xf32>
%0 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(0) : i32
%1 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(1) : i32
%2 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(2) : i32
%3 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(3) : i32
%4 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(4) : i32
%5 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(5) : i32
%6 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(6) : i32
%7 = hal.interface.constant.load layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) ordinal(7) : i32
%8 = arith.extui %0 : i32 to i64
%9 = arith.extui %1 : i32 to i64
%10 = arith.shli %9, %c32_i64 : i64
%11 = arith.ori %8, %10 : i64
%12 = arith.index_castui %11 : i64 to index
%13 = arith.extui %2 : i32 to i64
%14 = arith.extui %3 : i32 to i64
%15 = arith.shli %14, %c32_i64 : i64
%16 = arith.ori %13, %15 : i64
%17 = arith.index_castui %16 : i64 to index
%18 = arith.extui %4 : i32 to i64
%19 = arith.extui %5 : i32 to i64
%20 = arith.shli %19, %c32_i64 : i64
%21 = arith.ori %18, %20 : i64
%22 = arith.index_castui %21 : i64 to index
%23 = arith.extui %6 : i32 to i64
%24 = arith.extui %7 : i32 to i64
%25 = arith.shli %24, %c32_i64 : i64
%26 = arith.ori %23, %25 : i64
%27 = arith.index_castui %26 : i64 to index
%28 = hal.interface.binding.subspan layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) set(0) binding(0) alignment(64) offset(%12) flags("ReadOnly|Indirect") : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>{%27}
memref.assume_alignment %28, 1 : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
%29 = hal.interface.binding.subspan layout(<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>) set(0) binding(1) alignment(64) offset(%17) flags(Indirect) : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>{%27}
memref.assume_alignment %29, 1 : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%30 = affine.apply affine_map<()[s0] -> (s0 floordiv 16)>()[%22]
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
%31 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_id_y]
%32 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_count_y]
%33 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_id_x]
%34 = affine.apply affine_map<()[s0] -> (s0 * 2)>()[%workgroup_count_x]
%alloca_0 = memref.alloca(%30) {alignment = 64 : i64} : memref<?x2x32x64xf32>
scf.for %arg0 = %31 to %c48 step %32 {
scf.for %arg1 = %33 to %c16 step %34 {
%subview = memref.subview %29[0, %arg0, %arg1, 0, 0, 0] [%30, 4, 2, 64, 8, 1] [1, 1, 1, 1, 1, 1] : memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>> to memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%subview_1 = memref.subview %28[0, %arg1, %arg0, 0, 0, 0] [%30, 2, 4, 16, 8, 4] [1, 1, 1, 1, 1, 1] : memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>> to memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
scf.for %arg2 = %c0 to %30 step %c1 {
scf.for %arg3 = %c0 to %c2 step %c1 {
scf.for %arg4 = %c0 to %c32 step %c1 {
%35 = affine.apply affine_map<(d0) -> (d0 floordiv 8)>(%arg4)
%36 = affine.apply affine_map<(d0) -> (d0 mod 8)>(%arg4)
scf.for %arg5 = %c0 to %c64 step %c1 {
%37 = affine.apply affine_map<(d0) -> (d0 floordiv 4)>(%arg5)
%38 = affine.apply affine_map<(d0) -> (d0 mod 4)>(%arg5)
%39 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c0, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%40 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c1, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%41 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c2, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%42 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c3, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%43 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c4, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%44 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c5, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%45 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c6, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%46 = vector.load %subview_1[%arg2, %arg3, %35, %37, %c7, %c0] : memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, vector<4xf32>
%subview_2 = memref.subview %alloca[0, 0, 0, 0] [1, 1, 8, 4] [1, 1, 1, 1] : memref<1x1x8x4xf32> to memref<8x4xf32>
vector.store %39, %subview_2[%c0, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %40, %subview_2[%c1, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %41, %subview_2[%c2, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %42, %subview_2[%c3, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %43, %subview_2[%c4, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %44, %subview_2[%c5, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %45, %subview_2[%c6, %c0] : memref<8x4xf32>, vector<4xf32>
vector.store %46, %subview_2[%c7, %c0] : memref<8x4xf32>, vector<4xf32>
%subview_3 = memref.subview %alloca[0, 0, %36, %38] [1, 1, 1, 1] [1, 1, 1, 1] : memref<1x1x8x4xf32> to memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
%subview_4 = memref.subview %alloca_0[%arg2, %arg3, %arg4, %arg5] [1, 1, 1, 1] [1, 1, 1, 1] : memref<?x2x32x64xf32> to memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
%47 = memref.load %subview_3[%c0, %c0, %c0, %c0] : memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
memref.store %47, %subview_4[%c0, %c0, %c0, %c0] : memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
}
}
}
}
scf.for %arg2 = %c0 to %30 step %c1 {
scf.for %arg3 = %c0 to %c4 step %c1 {
scf.for %arg4 = %c0 to %c2 step %c1 {
scf.for %arg5 = %c0 to %c64 step %c1 {
%35 = affine.apply affine_map<(d0) -> (d0 * 8)>(%arg3)
%36 = memref.load %alloca_0[%arg2, %arg4, %35, %arg5] : memref<?x2x32x64xf32>
%37 = vector.broadcast %36 : f32 to vector<1xf32>
%38 = affine.apply affine_map<(d0) -> (d0 * 8 + 1)>(%arg3)
%39 = memref.load %alloca_0[%arg2, %arg4, %38, %arg5] : memref<?x2x32x64xf32>
%40 = vector.broadcast %39 : f32 to vector<1xf32>
%41 = affine.apply affine_map<(d0) -> (d0 * 8 + 2)>(%arg3)
%42 = memref.load %alloca_0[%arg2, %arg4, %41, %arg5] : memref<?x2x32x64xf32>
%43 = vector.broadcast %42 : f32 to vector<1xf32>
%44 = affine.apply affine_map<(d0) -> (d0 * 8 + 3)>(%arg3)
%45 = memref.load %alloca_0[%arg2, %arg4, %44, %arg5] : memref<?x2x32x64xf32>
%46 = vector.broadcast %45 : f32 to vector<1xf32>
%47 = affine.apply affine_map<(d0) -> (d0 * 8 + 4)>(%arg3)
%48 = memref.load %alloca_0[%arg2, %arg4, %47, %arg5] : memref<?x2x32x64xf32>
%49 = vector.broadcast %48 : f32 to vector<1xf32>
%50 = affine.apply affine_map<(d0) -> (d0 * 8 + 5)>(%arg3)
%51 = memref.load %alloca_0[%arg2, %arg4, %50, %arg5] : memref<?x2x32x64xf32>
%52 = vector.broadcast %51 : f32 to vector<1xf32>
%53 = affine.apply affine_map<(d0) -> (d0 * 8 + 6)>(%arg3)
%54 = memref.load %alloca_0[%arg2, %arg4, %53, %arg5] : memref<?x2x32x64xf32>
%55 = vector.broadcast %54 : f32 to vector<1xf32>
%56 = affine.apply affine_map<(d0) -> (d0 * 8 + 7)>(%arg3)
%57 = memref.load %alloca_0[%arg2, %arg4, %56, %arg5] : memref<?x2x32x64xf32>
%58 = vector.broadcast %57 : f32 to vector<1xf32>
%59 = vector.insert_strided_slice %37, %cst {offsets = [0], strides = [1]} : vector<1xf32> into vector<8xf32>
%60 = vector.insert_strided_slice %40, %59 {offsets = [1], strides = [1]} : vector<1xf32> into vector<8xf32>
%61 = vector.insert_strided_slice %43, %60 {offsets = [2], strides = [1]} : vector<1xf32> into vector<8xf32>
%62 = vector.insert_strided_slice %46, %61 {offsets = [3], strides = [1]} : vector<1xf32> into vector<8xf32>
%63 = vector.insert_strided_slice %49, %62 {offsets = [4], strides = [1]} : vector<1xf32> into vector<8xf32>
%64 = vector.insert_strided_slice %52, %63 {offsets = [5], strides = [1]} : vector<1xf32> into vector<8xf32>
%65 = vector.insert_strided_slice %55, %64 {offsets = [6], strides = [1]} : vector<1xf32> into vector<8xf32>
%66 = vector.insert_strided_slice %58, %65 {offsets = [7], strides = [1]} : vector<1xf32> into vector<8xf32>
%subview_2 = memref.subview %subview[0, 0, 0, 0, 0, 0] [%30, 4, 2, 64, 8, 1] [1, 1, 1, 1, 1, 1] : memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>> to memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>
vector.store %66, %subview_2[%arg2, %arg3, %arg4, %arg5, %c0] : memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>, vector<8xf32>
}
}
}
}
}
}
return
}
}
}
}
model.torch_onnx.mlir:3:12: error: 'memref.alloca' op expected no unbounded stack allocations
%512 = torch.operator "onnx.MatMul"(%arg4, %arg3) : (!torch.vtensor<[?,16,384,384],f32>, !torch.vtensor<[?,16,384,64],f32>) -> !torch.vtensor<[?,16,384,64],f32>
^
model.torch_onnx.mlir:3:12: note: see current operation: %54 = "memref.alloca"(%45) <{alignment = 64 : i64, operandSegmentSizes = array<i32: 1, 0>}> : (index) -> memref<?x2x32x64xf32>
model.torch_onnx.mlir:17:12: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
%530 = torch.operator "onnx.MatMul"(%528, %arg5) : (!torch.vtensor<[?,384,1024],f32>, !torch.vtensor<[1024,1024],f32>) -> !torch.vtensor<[?,384,1024],f32>
^
model.torch_onnx.mlir:17:12: note: see current operation:
"hal.executable.variant"() ({
"hal.executable.export"() ({
^bb0(%arg10: !hal.device, %arg11: index, %arg12: index, %arg13: index):
%106 = "arith.constant"() <{value = 8 : index}> : () -> index
%107 = "arith.constant"() <{value = 12 : index}> : () -> index
%108 = "arith.constant"() <{value = 1 : index}> : () -> index
"hal.return"(%106, %107, %108) : (index, index, index) -> ()
}) {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>], layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 0 : index, sym_name = "torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack"} : () -> ()
"builtin.module"() ({
"func.func"() <{function_type = () -> (), sym_name = "torch-jit-export$async_dispatch_3_unpack_transpose_Dx384x16x64_f32_pack"}> ({
%0 = "arith.constant"() <{value = dense<0.000000e+00> : vector<8xf32>}> : () -> vector<8xf32>
%1 = "arith.constant"() <{value = 7 : index}> : () -> index
%2 = "arith.constant"() <{value = 6 : index}> : () -> index
%3 = "arith.constant"() <{value = 5 : index}> : () -> index
%4 = "arith.constant"() <{value = 3 : index}> : () -> index
%5 = "arith.constant"() <{value = 32 : i64}> : () -> i64
%6 = "arith.constant"() <{value = 48 : index}> : () -> index
%7 = "arith.constant"() <{value = 16 : index}> : () -> index
%8 = "arith.constant"() <{value = 0 : index}> : () -> index
%9 = "arith.constant"() <{value = 2 : index}> : () -> index
%10 = "arith.constant"() <{value = 32 : index}> : () -> index
%11 = "arith.constant"() <{value = 64 : index}> : () -> index
%12 = "arith.constant"() <{value = 1 : index}> : () -> index
%13 = "arith.constant"() <{value = 4 : index}> : () -> index
%14 = "memref.alloca"() <{alignment = 64 : i64, operandSegmentSizes = array<i32: 0, 0>}> : () -> memref<1x1x8x4xf32>
%15 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 0 : index} : () -> i32
%16 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 1 : index} : () -> i32
%17 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 2 : index} : () -> i32
%18 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 3 : index} : () -> i32
%19 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 4 : index} : () -> i32
%20 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 5 : index} : () -> i32
%21 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 6 : index} : () -> i32
%22 = "hal.interface.constant.load"() {layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, ordinal = 7 : index} : () -> i32
%23 = "arith.extui"(%15) : (i32) -> i64
%24 = "arith.extui"(%16) : (i32) -> i64
%25 = "arith.shli"(%24, %5) <{overflowFlags = #arith.overflow<none>}> : (i64, i64) -> i64
%26 = "arith.ori"(%23, %25) : (i64, i64) -> i64
%27 = "arith.index_castui"(%26) : (i64) -> index
%28 = "arith.extui"(%17) : (i32) -> i64
%29 = "arith.extui"(%18) : (i32) -> i64
%30 = "arith.shli"(%29, %5) <{overflowFlags = #arith.overflow<none>}> : (i64, i64) -> i64
%31 = "arith.ori"(%28, %30) : (i64, i64) -> i64
%32 = "arith.index_castui"(%31) : (i64) -> index
%33 = "arith.extui"(%19) : (i32) -> i64
%34 = "arith.extui"(%20) : (i32) -> i64
%35 = "arith.shli"(%34, %5) <{overflowFlags = #arith.overflow<none>}> : (i64, i64) -> i64
%36 = "arith.ori"(%33, %35) : (i64, i64) -> i64
%37 = "arith.index_castui"(%36) : (i64) -> index
%38 = "arith.extui"(%21) : (i32) -> i64
%39 = "arith.extui"(%22) : (i32) -> i64
%40 = "arith.shli"(%39, %5) <{overflowFlags = #arith.overflow<none>}> : (i64, i64) -> i64
%41 = "arith.ori"(%38, %40) : (i64, i64) -> i64
%42 = "arith.index_castui"(%41) : (i64) -> index
%43 = "hal.interface.binding.subspan"(%27, %42) {alignment = 64 : index, binding = 0 : index, descriptor_flags = 3 : i32, layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, operandSegmentSizes = array<i32: 1, 1>, set = 0 : index} : (index, index) -> memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
"memref.assume_alignment"(%43) <{alignment = 1 : i32}> : (memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>) -> ()
%44 = "hal.interface.binding.subspan"(%32, %42) {alignment = 64 : index, binding = 1 : index, descriptor_flags = 2 : i32, layout = #hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, Indirect>], flags = Indirect>]>, operandSegmentSizes = array<i32: 1, 1>, set = 0 : index} : (index, index) -> memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
"memref.assume_alignment"(%44) <{alignment = 1 : i32}> : (memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>) -> ()
%45 = "affine.apply"(%37) <{map = affine_map<()[s0] -> (s0 floordiv 16)>}> : (index) -> index
%46 = "hal.interface.workgroup.id"() {dimension = 0 : index} : () -> index
%47 = "hal.interface.workgroup.count"() {dimension = 0 : index} : () -> index
%48 = "hal.interface.workgroup.id"() {dimension = 1 : index} : () -> index
%49 = "hal.interface.workgroup.count"() {dimension = 1 : index} : () -> index
%50 = "affine.apply"(%48) <{map = affine_map<()[s0] -> (s0 * 4)>}> : (index) -> index
%51 = "affine.apply"(%49) <{map = affine_map<()[s0] -> (s0 * 4)>}> : (index) -> index
%52 = "affine.apply"(%46) <{map = affine_map<()[s0] -> (s0 * 2)>}> : (index) -> index
%53 = "affine.apply"(%47) <{map = affine_map<()[s0] -> (s0 * 2)>}> : (index) -> index
%54 = "memref.alloca"(%45) <{alignment = 64 : i64, operandSegmentSizes = array<i32: 1, 0>}> : (index) -> memref<?x2x32x64xf32>
"scf.for"(%50, %6, %51) ({
^bb0(%arg0: index):
"scf.for"(%52, %7, %53) ({
^bb0(%arg1: index):
%55 = "memref.subview"(%44, %arg0, %arg1, %45) <{operandSegmentSizes = array<i32: 1, 2, 1, 0>, static_offsets = array<i64: 0, -9223372036854775808, -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: -9223372036854775808, 4, 2, 64, 8, 1>, static_strides = array<i64: 1, 1, 1, 1, 1, 1>}> : (memref<?x48x16x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>, index, index, index) -> memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>
%56 = "memref.subview"(%43, %arg1, %arg0, %45) <{operandSegmentSizes = array<i32: 1, 2, 1, 0>, static_offsets = array<i64: 0, -9223372036854775808, -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: -9223372036854775808, 2, 4, 16, 8, 4>, static_strides = array<i64: 1, 1, 1, 1, 1, 1>}> : (memref<?x16x48x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index) -> memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>
"scf.for"(%8, %45, %12) ({
^bb0(%arg6: index):
"scf.for"(%8, %9, %12) ({
^bb0(%arg7: index):
"scf.for"(%8, %10, %12) ({
^bb0(%arg8: index):
%90 = "affine.apply"(%arg8) <{map = affine_map<(d0) -> (d0 floordiv 8)>}> : (index) -> index
%91 = "affine.apply"(%arg8) <{map = affine_map<(d0) -> (d0 mod 8)>}> : (index) -> index
"scf.for"(%8, %11, %12) ({
^bb0(%arg9: index):
%92 = "affine.apply"(%arg9) <{map = affine_map<(d0) -> (d0 floordiv 4)>}> : (index) -> index
%93 = "affine.apply"(%arg9) <{map = affine_map<(d0) -> (d0 mod 4)>}> : (index) -> index
%94 = "vector.load"(%56, %arg6, %arg7, %90, %92, %8, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%95 = "vector.load"(%56, %arg6, %arg7, %90, %92, %12, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%96 = "vector.load"(%56, %arg6, %arg7, %90, %92, %9, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%97 = "vector.load"(%56, %arg6, %arg7, %90, %92, %4, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%98 = "vector.load"(%56, %arg6, %arg7, %90, %92, %13, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%99 = "vector.load"(%56, %arg6, %arg7, %90, %92, %3, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%100 = "vector.load"(%56, %arg6, %arg7, %90, %92, %2, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%101 = "vector.load"(%56, %arg6, %arg7, %90, %92, %1, %8) <{nontemporal = false}> : (memref<?x2x4x16x8x4xf32, strided<[393216, 24576, 512, 32, 4, 1], offset: ?>>, index, index, index, index, index, index) -> vector<4xf32>
%102 = "memref.subview"(%14) <{operandSegmentSizes = array<i32: 1, 0, 0, 0>, static_offsets = array<i64: 0, 0, 0, 0>, static_sizes = array<i64: 1, 1, 8, 4>, static_strides = array<i64: 1, 1, 1, 1>}> : (memref<1x1x8x4xf32>) -> memref<8x4xf32>
"vector.store"(%94, %102, %8, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%95, %102, %12, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%96, %102, %9, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%97, %102, %4, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%98, %102, %13, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%99, %102, %3, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%100, %102, %2, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
"vector.store"(%101, %102, %1, %8) <{nontemporal = false}> : (vector<4xf32>, memref<8x4xf32>, index, index) -> ()
%103 = "memref.subview"(%14, %91, %93) <{operandSegmentSizes = array<i32: 1, 2, 0, 0>, static_offsets = array<i64: 0, 0, -9223372036854775808, -9223372036854775808>, static_sizes = array<i64: 1, 1, 1, 1>, static_strides = array<i64: 1, 1, 1, 1>}> : (memref<1x1x8x4xf32>, index, index) -> memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>
%104 = "memref.subview"(%54, %arg6, %arg7, %arg8, %arg9) <{operandSegmentSizes = array<i32: 1, 4, 0, 0>, static_offsets = array<i64: -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808>, static_sizes = array<i64: 1, 1, 1, 1>, static_strides = array<i64: 1, 1, 1, 1>}> : (memref<?x2x32x64xf32>, index, index, index, index) -> memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>
%105 = "memref.load"(%103, %8, %8, %8, %8) <{nontemporal = false}> : (memref<1x1x1x1xf32, strided<[32, 32, 4, 1], offset: ?>>, index, index, index, index) -> f32
"memref.store"(%105, %104, %8, %8, %8, %8) <{nontemporal = false}> : (f32, memref<1x1x1x1xf32, strided<[4096, 2048, 64, 1], offset: ?>>, index, index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.for"(%8, %45, %12) ({
^bb0(%arg2: index):
"scf.for"(%8, %13, %12) ({
^bb0(%arg3: index):
"scf.for"(%8, %9, %12) ({
^bb0(%arg4: index):
"scf.for"(%8, %11, %12) ({
^bb0(%arg5: index):
%57 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8)>}> : (index) -> index
%58 = "memref.load"(%54, %arg2, %arg4, %57, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%59 = "vector.broadcast"(%58) : (f32) -> vector<1xf32>
%60 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 1)>}> : (index) -> index
%61 = "memref.load"(%54, %arg2, %arg4, %60, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%62 = "vector.broadcast"(%61) : (f32) -> vector<1xf32>
%63 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 2)>}> : (index) -> index
%64 = "memref.load"(%54, %arg2, %arg4, %63, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%65 = "vector.broadcast"(%64) : (f32) -> vector<1xf32>
%66 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 3)>}> : (index) -> index
%67 = "memref.load"(%54, %arg2, %arg4, %66, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%68 = "vector.broadcast"(%67) : (f32) -> vector<1xf32>
%69 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 4)>}> : (index) -> index
%70 = "memref.load"(%54, %arg2, %arg4, %69, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%71 = "vector.broadcast"(%70) : (f32) -> vector<1xf32>
%72 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 5)>}> : (index) -> index
%73 = "memref.load"(%54, %arg2, %arg4, %72, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%74 = "vector.broadcast"(%73) : (f32) -> vector<1xf32>
%75 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 6)>}> : (index) -> index
%76 = "memref.load"(%54, %arg2, %arg4, %75, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%77 = "vector.broadcast"(%76) : (f32) -> vector<1xf32>
%78 = "affine.apply"(%arg3) <{map = affine_map<(d0) -> (d0 * 8 + 7)>}> : (index) -> index
%79 = "memref.load"(%54, %arg2, %arg4, %78, %arg5) <{nontemporal = false}> : (memref<?x2x32x64xf32>, index, index, index, index) -> f32
%80 = "vector.broadcast"(%79) : (f32) -> vector<1xf32>
%81 = "vector.insert_strided_slice"(%59, %0) <{offsets = [0], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%82 = "vector.insert_strided_slice"(%62, %81) <{offsets = [1], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%83 = "vector.insert_strided_slice"(%65, %82) <{offsets = [2], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%84 = "vector.insert_strided_slice"(%68, %83) <{offsets = [3], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%85 = "vector.insert_strided_slice"(%71, %84) <{offsets = [4], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%86 = "vector.insert_strided_slice"(%74, %85) <{offsets = [5], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%87 = "vector.insert_strided_slice"(%77, %86) <{offsets = [6], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%88 = "vector.insert_strided_slice"(%80, %87) <{offsets = [7], strides = [1]}> : (vector<1xf32>, vector<8xf32>) -> vector<8xf32>
%89 = "memref.subview"(%55, %45) <{operandSegmentSizes = array<i32: 1, 0, 1, 0>, static_offsets = array<i64: 0, 0, 0, 0, 0, 0>, static_sizes = array<i64: -9223372036854775808, 4, 2, 64, 8, 1>, static_strides = array<i64: 1, 1, 1, 1, 1, 1>}> : (memref<?x4x2x64x8x1xf32, strided<[393216, 8192, 512, 8, 1, 1], offset: ?>>, index) -> memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>
"vector.store"(%88, %89, %arg2, %arg3, %arg4, %arg5, %8) <{nontemporal = false}> : (vector<8xf32>, memref<?x4x2x64x8xf32, strided<[393216, 8192, 512, 8, 1], offset: ?>>, index, index, index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
"func.return"() : () -> ()
}) {translation_info = #iree_codegen.translation_info<CPUDataTiling>} : () -> ()
}) : () -> ()
"hal.executable.variant_end"() : () -> ()
}) {sym_name = "embedded_elf_x86_64", target = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>} : () -> ()
Steps to reproduce your issue
Command to reproduce the issue:
iree-compile model.torch_onnx.mlir --iree-hal-target-backends=llvm-cpu --iree-input-demote-i64-to-i32
IREE version: IREE compiler version 20240819.990 @ aeda14995f16ed1302db616adf0c03acf80f27ee LLVM version 20.0.0git
What component(s) does this issue relate to?
Compiler
Version information
No response
Additional context
No response
@lialan if you are looking at this, please attach the IR dump generated with --mlir-print-ir-after-all. I can then redirect appropriately.
Problematic memref.alloca:
%54 = "memref.alloca"(%45) <{alignment = 64 : i64, operandSegmentSizes = array<i32: 1, 0>}> : (index) -> memref<?x2x32x64xf32>
It contains a dynamic shape. The dynamic size %45 was defined as:
%45 = "affine.apply"(%37) <{map = affine_map<()[s0] -> (s0 floordiv 16)>}> : (index) -> index
Again, the value cannot be simplified to a static bound, so the check for bounded stack allocations fails.
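For reference, an abbreviated excerpt of that chain from the dump above (elided lines marked with ...): the size ultimately derives from push constants loaded via hal.interface.constant.load, so it has no compile-time upper bound.
%19 = "hal.interface.constant.load"() {..., ordinal = 4 : index} : () -> i32
...
%37 = "arith.index_castui"(%36) : (i64) -> index
...
%45 = "affine.apply"(%37) <{map = affine_map<()[s0] -> (s0 floordiv 16)>}> : (index) -> index
%54 = "memref.alloca"(%45) <{alignment = 64 : i64, operandSegmentSizes = array<i32: 1, 0>}> : (index) -> memref<?x2x32x64xf32>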
@MaheshRavishankar the IR is attached in @pdhirajkumarprasad's comment; just grep for the keywords on this page.
That's not enough to see what is going on. I see the alloca is dynamic, but I need to see the IR dump after all passes to really understand. For all such bug reports it will be easier for me to redirect if I can just get the dump with --mlir-print-ir-after-all --mlir-print-ir-before-all --mlir-disable-threading --mlir-elide-elementsattrs-if-larger=4.
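For example, combining the reproduce command above with those flags (redirecting stderr to capture the dump; ir_dump.txt is just an example output name) would look roughly like:
iree-compile model.torch_onnx.mlir --iree-hal-target-backends=llvm-cpu --iree-input-demote-i64-to-i32 --mlir-print-ir-after-all --mlir-print-ir-before-all --mlir-disable-threading --mlir-elide-elementsattrs-if-larger=4 2> ir_dump.txt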
@MaheshRavishankar see attached: 18297.mlir.txt
Seems like we are missing a folder for unpack -> pack:
%34 = affine.apply affine_map<()[s0] -> (s0 floordiv 16)>()[%32]
%35 = tensor.empty(%34) : tensor<?x16x384x64xf32>
%unpack = tensor.unpack %33 outer_dims_perm = [0, 1, 2, 3] inner_dims_pos = [2, 3] inner_tiles = [8, 4] into %35 {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 4, 2, 64], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]>} : tensor<?x16x48x16x8x4xf32> -> tensor<?x16x384x64xf32>
%36 = tensor.empty(%34) : tensor<?x48x16x64x8x1xf32>
%pack = tensor.pack %unpack outer_dims_perm = [0, 2, 1, 3] inner_dims_pos = [2, 3] inner_tiles = [8, 1] into %36 {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 4, 2, 64], [1, 1, 1, 1]]>} : tensor<?x16x384x64xf32> -> tensor<?x48x16x64x8x1xf32>
This is just a transpose AFAICS.
@pashu123 could you fix this?
Confirmed the issue is gone, along with #18296, on the latest main branch.
Moving this out of the CPU project board.
@pashu123 we still need to fuse pack+unpack, per @MaheshRavishankar.
Is this failing again on main? The last check was that this isn't failing on main.
I verified this is working on main. We can close this and create a new issue for the pack+unpack folding.
Verified, the issue doesn't exist in the latest build.