Lower vector.contract to vector.outerproduct.
Implements the lowering of the vector contraction op to vector.outerproduct ops wrapped inside an scf.for loop, with iter_args accumulating the result of each outer product across the K dimension. The idea is to exploit AVX features to generate optimal vector code. vector.outerproduct gets lowered to FMAs in x86 assembly through the LLVM intrinsic "llvm.intr.fmuladd".
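As a rough illustration of the intended semantics, here is a NumPy sketch (not the actual pass; the function name is made up): the contraction C += A x B becomes one rank-1 outer-product update per step of the reduction dimension, mirroring the scf.for with iter_args.

```python
import numpy as np

def contract_as_outer_products(A, B, C):
    # One np.outer per step of the K (reduction) dimension; the running
    # accumulator plays the role of the scf.for iter_args value.
    acc = C.copy()
    for k in range(A.shape[1]):
        acc += np.outer(A[:, k], B[k, :])  # rank-1 update, maps to FMAs
    return acc
```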
I see a large diff in the results after applying this lowering.
Test example:
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  func.func @entry(%arg0: tensor<16x16xf32>, %arg1: tensor<16x16xf32>, %arg2: tensor<16x16xf32>) -> tensor<16x16xf32> {
    %c0 = arith.constant 0 : index
    %cst = arith.constant 0.000000e+00 : f32
    %0 = vector.transfer_read %arg0[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<16x16xf32>, vector<16x16xf32>
    %1 = vector.transfer_read %arg1[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<16x16xf32>, vector<16x16xf32>
    %2 = vector.transfer_read %arg2[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<16x16xf32>, vector<16x16xf32>
    %3 = vector.contract {indexing_maps = [#map, #map1, #map2],
                          iterator_types = ["parallel", "parallel", "reduction"],
                          kind = #vector.kind<add>} %0, %1, %2
      : vector<16x16xf32>, vector<16x16xf32> into vector<16x16xf32>
    %4 = vector.transfer_write %3, %arg2[%c0, %c0] {in_bounds = [true, true]} : vector<16x16xf32>, tensor<16x16xf32>
    return %4 : tensor<16x16xf32>
  }
}
Baseline:
tpp-opt ../test.mlir | tpp-run -e entry --entry-point-result=void -seed 123 -print
After rewrite:
tpp-opt ../test.mlir --vector-contract-to-outerproduct | tpp-run -e entry --entry-point-result=void -seed 123 -print
I see a large diff in the results after applying this lowering.
Ah okay, spotted the issue.
After vector.contract lowering:
%1 = scf.for %arg3 = %c0 to %c2 step %c1 iter_args(%arg4 = %0) -> (vector<4x64xf32>) {
  %3 = vector.transfer_read %arg0[%c0, %arg3], %cst : tensor<4x2xf32>, vector<4xf32>
  %4 = vector.transfer_read %arg1[%arg3, %c0], %cst {in_bounds = [true]} : tensor<2x64xf32>, vector<64xf32>
  %5 = vector.outerproduct %3, %4, %arg4 {kind = #vector.kind<add>} : vector<4xf32>, vector<64xf32>
  scf.yield %5 : vector<4x64xf32>
}
To make the read from A
%3 = vector.transfer_read %arg0[%c0, %arg3], %cst : tensor<4x2xf32>, vector<4xf32>
a column load, a permutation map needs to be added, like:
%3 = vector.transfer_read %arg0[%c0, %arg3], %cst {permutation_map = affine_map<(d0, d1) -> (d0)>} : tensor<4x2xf32>, vector<4xf32>
Otherwise, it just reads contiguous elements.
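For what it's worth, here is a small NumPy sketch of that difference on a row-major tensor<4x2xf32> (illustrative only, values chosen for clarity):

```python
import numpy as np

A = np.arange(8, dtype=np.float32).reshape(4, 2)  # row-major, like tensor<4x2xf32>
k = 1

# With permutation_map = (d0, d1) -> (d0): a strided load of the k-th column.
column = A[:, k]                 # A[0,k], A[1,k], A[2,k], A[3,k]

# Without the map, the read just takes 4 contiguous elements in memory
# starting at A[0, k], crossing row boundaries.
contiguous = A.ravel()[k:k + 4]

print(column, contiguous)        # different values, hence the large diff
```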
Thanks, I assumed it could infer the map using indices.