tpp-mlir

Lower vector.contract to vector.outerproduct.

Open shahidact opened this issue 1 year ago • 3 comments

Implements the lowering of vector.contract to vector.outerproduct wrapped inside an scf.for loop, with iter_args accumulating the result of each outer product along the K dimension. The idea is to exploit AVX to generate optimal vector code: vector.outerproduct lowers to x86 FMA instructions through the LLVM intrinsic "llvm.intr.fmuladd".
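The K-loop of accumulated outer products is mathematically the same computation as the matrix product. A minimal NumPy sketch of the decomposition (illustrative only, not the pass itself; shapes are arbitrary):

```python
import numpy as np

# Hypothetical shapes; the pass works on whatever the contraction carries.
M, N, K = 4, 64, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((M, K)).astype(np.float32)
B = rng.standard_normal((K, N)).astype(np.float32)
C = rng.standard_normal((M, N)).astype(np.float32)

# scf.for over K with iter_args: each iteration adds one rank-1 update,
# mirroring vector.outerproduct with kind = add.
acc = C
for k in range(K):
    acc = acc + np.outer(A[:, k], B[k, :])  # each update maps to FMAs on x86

assert np.allclose(acc, A @ B + C, atol=1e-4)
```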

shahidact avatar Sep 17 '24 13:09 shahidact

I see large diff in results after applying this lowering.

Test example:

#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  func.func @entry(%arg0: tensor<16x16xf32>, %arg1: tensor<16x16xf32>, %arg2: tensor<16x16xf32>) -> tensor<16x16xf32> {
    %c0 = arith.constant 0 : index
    %cst = arith.constant 0.000000e+00 : f32
    %0 = vector.transfer_read %arg0[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<16x16xf32>, vector<16x16xf32>
    %1 = vector.transfer_read %arg1[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<16x16xf32>, vector<16x16xf32>
    %2 = vector.transfer_read %arg2[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<16x16xf32>, vector<16x16xf32>
    %3 = vector.contract {indexing_maps = [#map, #map1, #map2],
      iterator_types = ["parallel", "parallel", "reduction"],
      kind = #vector.kind<add>} %0, %1, %2
      : vector<16x16xf32>, vector<16x16xf32> into vector<16x16xf32>
    %4 = vector.transfer_write %3, %arg2[%c0, %c0] {in_bounds = [true, true]} : vector<16x16xf32>, tensor<16x16xf32>
    return %4 : tensor<16x16xf32>
  }
}
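For reference, the indexing maps together with the ["parallel", "parallel", "reduction"] iterators and kind = add make this contraction a plain GEMM update on the accumulator. A small NumPy model of the expected semantics (16x16 shapes from the example; values are placeholders, not tpp-run's seeded data):

```python
import numpy as np

rng = np.random.default_rng(123)
A = rng.standard_normal((16, 16)).astype(np.float32)
B = rng.standard_normal((16, 16)).astype(np.float32)
C = rng.standard_normal((16, 16)).astype(np.float32)

# #map/#map1/#map2 encode: result[d0, d1] = C[d0, d1] + sum_k A[d0, k] * B[k, d1]
result = C + np.einsum("ik,kj->ij", A, B)
assert np.allclose(result, C + A @ B, atol=1e-4)
```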

Baseline:
tpp-opt ../test.mlir | tpp-run -e entry --entry-point-result=void -seed 123 -print

After rewrite:
tpp-opt ../test.mlir --vector-contract-to-outerproduct | tpp-run -e entry --entry-point-result=void -seed 123 -print

adam-smnk avatar Sep 18 '24 13:09 adam-smnk

I see large diff in results after applying this lowering.

Ah okay spotted the issue. After vector.contract lowering:

%1 = scf.for %arg3 = %c0 to %c2 step %c1 iter_args(%arg4 = %0) -> (vector<4x64xf32>) {
  %3 = vector.transfer_read %arg0[%c0, %arg3], %cst : tensor<4x2xf32>, vector<4xf32>
  %4 = vector.transfer_read %arg1[%arg3, %c0], %cst {in_bounds = [true]} : tensor<2x64xf32>, vector<64xf32>
  %5 = vector.outerproduct %3, %4, %arg4 {kind = #vector.kind<add>} : vector<4xf32>, vector<64xf32>
  scf.yield %5 : vector<4x64xf32>
}

To make the read from A

%3 = vector.transfer_read %arg0[%c0, %arg3], %cst : tensor<4x2xf32>, vector<4xf32>

a column load, a permutation map needs to be added:

%3 = vector.transfer_read %arg0[%c0, %arg3], %cst {permutation_map = affine_map<(d0, d1) -> (d0)>} : tensor<4x2xf32>, vector<4xf32>

Otherwise, it just reads contiguous elements.
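To see why the missing permutation_map corrupts the result: with the default minor-identity map, the 1-D read starting at (0, k) walks the innermost dimension of the row-major tensor (padding out-of-bounds positions with %cst) instead of striding down column k. A NumPy sketch of the two access patterns, using the 4x2 shape from the snippet above:

```python
import numpy as np

M, K = 4, 2
A = np.arange(1, M * K + 1, dtype=np.float32).reshape(M, K)  # row-major
k = 0
cst = 0.0

# With permutation_map (d0, d1) -> (d0): a strided column load of A[:, k].
col = A[:, k]

# Default map: contiguous read along the last dim from (0, k), with
# positions past the row boundary padded by the %cst value.
row = np.array([A[0, j] if j < K else cst for j in range(k, k + M)],
               dtype=np.float32)

assert not np.array_equal(col, row)  # different elements -> wrong outer products
```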

adam-smnk avatar Sep 18 '24 13:09 adam-smnk


Thanks, I assumed it could infer the map using indices.

shahidact avatar Sep 18 '24 13:09 shahidact