Lower vector.contract to vector.outerproduct.
Implements the lowering of the vector contraction op to vector.outerproduct ops wrapped inside an scf.for loop, with iter_args accumulating the result of each outer product across the K dimension. The idea is to exploit AVX features to generate optimal vector code. vector.outerproduct gets lowered to FMAs in x86 assembly through the LLVM intrinsic "llvm.intr.fmuladd".
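As a rough illustration of the intended semantics, here is a NumPy sketch (not the actual pass; the function name is made up): the contraction C += A x B becomes one rank-1 outer-product update per step of the reduction dimension, mirroring the scf.for with iter_args.

```python
import numpy as np

def contract_as_outer_products(A, B, C):
    # One np.outer per step of the K (reduction) dimension; the running
    # accumulator plays the role of the scf.for iter_args value.
    acc = C.copy()
    for k in range(A.shape[1]):
        acc += np.outer(A[:, k], B[k, :])  # rank-1 update, maps to FMAs
    return acc
```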
I see a large diff in the results after applying this lowering.
Test example:
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  func.func @entry(%arg0: tensor<16x16xf32>, %arg1: tensor<16x16xf32>, %arg2: tensor<16x16xf32>) -> tensor<16x16xf32> {
    %c0 = arith.constant 0 : index
    %cst = arith.constant 0.000000e+00 : f32
    %0 = vector.transfer_read %arg0[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<16x16xf32>, vector<16x16xf32>
    %1 = vector.transfer_read %arg1[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<16x16xf32>, vector<16x16xf32>
    %2 = vector.transfer_read %arg2[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<16x16xf32>, vector<16x16xf32>
    %3 = vector.contract {indexing_maps = [#map, #map1, #map2],
                          iterator_types = ["parallel", "parallel", "reduction"],
                          kind = #vector.kind<add>} %0, %1, %2
      : vector<16x16xf32>, vector<16x16xf32> into vector<16x16xf32>
    %4 = vector.transfer_write %3, %arg2[%c0, %c0] {in_bounds = [true, true]} : vector<16x16xf32>, tensor<16x16xf32>
    return %4 : tensor<16x16xf32>
  }
}
Baseline:
tpp-opt ../test.mlir | tpp-run -e entry --entry-point-result=void -seed 123 -print
After rewrite:
tpp-opt ../test.mlir --vector-contract-to-outerproduct | tpp-run -e entry --entry-point-result=void -seed 123 -print
I see a large diff in the results after applying this lowering.
Ah okay, spotted the issue.
After vector.contract lowering:
%1 = scf.for %arg3 = %c0 to %c2 step %c1 iter_args(%arg4 = %0) -> (vector<4x64xf32>) {
  %3 = vector.transfer_read %arg0[%c0, %arg3], %cst : tensor<4x2xf32>, vector<4xf32>
  %4 = vector.transfer_read %arg1[%arg3, %c0], %cst {in_bounds = [true]} : tensor<2x64xf32>, vector<64xf32>
  %5 = vector.outerproduct %3, %4, %arg4 {kind = #vector.kind<add>} : vector<4xf32>, vector<64xf32>
  scf.yield %5 : vector<4x64xf32>
}
To make the read from A
%3 = vector.transfer_read %arg0[%c0, %arg3], %cst : tensor<4x2xf32>, vector<4xf32>
a column load, a permutation map needs to be added, like:
%3 = vector.transfer_read %arg0[%c0, %arg3], %cst {permutation_map = affine_map<(d0, d1) -> (d0)>} : tensor<4x2xf32>, vector<4xf32>
Otherwise, it just reads contiguous elements.
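For what it's worth, here is a small NumPy sketch of that difference on a row-major tensor<4x2xf32> (illustrative only, values chosen for clarity):

```python
import numpy as np

A = np.arange(8, dtype=np.float32).reshape(4, 2)  # row-major, like tensor<4x2xf32>
k = 1

# With permutation_map = (d0, d1) -> (d0): a strided load of the k-th column.
column = A[:, k]                 # A[0,k], A[1,k], A[2,k], A[3,k]

# Without the map, the read just takes 4 contiguous elements in memory
# starting at A[0, k], crossing row boundaries.
contiguous = A.ravel()[k:k + 4]

print(column, contiguous)        # different values, hence the large diff
```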
Thanks, I assumed it could infer the map using indices.