mlir-aie icon indicating copy to clipboard operation
mlir-aie copied to clipboard

Request for two variants of Matmul microkernel

Open Abhishek-Varma opened this issue 10 months ago • 3 comments

The matmul_vectorized microkernel seems to be overwriting the output buffer as zero.

If this is indeed expected from this microkernel then we would need another microkernel which doesn't perform this overwriting.

For a 2048x2048x2048xbf16 (MxNxK) matmul, I've attached the IR state before convert-scf-to-cf - IR state. Take a note at the following :-

scf.for %arg0 = %c0 to %c16 step %c1 {
  scf.for %arg1 = %c0 to %c16 step %c1 {
    scf.for %arg2 = %c0 to %c4 step %c1 {
      scf.for %arg3 = %c0 to %c4 step %c1 {
        memref.store %cst, %buf19[%arg0, %arg1, %arg2, %arg3] : memref<16x16x4x4xbf16, 2 : i32>
      }
    }
  }
}
scf.for %arg0 = %c0 to %c256 step %c16 {
  . . .
  %base_buffer, %offset, %sizes:4, %strides:4 = memref.extract_strided_metadata %buf16 : memref<8x16x4x8xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
  %base_buffer_29, %offset_30, %sizes_31:4, %strides_32:4 = memref.extract_strided_metadata %buf15 : memref<16x8x8x4xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
  %base_buffer_33, %offset_34, %sizes_35:4, %strides_36:4 = memref.extract_strided_metadata %buf19 : memref<16x16x4x4xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
  func.call @matmul_bf16_bf16(%base_buffer, %c0, %base_buffer_29, %c0, %base_buffer_33, %c0) : (memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index) -> ()     

In the above IR we have %cst = arith.constant 7.000000e+00 : bf16, and during runtime if LHS is initialized with 2 and RHS with 3, we should have 6 * 2048 + 7 = 12295 as output - but we get 12288 instead.

Attaching steps to repro

Summarizing the requirement :-

  1. We need to have both microkernels - one that enforces zero initialization and one which would use the init value of output buffer. The invoker of this API should then be able to switch between the two by a boolean flag.

Can someone who has worked on developing the .cc microkernel please take this up?

CC: @MaheshRavishankar @erwei-xilinx @stephenneuendorffer

Abhishek-Varma avatar Mar 26 '24 10:03 Abhishek-Varma

Thanks for creating the issue. Took the liberty to ping some experts on this field. @jackl-xilinx , @denolf , @fifield , @jgmelber .

erwei-xilinx avatar Mar 26 '24 14:03 erwei-xilinx

I was able to reproduce the issue outside of IREE using input.mlir and a c++ testbench, but after looking closer I think there is no bug. I think the problem is that 12295 and 12288 are the same number in bfloat16. For this size integer in bf16 the least significant mantissa bit will represent 64, so the next representable integer after 12288 is 12352, and 12288 + 7 = 12288. Can you try putting in a bigger constant? Or some numbers between 0 and 1 as LHS and RHS?

fifield avatar Apr 03 '24 15:04 fifield

Hey @fifield !

Thank you so much for taking a loot at this! Indeed. Your findings are absolutely correct and makes sense. I tried with a bigger number (100) and the result was 122388.

This sorts us with having a ukernel that doesn't enforce zero filling.

I take it that for having the other variant - the one which enforces zero filling - I can just copy paste this zero filling code at the beginning of the current variant OR invoke that function from the current variant and be done with it.

Let me know your thoughts.

Abhishek-Varma avatar Apr 08 '24 15:04 Abhishek-Varma

Closing this as it stands solved - we would use the zero filling kernel as it is and would be using the current ukernel of matmul.

Abhishek-Varma avatar May 10 '24 16:05 Abhishek-Varma