mlir-aie
mlir-aie copied to clipboard
Request for two variants of Matmul microkernel
The matmul_vectorized microkernel seems to be overwriting the output buffer as zero.
If this is indeed expected from this microkernel then we would need another microkernel which doesn't perform this overwriting.
For a 2048x2048x2048xbf16
(MxNxK) matmul, I've attached the IR state before convert-scf-to-cf
- IR state.
Take a note at the following :-
scf.for %arg0 = %c0 to %c16 step %c1 {
scf.for %arg1 = %c0 to %c16 step %c1 {
scf.for %arg2 = %c0 to %c4 step %c1 {
scf.for %arg3 = %c0 to %c4 step %c1 {
memref.store %cst, %buf19[%arg0, %arg1, %arg2, %arg3] : memref<16x16x4x4xbf16, 2 : i32>
}
}
}
}
scf.for %arg0 = %c0 to %c256 step %c16 {
. . .
%base_buffer, %offset, %sizes:4, %strides:4 = memref.extract_strided_metadata %buf16 : memref<8x16x4x8xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
%base_buffer_29, %offset_30, %sizes_31:4, %strides_32:4 = memref.extract_strided_metadata %buf15 : memref<16x8x8x4xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
%base_buffer_33, %offset_34, %sizes_35:4, %strides_36:4 = memref.extract_strided_metadata %buf19 : memref<16x16x4x4xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
func.call @matmul_bf16_bf16(%base_buffer, %c0, %base_buffer_29, %c0, %base_buffer_33, %c0) : (memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index) -> ()
In the above IR we have %cst = arith.constant 7.000000e+00 : bf16
, and during runtime if LHS is initialized with 2
and RHS with 3
, we should have 6 * 2048 + 7 = 12295
as output - but we get 12288
instead.
Attaching steps to repro
Summarizing the requirement :-
- We need to have both microkernels - one that enforces zero initialization and one which would use the init value of output buffer. The invoker of this API should then be able to switch between the two by a boolean flag.
Can someone who has worked on developing the .cc
microkernel please take this up?
CC: @MaheshRavishankar @erwei-xilinx @stephenneuendorffer
Thanks for creating the issue. Took the liberty to ping some experts on this field. @jackl-xilinx , @denolf , @fifield , @jgmelber .
I was able to reproduce the issue outside of IREE using input.mlir and a c++ testbench, but after looking closer I think there is no bug. I think the problem is that 12295
and 12288
are the same number in bfloat16. For this size integer in bf16 the least significant mantissa bit will represent 64
, so the next representable integer after 12288
is 12352
, and 12288 + 7 = 12288
. Can you try putting in a bigger constant? Or some numbers between 0 and 1 as LHS and RHS?
Hey @fifield !
Thank you so much for taking a loot at this! Indeed. Your findings are absolutely correct and makes sense. I tried with a bigger number (100
) and the result was 122388
.
This sorts us with having a ukernel that doesn't enforce zero filling.
I take it that for having the other variant - the one which enforces zero filling - I can just copy paste this zero filling code at the beginning of the current variant OR invoke that function from the current variant and be done with it.
Let me know your thoughts.
Closing this as it stands solved - we would use the zero filling kernel as it is and would be using the current ukernel of matmul.