oneDNN
Fuse sum post op acl matmul
Description
Fuse the sum post op in acl matmul by setting the accumulate flag to true in `arm_compute::GEMMInfo`. This speeds up the post op and avoids allocating a temporary dst-sized tensor.
For example,

```
OMP_NUM_THREADS=16 ./benchdnn --mode=p --matmul --attr-post-ops=sum,sum+relu:0 1024x256:256x2048
```

is ~12% faster and uses 1.2kB less peak memory.
We also appended `_for_sum` to the `use_dst_acc` flag to stop it being confused with the `dst_acc` used for transposing.
We also changed the way we handle fused eltwise (as well as the new sum) to fix segfaults when binary post ops follow fused ops.
This PR includes the commit from #1889 because it builds on top of it.
Checklist
General
- [x] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
- [x] Have you formatted the code using clang-format?
Performance improvements
- [x] Have you submitted performance data that demonstrates performance improvements?
Please don't merge this yet; we've identified a small issue and a fix is on the way.
Fixed the failure; happy for this to be merged.