OpenBLAS
OpenBLAS sgemm is slower for small matrices on aarch64
I built OpenBLAS on a Graviton3E machine with make USE_OPENMP=1 NUM_THREADS=256 TARGET=NEOVERSEV1. MKL was built on an Ice Lake machine.
I call OpenBLAS sgemm as follows:
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K, 1.0, A, K, B, N, 0.0, C, N);
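For clarity on what this call computes: with CblasRowMajor and no transposes, A is M x K with leading dimension K, B is K x N with leading dimension N, and C is M x N with leading dimension N, so the result is C = 1.0 * A * B with the old contents of C ignored (beta = 0.0). Below is a minimal sketch of those semantics as a naive reference loop. The name ref_sgemm_rowmajor is a hypothetical helper, not part of OpenBLAS or MKL; it is only meant for checking results on small shapes and says nothing about performance.

```c
#include <assert.h>
#include <stddef.h>

/* Naive row-major reference for C = alpha * A * B + beta * C (no transposes).
 * A is M x K with leading dimension lda, B is K x N with leading dimension ldb,
 * C is M x N with leading dimension ldc.
 * Hypothetical helper for illustration; not an OpenBLAS or MKL function. */
void ref_sgemm_rowmajor(int M, int N, int K, float alpha,
                        const float *A, int lda,
                        const float *B, int ldb,
                        float beta, float *C, int ldc)
{
    /* In row-major layout each leading dimension must cover a full row. */
    assert(lda >= K && ldb >= N && ldc >= N);

    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[(size_t)i * lda + k] * B[(size_t)k * ldb + j];
            }
            float *c = &C[(size_t)i * ldc + j];
            /* With beta == 0 the old value of C is overwritten, not read. */
            *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * (*c);
        }
    }
}
```

Calling ref_sgemm_rowmajor(M, N, K, 1.0f, A, K, B, N, 0.0f, C, N) mirrors the argument order of the cblas_sgemm call above, minus the layout and transpose flags.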
When the timings are compared with Intel MKL for the smaller matmul sizes, aarch64 is slower.
These are the different shapes I have checked and their timings.