
OpenBLAS sgemm is slower for small-size matrices on aarch64

Open akote123 opened this issue 1 year ago • 16 comments

I have built OpenBLAS on Graviton3E with `make USE_OPENMP=1 NUM_THREADS=256 TARGET=NEOVERSEV1`. MKL was built on an Ice Lake machine.

I have called OpenBLAS sgemm as `cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K, 1.0, A, K, B, N, 0.0, C, N);`.

When performance timings are compared with Intel MKL for the smaller matmul sizes, aarch64 is slower.

[Figure: OpenBLAS vs. MKL timing comparison]

These are the different shapes I have checked and their timings.

akote123 avatar Mar 26 '24 09:03 akote123

OpenBLAS does not currently provide dedicated GEMM kernels for "small" matrix sizes on ARM64, and may be switching to multithreading too early. (Also, I'm not sure if MKL might be employing GEMV here for a 1-by-N matrix; that is certainly a special case that OpenBLAS does not try to exploit.)
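One way to check the threading part would be to force a single thread and compare against the default, e.g. with a minimal sketch like this (assuming a threaded build that exports `openblas_set_num_threads`):

```c
#include <cblas.h>

/* Re-declared here in case the header in use does not expose it;
   threaded OpenBLAS builds export this symbol. */
extern void openblas_set_num_threads(int num_threads);

/* Time this path against the default-thread run to see whether early
   multithreading is what hurts the small shapes. */
void run_single_threaded(int M, int N, int K,
                         const float *A, const float *B, float *C)
{
    openblas_set_num_threads(1);   /* force the single-threaded path */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);
}
```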

martin-frbg avatar Mar 26 '24 11:03 martin-frbg

@martin-frbg, thank you. Is there a plan to improve GEMM kernels for small matrix sizes on ARM?

akote123 avatar Mar 26 '24 11:03 akote123

General plans to improve "everything" but no ETA - this project does not have much in the way of a permanent team behind it at present, so progress tends to be a bit unpredictable, often driven by outside contributions.

martin-frbg avatar Mar 26 '24 12:03 martin-frbg

It would be interesting to see how the respective GEMV equivalents perform in this particular case.

brada4 avatar Mar 28 '24 21:03 brada4

Two other options might be interesting:

libxsmm, which is a library specifically built to address small-matrix multiplication, including narrow/tall matrices. However, only the 1x512 * 512x512 case fits within their recommended size of $(MNK)^{1/3} \le 64$; it still may be worth a try (a quick check of this heuristic for the shapes in this thread is sketched below). I have seen this library perform well on Neoverse V1 (Graviton3).

Arm Performance Libraries has many BLAS functions specifically targeted and optimized for aarch64. I'm curious if they perform better for this small matrix test case.
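As a quick sanity check of the $(MNK)^{1/3}$ heuristic against the shapes discussed in this thread (illustration only):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Shapes mentioned in this thread: 1x512 * 512x512 and m=1, n=512, k=2048. */
    const double shapes[][3] = { {1, 512, 512}, {1, 512, 2048} };

    for (int i = 0; i < 2; i++) {
        double m = shapes[i][0], n = shapes[i][1], k = shapes[i][2];
        /* libxsmm's recommended small-kernel range is roughly (M*N*K)^(1/3) <= 64. */
        printf("M=%.0f N=%.0f K=%.0f -> (MNK)^(1/3) = %.1f\n",
               m, n, k, cbrt(m * n * k));
    }
    return 0;
}
```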

lrbison avatar May 07 '24 13:05 lrbison

GEMM with one dimension equal to 1 can be cast down to GEMV. The question is whether those libraries use that trick.
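For illustration, the M = 1 case from this thread maps onto `cblas_sgemv` roughly like this (a sketch only, reusing the row-major layout from the first comment; not benchmarked):

```c
#include <cblas.h>

/* With M == 1, C (1 x N) = A (1 x K) * B (K x N) is just a matrix-vector
   product: C^T = B^T * A^T. The two calls below should produce the same C. */
void small_m_gemm(int N, int K, const float *A, const float *B, float *C)
{
    /* GEMM form, as used in the original report (M = 1): */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                1, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

    /* Equivalent GEMV form: y = B^T * x, with x = A's single row. */
    cblas_sgemv(CblasRowMajor, CblasTrans, K, N,
                1.0f, B, N, A, 1, 0.0f, C, 1);
}
```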

brada4 avatar May 07 '24 18:05 brada4

@lrbison, thank you. I have checked libxsmm for batchsize = 1, m = 1, n = 512, k = 2048 and got 919 us on Graviton3 and 887 us on Ice Lake.

akote123 avatar May 08 '24 03:05 akote123

@akote123 Hm, I tried to reproduce, but I got different results. I'm using OpenBLAS 0.3.26 as compiled by spack:

openblas@0.3.26%gcc@…~bignuma~consistent_fpcsr+dynamic_dispatch+fortran~ilp64+locking+pic+shared build_system=makefile symbol_suffix=none threads=openmp arch=linux-ubuntu20.04-neoverse_v1

I've also tested the threads=none variant. For testing, I did not use cblas_sgemm, but instead sgemm_ directly, and stored my matrices column-major, i.e. my call was:

sgemm_("n", "n", &m, &n, &k, &one, a, &m, b, &k, &zero, c, &m)
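(For completeness, a self-contained sketch of that call; the plain-`int` prototype assumes an LP64 build.)

```c
/* Direct Fortran-interface call, column-major storage. */
extern void sgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const float *alpha, const float *a, const int *lda,
                   const float *b, const int *ldb,
                   const float *beta, float *c, const int *ldc);

void call_sgemm(int m, int n, int k, const float *a, const float *b, float *c)
{
    const float one = 1.0f, zero = 0.0f;
    /* Column-major leading dimensions: lda = m, ldb = k, ldc = m. */
    sgemm_("n", "n", &m, &n, &k, &one, a, &m, b, &k, &zero, c, &m);
}
```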

The results are dramatically different from yours. While I have not tried transposing my matrices, I suspect there is something more going on. This was run on c7g.8xlarge (32 cores).

[Figure: sgemm timing results on c7g.8xlarge]

lrbison avatar May 16 '24 04:05 lrbison

m=1 is anomalous, as it is the equivalent of GEMV (the first "matrix" is actually a vector).

brada4 avatar May 16 '24 06:05 brada4

@brada4 You are right of course, but I didn't get time to add that to my test case last night. I've got new data this morning:

[Figure: updated timing results, including the GEMV comparison]

Additionally, I just checked ArmPL, and it seems they catch this special case and call into sgemv, since their timings are nearly identical in both cases and very similar to the OpenBLAS sgemv times as well.

lrbison avatar May 16 '24 15:05 lrbison

Thank you very much - I do wonder what version akote123 is/was using, as timings consistently getting worse when going from 1 to n threads for fairly large problem sizes is a bit unexpected.

martin-frbg avatar May 16 '24 15:05 martin-frbg

I have used OpenBLAS 0.3.26. @lrbison, I haven't set OMP_NUM_THREADS; for core pinning I have used taskset. I have used the code below to benchmark, and the timings were taken on c7gn.8xlarge.

```c
clock_t start_t = clock();   /* start time captured just before the loop */
for (i = 0; i < 100; i++) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0, A, K, B, N, 0.0, C, N);
}
double time_avg = (double)(clock() - start_t) / CLOCKS_PER_SEC / 100;
fprintf(stdout, "%lf\n", time_avg);
```



akote123 avatar May 16 '24 15:05 akote123

@akote123 I believe the issue is that you are using clock(), but clock() measures CPU time, not wall-clock time. That means each thread is adding ticks in parallel.

See https://stackoverflow.com/questions/2962785/c-using-clock-to-measure-time-in-multi-threaded-programs
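A minimal wall-clock alternative (sketch, assuming POSIX `clock_gettime` is available):

```c
#include <time.h>

/* Wall-clock timer, so multi-threaded runs are not over-counted. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Usage around the existing loop:
     double t0 = wall_seconds();
     ... 100 cblas_sgemm calls ...
     double time_avg = (wall_seconds() - t0) / 100;
*/
```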

lrbison avatar May 16 '24 16:05 lrbison

@martin-frbg, has OpenBLAS considered calling into gemv from gemm in these kinds of special cases? If I tinkered around to do so, would you consider accepting a PR, or is it just not worth it?

lrbison avatar May 20 '24 19:05 lrbison

The topic has come up a few times in the past, e.g. #528, and I have just created a rough draft of the fairly trivial change to add in interface/gemm.c. But if you have written something already in parallel with me, please do post your PR.

martin-frbg avatar May 20 '24 20:05 martin-frbg

Uploaded what I currently have as #4708 - bound to be some embarrassing coding errors in there still.

martin-frbg avatar May 20 '24 20:05 martin-frbg