
OpenBLAS sgemm is slower for small-size matrices on aarch64

Open akote123 opened this issue 1 year ago • 16 comments

I have built OpenBLAS on Graviton3E with `make USE_OPENMP=1 NUM_THREADS=256 TARGET=NEOVERSEV1`. MKL was built on an Ice Lake machine.

I have called OpenBLAS sgemm as `cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K, 1.0, A, K, B, N, 0.0, C, N);`.

When performance timings are compared with Intel MKL for the smaller matmul sizes, aarch64 is slower.

[Figure: OpenBLAS vs. MKL timing comparison]

These are the different shapes I have checked and their timings.

akote123 avatar Mar 26 '24 09:03 akote123

OpenBLAS does not currently provide dedicated GEMM kernels for "small" matrix sizes on ARM64, and may be switching to multithreading too early. (Also, I'm not sure if MKL might be employing GEMV here for a 1-by-N matrix; that is certainly a special case that OpenBLAS does not try to exploit.)
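One way to check the threading part would be to force a single thread and compare against the default, e.g. with a minimal sketch like this (assuming a threaded build that exports `openblas_set_num_threads`):

```c
#include <cblas.h>

/* Re-declared here in case the header in use does not expose it;
   threaded OpenBLAS builds export this symbol. */
extern void openblas_set_num_threads(int num_threads);

/* Time this path against the default-thread run to see whether early
   multithreading is what hurts the small shapes. */
void run_single_threaded(int M, int N, int K,
                         const float *A, const float *B, float *C)
{
    openblas_set_num_threads(1);   /* force the single-threaded path */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);
}
```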

martin-frbg avatar Mar 26 '24 11:03 martin-frbg

@martin-frbg, thank you. Is there a plan to improve GEMM kernels for small matrix sizes on ARM?

akote123 avatar Mar 26 '24 11:03 akote123

General plans to improve "everything" but no ETA - this project does not have much in the way of a permanent team behind it at present, so progress tends to be a bit unpredictable, often driven by outside contributions.

martin-frbg avatar Mar 26 '24 12:03 martin-frbg

It would be interesting to see how the respective GEMV equivalents perform in this particular case.

brada4 avatar Mar 28 '24 21:03 brada4

Two other options might be interesting:

libxsmm, which is a library specifically built to address small-matrix multiplication, including narrow/tall matrices. However, only the 1x512 * 512x512 case fits within their recommended size of $(MNK)^{1/3} \le 64$; it still may be worth a try (a quick check of this heuristic for the shapes in this thread is sketched below). I have seen this library perform well on Neoverse V1 (Graviton3).

Arm Performance Libraries has many BLAS functions specifically targeted and optimized for aarch64. I'm curious if they perform better for this small matrix test case.
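As a quick sanity check of the $(MNK)^{1/3}$ heuristic against the shapes discussed in this thread (illustration only):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Shapes mentioned in this thread: 1x512 * 512x512 and m=1, n=512, k=2048. */
    const double shapes[][3] = { {1, 512, 512}, {1, 512, 2048} };

    for (int i = 0; i < 2; i++) {
        double m = shapes[i][0], n = shapes[i][1], k = shapes[i][2];
        /* libxsmm's recommended small-kernel range is roughly (M*N*K)^(1/3) <= 64. */
        printf("M=%.0f N=%.0f K=%.0f -> (MNK)^(1/3) = %.1f\n",
               m, n, k, cbrt(m * n * k));
    }
    return 0;
}
```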

lrbison avatar May 07 '24 13:05 lrbison

GEMM with one dimension equal to 1 can be cast down to GEMV. The question is whether those libraries use that trick.
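For illustration, the M = 1 case from this thread maps onto `cblas_sgemv` roughly like this (a sketch only, reusing the row-major layout from the first comment; not benchmarked):

```c
#include <cblas.h>

/* With M == 1, C (1 x N) = A (1 x K) * B (K x N) is just a matrix-vector
   product: C^T = B^T * A^T. The two calls below should produce the same C. */
void small_m_gemm(int N, int K, const float *A, const float *B, float *C)
{
    /* GEMM form, as used in the original report (M = 1): */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                1, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

    /* Equivalent GEMV form: y = B^T * x, with x = A's single row. */
    cblas_sgemv(CblasRowMajor, CblasTrans, K, N,
                1.0f, B, N, A, 1, 0.0f, C, 1);
}
```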

brada4 avatar May 07 '24 18:05 brada4

@lrbison, thank you. I have checked libxsmm for batchsize = 1, m = 1, n = 512, k = 2048 and got 919 us on Graviton3 and 887 us on Ice Lake.

akote123 avatar May 08 '24 03:05 akote123

@akote123 Hm, I tried to reproduce, but I got different results. I'm using OpenBLAS 0.3.26 as compiled by spack:

openblas@0.3.26%gcc@…~bignuma~consistent_fpcsr+dynamic_dispatch+fortran~ilp64+locking+pic+shared build_system=makefile symbol_suffix=none threads=openmp arch=linux-ubuntu20.04-neoverse_v1

I've also tested the threads=none variant. For testing, I did not use cblas_sgemm, but instead sgemm_ directly, and stored my matrices column-major, i.e. my call was:

sgemm_("n", "n", &m, &n, &k, &one, a, &m, b, &k, &zero, c, &m)
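(For completeness, a self-contained sketch of that call; the plain-`int` prototype assumes an LP64 build.)

```c
/* Direct Fortran-interface call, column-major storage. */
extern void sgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const float *alpha, const float *a, const int *lda,
                   const float *b, const int *ldb,
                   const float *beta, float *c, const int *ldc);

void call_sgemm(int m, int n, int k, const float *a, const float *b, float *c)
{
    const float one = 1.0f, zero = 0.0f;
    /* Column-major leading dimensions: lda = m, ldb = k, ldc = m. */
    sgemm_("n", "n", &m, &n, &k, &one, a, &m, b, &k, &zero, c, &m);
}
```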

The results are dramatically different from yours. While I have not tried transposing my matrices, I suspect there is something more going on. This was run on c7g.8xlarge (32 cores).

[Figure: sgemm timing results on c7g.8xlarge]

lrbison avatar May 16 '24 04:05 lrbison

m=1 is anomalous, as it is the equivalent of GEMV (the first "matrix" is actually a vector).

brada4 avatar May 16 '24 06:05 brada4

@brada4 You are right of course, but I didn't get time to add that to my test case last night. I've got new data this morning:

[Figure: updated timing results, including the GEMV comparison]

Additionally, I just checked ArmPL, and it seems they catch this special case and call into sgemv, since their timings are nearly identical in both cases and very similar to the OpenBLAS sgemv times as well.

lrbison avatar May 16 '24 15:05 lrbison

Thank you very much - I do wonder what version akote123 is/was using, as timings consistently getting worse when going from 1 to n threads for fairly large problem sizes is a bit unexpected.

martin-frbg avatar May 16 '24 15:05 martin-frbg

I have used OpenBLAS 0.3.26. @lrbison, I haven't set OMP_NUM_THREADS; for core pinning I have used taskset. I have used the code below to benchmark, and the timings were taken on c7gn.8xlarge.

```c
clock_t start_t = clock();   /* start time captured just before the loop */
for (i = 0; i < 100; i++) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0, A, K, B, N, 0.0, C, N);
}
double time_avg = (double)(clock() - start_t) / CLOCKS_PER_SEC / 100;
fprintf(stdout, "%lf\n", time_avg);
```



akote123 avatar May 16 '24 15:05 akote123

@akote123 I believe the issue is that you are using clock(), but clock() measures CPU time, not wall-clock time. That means each thread is adding ticks in parallel.

See https://stackoverflow.com/questions/2962785/c-using-clock-to-measure-time-in-multi-threaded-programs
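A minimal wall-clock alternative (sketch, assuming POSIX `clock_gettime` is available):

```c
#include <time.h>

/* Wall-clock timer, so multi-threaded runs are not over-counted. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Usage around the existing loop:
     double t0 = wall_seconds();
     ... 100 cblas_sgemm calls ...
     double time_avg = (wall_seconds() - t0) / 100;
*/
```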

lrbison avatar May 16 '24 16:05 lrbison

@martin-frbg, has OpenBLAS considered calling into gemv from gemm in these kinds of special cases? If I tinkered around to do so, would you consider accepting a PR, or is it just not worth it?

lrbison avatar May 20 '24 19:05 lrbison

The topic has come up a few times in the past, e.g. #528, and I have just created a rough draft of the fairly trivial change to add in interface/gemm.c. But if you have written something already in parallel with me, please do post your PR.

martin-frbg avatar May 20 '24 20:05 martin-frbg

Uploaded what I currently have as #4708 - bound to be some embarrassing coding errors in there still.

martin-frbg avatar May 20 '24 20:05 martin-frbg