Setting optimized `[SD]GEMM_DEFAULT_[PQR]` parameters for `A64FX`

Open hideaki-motoki opened this issue 1 month ago • 2 comments

Resolves #5553.

The `[SD]GEMM_DEFAULT_[PQR]` parameters have been tuned to improve [SD]GEMM performance in a multi-process evaluation that uses all cores of the A64FX. The improvement is shown in the left and center figures, which compare OpenBLAS v0.3.30 with the modified version (labeled "update"). I also confirmed that performance improves in the single-process evaluation, shown in the right figure.

[Figure 1]

While performance improves in most Level 3 BLAS kernels, it degrades in the kernels related to triangular matrices (TRMM and TRSM), for the same reason described in Issue #4742.

[Figure 2]

The figures above show the performance change in GEMM, TRMM and TRSM. To understand the extent of the degradation in TRMM and TRSM, I evaluated the performance ratio relative to v0.3.30 for sizes up to 5,000 and summarized the results in the table below.

| kernel | update/v0.3.30 | (update/v0.3.30)-1 |
| --- | --- | --- |
| dgemm.nn | 1.0846 | +0.0846 |
| dtrmm.n | 0.9268 | -0.0732 |
| dtrsm.n | 0.9398 | -0.0602 |

This indicates that, although TRMM and TRSM lose part of their performance, there are benefits to fine-tuning the `[SD]GEMM_DEFAULT_[PQR]` parameters for A64FX.
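For readers who want to check the last column, it is just the ratio minus one. The following is a minimal sketch that recomputes it from the ratios reported in the table; the names `ratios`, `relative_change`, and `changes` are illustrative and not part of OpenBLAS or its benchmark tooling.

```python
# Ratios (update / v0.3.30) copied from the table in this pull request.
ratios = {
    "dgemm.nn": 1.0846,
    "dtrmm.n": 0.9268,
    "dtrsm.n": 0.9398,
}


def relative_change(ratio: float) -> float:
    """Return the relative change versus the baseline, rounded to 4 decimals."""
    return round(ratio - 1.0, 4)


# Hypothetical helper mapping: kernel name -> relative change.
changes = {kernel: relative_change(r) for kernel, r in ratios.items()}

for kernel, delta in changes.items():
    # A positive value means the updated build is faster than v0.3.30.
    print(f"{kernel:10s} {ratios[kernel]:.4f} {delta:+.4f}")
```

Read this way, DGEMM gains about 8.5% while DTRMM and DTRSM lose about 7.3% and 6.0% respectively.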

hideaki-motoki avatar Nov 28 '25 05:11 hideaki-motoki

Hi @hideaki-motoki -san

Overall LGTM.

For the performance comparisons between v0.3.30 and this updated version, could you clarify the build configuration? In particular, was OpenBLAS built with USE_OPENMP=1 or USE_OPENMP=0? This will help interpret the multi-process results on A64FX.

abhishek-iitmadras avatar Nov 29 '25 19:11 abhishek-iitmadras

Hi, @abhishek-iitmadras -san. Thank you for reviewing the results.

> For the performance comparisons between v0.3.30 and this updated version, could you clarify the build configuration? In particular, was OpenBLAS built with USE_OPENMP=1 or USE_OPENMP=0?

It was built with USE_OPENMP=1 as follows: `make DYNAMIC_ARCH=1 USE_THREAD=1 USE_OPENMP=1 NUM_THREADS=256`

hideaki-motoki avatar Dec 01 '25 04:12 hideaki-motoki