OpenBLAS Setting optimized `[SD]GEMM_DEFAULT_[PQR]` parameters for `A64FX`

Resolves #5553. The [SD]GEMM_DEFAULT_[PQR] parameters have been tuned to improve [SD]GEMM performance in a multi-process evaluation that uses all cores of the A64FX. The resulting [SD]GEMM improvement is shown in the left and center figures. In this pull request, performance is compared between OpenBLAS v0.3.30 and the modified version (labeled "update"). I also confirmed that performance improves in a single-process evaluation, shown in the right figure.

While performance improves for most Level 3 BLAS kernels, it degrades for the triangular-matrix kernels (TRMM and TRSM), for the same reason described in issue #4742. The figures above show the performance change in GEMM, TRMM, and TRSM. To quantify the degradation in TRMM and TRSM, I evaluated the performance ratio relative to v0.3.30 for sizes up to 5,000 and summarized the results in the table below.

| kernel | update/v0.3.30 | (update/v0.3.30)-1 |
|---|---:|---:|
| dgemm.nn | 1.0846 | +0.0846 |
| dtrmm.n | 0.9268 | -0.0732 |
| dtrsm.n | 0.9398 | -0.0602 |

This indicates that, although TRMM and TRSM lose part of their performance, fine-tuning the [SD]GEMM_DEFAULT_[PQR] parameters for A64FX is beneficial overall.
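For readers who want to check the last column of the table, here is a minimal sketch of the arithmetic. It only assumes the three ratios listed above; the names `ratios` and `relative_change` are illustrative and not part of OpenBLAS or its benchmark scripts.

```python
# Ratios copied from the table above (update / v0.3.30).
ratios = {
    "dgemm.nn": 1.0846,
    "dtrmm.n": 0.9268,
    "dtrsm.n": 0.9398,
}


def relative_change(ratio):
    """Return (update / v0.3.30) - 1, rounded to four decimals."""
    return round(ratio - 1.0, 4)


# Relative change per kernel: positive means the update is faster.
changes = {kernel: relative_change(r) for kernel, r in ratios.items()}

for kernel, change in changes.items():
    # Print both the raw difference and the same value as a percentage.
    print(f"{kernel:10s} {change:+.4f} ({change * 100:+.2f}%)")
```

Read as percentages, this is roughly an 8.5% gain for DGEMM against a 7.3% loss for DTRMM and a 6.0% loss for DTRSM.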

Nov 28 '25 05:11 hideaki-motoki

Hi @hideaki-motoki -san

Overall LGTM.

For the performance comparisons between v0.3.30 and this updated version, could you clarify the build configuration? In particular, was OpenBLAS built with USE_OPENMP=1 or USE_OPENMP=0? This will help interpret the multi-process results on A64FX.

Nov 29 '25 19:11 abhishek-iitmadras

Hi, @abhishek-iitmadras -san. Thank you for reviewing the results.

> For the performance comparisons between v0.3.30 and this updated version, could you clarify the build configuration? In particular, was OpenBLAS built with USE_OPENMP=1 or USE_OPENMP=0?

It was built with USE_OPENMP=1, using the following command: `make DYNAMIC_ARCH=1 USE_THREAD=1 USE_OPENMP=1 NUM_THREADS=256`

Dec 01 '25 04:12 hideaki-motoki