Setting optimized `[SD]GEMM_DEFAULT_[PQR]` parameters for `A64FX`
Resolves #5553.
The `[SD]GEMM_DEFAULT_[PQR]` parameters have been tuned to improve [SD]GEMM performance in a multi-process evaluation that uses all cores of the A64FX. The left and center figures show the resulting [SD]GEMM improvement. Throughout this pull request, performance is compared between OpenBLAS v0.3.30 and the modified version (labeled "update"). I also confirmed that performance improves in a single-process evaluation, as shown in the right figure.
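For readers unfamiliar with these parameters, here is a minimal sketch of how three blocking sizes can shape a GEMM loop nest. It assumes the usual reading in which P blocks the rows of A (the M dimension), Q blocks the shared K dimension, and R blocks the columns of B (the N dimension). The function `blocked_gemm` and its arguments are hypothetical names for illustration; this is plain Python and not the OpenBLAS driver or its A64FX kernels.

```python
def blocked_gemm(a, b, p, q, r):
    """Compute C = A * B using three levels of blocking (illustrative only).

    p, q, r are hypothetical stand-ins for the P, Q, R block sizes:
    p blocks the M dimension, q the K dimension, r the N dimension.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for js in range(0, n, r):              # R-sized panel of columns of B and C
        for ls in range(0, k, q):          # Q-sized slice of the shared dimension
            for is_ in range(0, m, p):     # P-sized block of rows of A and C
                # Inner block update; a real library calls a tuned kernel here.
                for i in range(is_, min(is_ + p, m)):
                    for l in range(ls, min(ls + q, k)):
                        a_il = a[i][l]
                        for j in range(js, min(js + r, n)):
                            c[i][j] += a_il * b[l][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(blocked_gemm(a, b, p=1, q=2, r=1))
```

The block sizes do not change the result, only the order in which memory is touched, which is why tuning them affects performance rather than correctness.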
While performance improves for most Level 3 BLAS kernels, it degrades for the triangular-matrix kernels (TRMM and TRSM), for the same reason described in issue #4742.
The figures above show the performance change in GEMM, TRMM, and TRSM.
To quantify the performance degradation in TRMM and TRSM, I evaluated the performance ratio relative to v0.3.30 for sizes up to 5,000 and summarized the results in the table below.
| kernel | update/v0.3.30 | (update/v0.3.30)-1 |
|---|---|---|
| dgemm.nn | 1.0846 | +0.0846 |
| dtrmm.n | 0.9268 | -0.0732 |
| dtrsm.n | 0.9398 | -0.0602 |
This indicates that, although part of the TRMM and TRSM performance decreases, there are benefits to fine-tuning the `[SD]GEMM_DEFAULT_[PQR]` parameters for A64FX.
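The third column of the table is simply the ratio minus one. A small sketch of that arithmetic, using only the ratios reported in the table (the helper name `relative_change` is made up for illustration):

```python
# Ratios update/v0.3.30 copied from the table above.
ratios = {
    "dgemm.nn": 1.0846,
    "dtrmm.n": 0.9268,
    "dtrsm.n": 0.9398,
}


def relative_change(ratio):
    """Return the fractional change implied by a performance ratio."""
    return ratio - 1.0


for kernel, ratio in ratios.items():
    change = relative_change(ratio)
    # Print both the fraction and the equivalent percentage.
    print(f"{kernel}: {change:+.4f} ({change * 100:+.2f}%)")
```

Read this way, the table corresponds to roughly an 8.5% gain for `dgemm.nn` against losses of about 7.3% for `dtrmm.n` and 6.0% for `dtrsm.n`.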
Hi @hideaki-motoki -san
Overall LGTM.
For the performance comparisons between v0.3.30 and this updated version, could you clarify the build configuration? In particular, was OpenBLAS built with USE_OPENMP=1 or USE_OPENMP=0? This will help interpret the multi-process results on A64FX.
Hi, @abhishek-iitmadras -san. Thank you for reviewing the results.
> For the performance comparisons between v0.3.30 and this updated version, could you clarify the build configuration? In particular, was OpenBLAS built with USE_OPENMP=1 or USE_OPENMP=0?
It was built with USE_OPENMP=1 as follows:
make DYNAMIC_ARCH=1 USE_THREAD=1 USE_OPENMP=1 NUM_THREADS=256