blasfeo blas_ API: for sgemm of armv8a, only 4x4 microkernel can be used?

Open AnonymousYWL opened this issue 4 years ago • 9 comments

Oct 21 '20 12:10 AnonymousYWL

Hi, where did you see that only the 4x4 microkernel can be used? Actually, in the BLAS_API sgemm algorithm for ARM Cortex A53 and A57, the 8x8 kernel is implemented and used: https://github.com/giaf/blasfeo/blob/master/blasfeo_hp_cm/sgemm.c#L431

Oct 24 '20 21:10 giaf

Does code size affect the performance of small GEMMs?

Nov 14 '20 03:11 AnonymousYWL

In general I don't expect code size to affect the performance of small GEMMs much, at least once the code is loaded in the instruction cache, as is the case with multiple calls to GEMM routines.

Nov 14 '20 11:11 giaf

What if it runs only once?

Nov 14 '20 11:11 AnonymousYWL

Then for small matrices the overhead of loading data and code from main memory may be the limiting factor. But it is difficult to say a priori; you should benchmark/profile your application.

Nov 14 '20 18:11 giaf

(Some answers given here are also applicable here.)

Nov 16 '20 08:11 hfp

@hfp thanks for sharing the link to your issue, interesting reading!

Nov 16 '20 08:11 giaf

Thank you for your previous reply. I would like to ask: is it reasonable to run multiple times and average the performance of small-scale GEMM?

Nov 16 '20 13:11 AnonymousYWL

IMO it is, as this is a rather common case in practice. In many cases, the (small) matrices are already in cache as the result of some previous operation, and the same applies to the code. In particular, in BLASFEO the "nano-kernels" are special functions shared between several linear algebra routines, so it is very likely that they keep being used and stay in cache.

On the other hand, you can always build an example where both code and data are cold. At the end of the day, it depends on your specific application, and since you didn't share much information about it, it's up to you to judge it.

Nov 16 '20 22:11 giaf
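
The distinction discussed above (a single first call versus the average over many repeated calls) can be sketched as a small timing harness. This is only an illustration of the measurement pattern, not BLASFEO code: `sgemm_naive` and `time_first_and_repeated` are hypothetical names, and the pure-Python matrix product is a stand-in for a real sgemm call. Interpreter overhead dominates here, so the absolute numbers say nothing about BLASFEO kernels, and the first call is only an approximation of a cold start, since the freshly initialized operands are likely still in cache.

```python
import statistics
import time


def sgemm_naive(m, n, k, a, b):
    """Hypothetical stand-in for sgemm: returns C = A * B (lists of rows)."""
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c


def time_first_and_repeated(fn, repeats=1000):
    """Time the first call on its own, then `repeats` further calls.

    Returns (first_ns, mean_ns, median_ns); the mean and median cover
    only the repeated calls, not the first one.
    """
    t0 = time.perf_counter_ns()
    fn()
    first_ns = time.perf_counter_ns() - t0

    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - t0)
    return first_ns, statistics.mean(samples), statistics.median(samples)


if __name__ == "__main__":
    n = 8  # small problem size, chosen arbitrarily for the sketch
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    first, mean, median = time_first_and_repeated(
        lambda: sgemm_naive(n, n, n, a, b)
    )
    print(f"first call: {first} ns")
    print(f"repeated calls: mean {mean:.0f} ns, median {median} ns")
```

Reporting the first call separately from the repeated calls keeps both cases visible; which of the two numbers matters depends on how the application actually calls GEMM. For measurements that reflect a real kernel, the same pattern would be applied in C around the actual sgemm call.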
