blasfeo
blas_ API: for sgemm of armv8a, only 4x4 microkernel can be used?
Hi, where did you see that only the 4x4 microkernel can be used? Actually, in the BLAS_API sgemm algorithm for ARM Cortex A53 and A57, the 8x8 kernel is implemented and used https://github.com/giaf/blasfeo/blob/master/blasfeo_hp_cm/sgemm.c#L431
Does code size affect the performance of small GEMMs?
In general I don't expect code size to affect the performance of small GEMMs much, at least once the code is loaded into the instruction cache, as happens with multiple calls to GEMM routines.
What if it runs only once?
Then for small matrices the overhead of loading data and code from main memory may be the limiting factor. But it is difficult to say a priori; you should benchmark/profile your application.
( Some answers given here are applicable to this question as well. )
@hfp thanks for sharing the link to your issue, interesting reading!
Thank you for your previous reply. I would like to ask: is it reasonable to run multiple times and average the performance of small-scale GEMM?
IMO it is, as this is a rather common case in practice. In many cases, the (small) matrices are already in cache as the result of some previous operation, and the same applies to the code. In particular, in BLASFEO the "nano-kernels" are special functions shared between several linear algebra routines, so it is very likely that they keep being used and stay in cache.
On the other hand, you can always build an example where both code and data are cold. At the end of the day, it depends on your specific application, and since you didn't share much information about it, it's up to you to judge it.
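To make the "first call vs. averaged repeated calls" distinction concrete, here is a minimal timing sketch. It rests on several assumptions: `naive_sgemm` is a hypothetical stand-in for whatever GEMM routine you actually call (it is not BLASFEO's kernel or API), the helper names are made up for illustration, and the first call is only a rough proxy for a cold run, since nothing here flushes the caches. `timespec_get` is a wall-clock timer, so very small sizes may fall below its resolution.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the routine under test: naive column-major
   single-precision GEMM, C = A * B, with all matrices n x n. */
static void naive_sgemm(int n, const float *A, const float *B, float *C)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[i + k * n] * B[k + j * n];
            C[i + j * n] = acc;
        }
    }
}

/* Wall-clock time in seconds, using the C11 timespec_get interface. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Times the first call on its own, then the average over nrep further
   calls that run with code and data most likely already in cache. */
static void time_first_vs_repeated(int n, int nrep,
                                   const float *A, const float *B, float *C,
                                   double *t_first, double *t_avg)
{
    assert(nrep > 0);

    double t0 = now_seconds();
    naive_sgemm(n, A, B, C);            /* first call: code/data possibly cold */
    double t1 = now_seconds();

    for (int r = 0; r < nrep; r++)      /* repeated calls: warm caches */
        naive_sgemm(n, A, B, C);
    double t2 = now_seconds();

    *t_first = t1 - t0;
    *t_avg = (t2 - t1) / nrep;
}
```

Calling `time_first_vs_repeated(n, nrep, A, B, C, &t_first, &t_avg)` and comparing the two numbers gives a first impression of how much the warm-up matters for your matrix sizes. Which of the two is the relevant figure depends on how your application actually calls the routine.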