OpenBLAS
Element ordering of inner GEMM kernels
I'm working on optimizing the inner GEMM kernels for RISC-V. I'm confused about how the arrays are arranged by the time S/DGEMMKERNEL is called. The ba[] and bb[] array arguments appear to be arranged so that ba[] is row major and bb[] is column major, which turns the matrix multiply into a series of dot products.
Furthermore, elements of ba[] and bb[] are rearranged, such that a 2x2 (or whatever size) block of elements is arranged contiguously. I guess this is to improve locality?
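For readers following along, the block rearrangement described above resembles the panel packing used in Goto-style GEMM. Below is a minimal sketch of that general idea, not OpenBLAS's actual packing code: the names `pack_a`, `pack_b`, `micro_kernel`, and the 2x2 tile size `MR`/`NR` are all hypothetical, and the exact element order OpenBLAS uses on a given target may differ.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 2, NR = 2 };  /* hypothetical micro-tile size, chosen for illustration */

/* Pack column-major A (m x k, leading dimension lda) into panels of MR rows.
   Inside a panel, the MR elements belonging to one k index are adjacent. */
void pack_a(size_t m, size_t k, const double *a, size_t lda, double *pa)
{
    assert(m % MR == 0);  /* sketch only: no edge handling */
    for (size_t i = 0; i < m; i += MR)
        for (size_t p = 0; p < k; p++)
            for (size_t ii = 0; ii < MR; ii++)
                *pa++ = a[(i + ii) + p * lda];
}

/* Pack column-major B (k x n, leading dimension ldb) into panels of NR columns.
   Inside a panel, the NR elements belonging to one k index are adjacent. */
void pack_b(size_t k, size_t n, const double *b, size_t ldb, double *pb)
{
    assert(n % NR == 0);  /* sketch only: no edge handling */
    for (size_t j = 0; j < n; j += NR)
        for (size_t p = 0; p < k; p++)
            for (size_t jj = 0; jj < NR; jj++)
                *pb++ = b[p + (j + jj) * ldb];
}

/* Micro-kernel: one MR x NR tile of column-major C += (A panel) * (B panel).
   Both packed panels are read strictly sequentially. */
void micro_kernel(size_t k, const double *pa, const double *pb,
                  double *c, size_t ldc)
{
    double acc[MR][NR] = {{0.0}};
    for (size_t p = 0; p < k; p++) {
        for (size_t ii = 0; ii < MR; ii++)
            for (size_t jj = 0; jj < NR; jj++)
                acc[ii][jj] += pa[ii] * pb[jj];
        pa += MR;
        pb += NR;
    }
    for (size_t jj = 0; jj < NR; jj++)
        for (size_t ii = 0; ii < MR; ii++)
            c[ii + jj * ldc] += acc[ii][jj];
}
```

In this sketch the packing step is what buys the locality: the micro-kernel walks both panels with unit stride regardless of the original leading dimensions, at the cost of an extra copy of each operand.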
Is there any way to write the inner kernel so that it receives A[] and B[] in the same arrangement (both row major or both column major), without element reordering? The RISC-V vector extension makes GEMM very simple to implement when the operands are arranged this way, since loads and stores of long, contiguously stored arrays are what the hardware is optimized for.
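To make the request concrete, here is a rough sketch of the kind of kernel meant: both operands column major, no packing, and an innermost loop that runs with unit stride over a whole column. This is plain scalar C with a hypothetical function name, not the OpenBLAS kernel interface and not RISC-V intrinsics; the assumption is that the inner loop is the part a long-vector implementation would replace.

```c
#include <stddef.h>

/* C (m x n) += A (m x k) * B (k x n), all three column-major, no packing.
   Hypothetical helper for illustration only. */
void gemm_colmajor_axpy(size_t m, size_t n, size_t k,
                        const double *a, size_t lda,
                        const double *b, size_t ldb,
                        double *c, size_t ldc)
{
    for (size_t j = 0; j < n; j++) {
        double *ccol = c + j * ldc;            /* column j of C */
        for (size_t p = 0; p < k; p++) {
            const double bpj = b[p + j * ldb]; /* scalar B[p][j] */
            const double *acol = a + p * lda;  /* column p of A */
            /* Unit-stride AXPY over a full column: the loop a vector
               unit would execute with contiguous loads and stores. */
            for (size_t i = 0; i < m; i++)
                ccol[i] += bpj * acol[i];
        }
    }
}
```

Note that this form streams all of A once per column of C, so for large matrices it gives up the cache reuse that the packed, blocked scheme is designed to provide.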
You can redefine anything you find in /common_macro.h
I suspect you would need to change driver/level3/level3.c for that, basically adding an "#if defined(RISCV)" branch that skips all the rearranging of the arguments. Unfortunately the code is not really documented; the best we have is https://github.com/xianyi/OpenBLAS/wiki/Developer-manual (which just gives a brief overview of the code organization and a link to K. Goto's original paper).
I see, thanks. I'm starting to realize that much of the code is tuned heavily for packed-SIMD architectures. Unfortunately, this makes optimizing the package for a vector architecture somewhat cumbersome.