ginkgo
WIP: SVE intrinsics implementation of CSR SpMV and Merge-SpMV algorithms
This PR provides two implementations of CSR SpMV ("traditional" and Merge-SpMV from https://github.com/dumerrill/merge-spmv/raw/master/merge-based-spmv-sc16-preprint.pdf ) using SVE intrinsics for double precision. The PR is far from integration-ready and should be considered more of an example of what such an implementation could look like. Eventually, the suggestions from PR #1497 about RHS, integration (`a->get_strategy()`), and OpenMP scheduling should also be applied. To ease testing, I put the current implementation in place of the OpenMP CSR SpMV, although it should probably live in a completely separate (completely new?) part of Ginkgo.
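For readers unfamiliar with Merge-SpMV, here is a minimal scalar sketch of the merge-path decomposition described in the linked paper. It is not the code from this PR: it has no SVE intrinsics, and the `CsrMatrix` struct, the function names, and the `num_chunks` parameter (standing in for the number of threads) are hypothetical and exist only for illustration. The chunk loop is written serially; in a parallel version each chunk would be handled by one thread and the carry fix-up would run afterwards.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal CSR container for this sketch (not Ginkgo's Csr class).
struct CsrMatrix {
    std::size_t num_rows;
    std::vector<std::size_t> row_ptrs;  // size num_rows + 1
    std::vector<std::size_t> col_idxs;  // size nnz
    std::vector<double> values;         // size nnz
};

// Binary search for the point where `diagonal` crosses the merge path of
// the row-end offsets and the nonzero indices 0..nnz-1.
// Returns (row, nonzero) with row + nonzero == diagonal.
std::pair<std::size_t, std::size_t> merge_path_search(
    std::size_t diagonal, const std::size_t* row_end, std::size_t num_rows,
    std::size_t nnz)
{
    std::size_t lo = diagonal > nnz ? diagonal - nnz : 0;
    std::size_t hi = std::min(diagonal, num_rows);
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (row_end[mid] <= diagonal - mid - 1) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return {lo, diagonal - lo};
}

// y = A * x, with the work (rows + nonzeros) split evenly into num_chunks.
std::vector<double> merge_spmv(const CsrMatrix& a,
                               const std::vector<double>& x,
                               std::size_t num_chunks)
{
    assert(num_chunks > 0);
    assert(a.row_ptrs.size() == a.num_rows + 1);
    const std::size_t nnz = a.values.size();
    const std::size_t path_len = a.num_rows + nnz;
    const std::size_t per_chunk = (path_len + num_chunks - 1) / num_chunks;
    const std::size_t* row_end = a.row_ptrs.data() + 1;

    std::vector<double> y(a.num_rows, 0.0);
    std::vector<std::size_t> carry_row(num_chunks);
    std::vector<double> carry_val(num_chunks);

    // Each iteration is independent and could be executed by one thread.
    for (std::size_t chunk = 0; chunk < num_chunks; ++chunk) {
        const std::size_t d0 = std::min(chunk * per_chunk, path_len);
        const std::size_t d1 = std::min(d0 + per_chunk, path_len);
        const auto start = merge_path_search(d0, row_end, a.num_rows, nnz);
        const auto stop = merge_path_search(d1, row_end, a.num_rows, nnz);
        std::size_t row = start.first;
        std::size_t nz = start.second;
        double sum = 0.0;
        // Rows that end inside this chunk are written directly.
        for (; row < stop.first; ++row) {
            for (; nz < row_end[row]; ++nz) {
                sum += a.values[nz] * x[a.col_idxs[nz]];
            }
            y[row] = sum;
            sum = 0.0;
        }
        // Leading part of a row that ends in a later chunk is carried out.
        for (; nz < stop.second; ++nz) {
            sum += a.values[nz] * x[a.col_idxs[nz]];
        }
        carry_row[chunk] = stop.first;
        carry_val[chunk] = sum;
    }

    // Serial fix-up: add the carried partial sums to the rows they belong to.
    for (std::size_t chunk = 0; chunk < num_chunks; ++chunk) {
        if (carry_row[chunk] < a.num_rows) {
            y[carry_row[chunk]] += carry_val[chunk];
        }
    }
    return y;
}
```

Because the split is over rows plus nonzeros, a single very long row is shared between several chunks while runs of empty rows still count as work. This load-balancing property is what distinguishes the approach from the row-wise kernel.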
The motivation for having code with SVE intrinsics is performance. SVE intrinsics implementations can bring significantly better vectorization on Arm machines supporting SVE (Fujitsu A64FX, Amazon Graviton, Nvidia Grace...), since GCC auto-vectorization of the CSR kernel seems to be poor. We have measured up to 80% performance improvement for bone010.mtx on Fujitsu A64FX and up to 36% improvement for thermal2.mtx on an Amazon Graviton3 machine when using this implementation with SVE intrinsics.
Unlike AVX intrinsics, SVE allows vector-length-agnostic implementations, which leads to cleaner code. The code in the proposed PR works on both A64FX (512-bit vector length) and Graviton3 (256-bit vector length).
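To illustrate what vector-length-agnostic means in practice, here is a minimal sketch of a row-wise CSR SpMV for double precision. It is not the kernel from this PR. The function names are hypothetical, 64-bit column indices are assumed so that they can feed the gather directly, and a scalar fallback is included only so the snippet also compiles on machines without SVE. The loop never hard-codes a vector width: `svcntd()` reports the number of 64-bit lanes at run time, and the `svwhilelt_b64_s64` predicate masks off the lanes past the end of the row, so no separate remainder loop is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Sum over k in [begin, end) of values[k] * x[col_idxs[k]] for one CSR row.
inline double csr_row_dot(const double* values, const std::int64_t* col_idxs,
                          const double* x, std::int64_t begin,
                          std::int64_t end)
{
#if defined(__ARM_FEATURE_SVE)
    svfloat64_t acc = svdup_n_f64(0.0);
    // Number of 64-bit lanes per vector, known only at run time.
    const std::int64_t step = static_cast<std::int64_t>(svcntd());
    for (std::int64_t k = begin; k < end; k += step) {
        // Lanes with k + lane < end are active; the rest are masked off.
        const svbool_t pg = svwhilelt_b64_s64(k, end);
        const svfloat64_t v = svld1_f64(pg, values + k);
        const svint64_t c = svld1_s64(pg, col_idxs + k);
        // Gather x[col_idxs[k + lane]] for the active lanes.
        const svfloat64_t xv = svld1_gather_s64index_f64(pg, x, c);
        // acc += v * xv on active lanes; inactive lanes keep acc.
        acc = svmla_f64_m(pg, acc, v, xv);
    }
    // Horizontal sum over all lanes of the accumulator.
    return svaddv_f64(svptrue_b64(), acc);
#else
    // Scalar fallback (assumption: only here so the sketch builds anywhere).
    double sum = 0.0;
    for (std::int64_t k = begin; k < end; ++k) {
        sum += values[k] * x[col_idxs[k]];
    }
    return sum;
#endif
}

// y = A * x for a CSR matrix given as three plain arrays.
std::vector<double> csr_spmv_vla(const std::vector<std::int64_t>& row_ptrs,
                                 const std::vector<std::int64_t>& col_idxs,
                                 const std::vector<double>& values,
                                 const std::vector<double>& x)
{
    assert(!row_ptrs.empty());
    assert(col_idxs.size() == values.size());
    const std::size_t num_rows = row_ptrs.size() - 1;
    std::vector<double> y(num_rows);
    for (std::size_t row = 0; row < num_rows; ++row) {
        y[row] = csr_row_dot(values.data(), col_idxs.data(), x.data(),
                             row_ptrs[row], row_ptrs[row + 1]);
    }
    return y;
}
```

The same binary should run unchanged on 256-bit and 512-bit hardware when built with an SVE-enabled target (for example `-march=armv8-a+sve` with GCC); only the number of loop iterations per row differs. The vectorized path sums in a different order than the scalar one, so results may differ in the last bits for general floating-point data.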
On the other hand, AFAIK there is no easy way to deal with different datatypes (double, float, complex...), and one needs separate intrinsics implementations. The code for the proposed PR works only for double precision.
Finally, note that the OpenMP parallelization is commented out in the code. The reason is a known internal bug in the GCC compiler ( https://gcc.gnu.org/bugzilla//show_bug.cgi?id=101018 ) which sometimes occurs when OpenMP pragmas are combined with SVE intrinsics. I hope that other compilers do not have this issue, and that the already committed fix to GCC is upstreamed soon. Once this problem is fixed, one should simply uncomment the OpenMP pragmas in this PR, and the code should work in parallel.