Arbor SIMD Library Refactoring
Motivation
Eventually factor out Arbor SIMD into a separate project and make it useful for users outside of Arbor.
Current state
Arbor SIMD provides an API that is a variation of the std::experimental::simd API. It is neither a subset nor a superset of std::experimental::simd.
Distinctive features of Arbor SIMD:
- gather/scatter support.
- SVE backend.
Observations
- According to N4808, std::experimental::simd provides explicit conversions from/to the underlying type: explicit operator implementation-defined() const; and explicit simd(const implementation-defined&);
- SVE fundamentally doesn't fit the std::experimental::simd backend model.
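One likely reason for the second observation (an assumption; the thread does not spell it out) is that the SVE ACLE vector types such as svfloat64_t are sizeless by default: their lane count is only known at run time, and they cannot be stored as class data members, which is what a simd<T, Abi> wrapper would need. A free-function interface sidesteps this. The sketch below is illustrative only: the namespace sve_sketch and the functions lanes, broadcast, copy_from, add and copy_to are hypothetical names, and the fallback width of 4 is arbitrary. The SVE intrinsics used are the standard ones from arm_sve.h.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#else
#include <array>
#endif

namespace sve_sketch {  // hypothetical namespace, for illustration only

#if defined(__ARM_FEATURE_SVE)

// SVE path: vec is a sizeless builtin type. It can be passed to and returned
// from functions, but it cannot be a data member of a wrapper class.
using vec = svfloat64_t;

inline std::size_t lanes() { return svcntd(); }  // lane count known only at run time
inline vec broadcast(double x) { return svdup_n_f64(x); }
inline vec copy_from(const double* p) { return svld1_f64(svptrue_b64(), p); }
inline vec add(vec a, vec b) { return svadd_f64_x(svptrue_b64(), a, b); }
inline void copy_to(vec v, double* p) { svst1_f64(svptrue_b64(), p, v); }

#else

// Fallback path: a fixed-width value type with the same free-function interface.
// A real fallback could wrap std::experimental::simd instead (assumption).
struct vec {
    std::array<double, 4> lane;
};

inline std::size_t lanes() { return 4; }

inline vec broadcast(double x) {
    vec v;
    v.lane.fill(x);
    return v;
}

inline vec copy_from(const double* p) {
    vec v;
    for (std::size_t i = 0; i < 4; ++i) v.lane[i] = p[i];
    return v;
}

inline vec add(vec a, vec b) {
    for (std::size_t i = 0; i < 4; ++i) a.lane[i] += b.lane[i];
    return a;
}

inline void copy_to(vec v, double* p) {
    for (std::size_t i = 0; i < 4; ++i) p[i] = v.lane[i];
}

#endif

}  // namespace sve_sketch
```

Calling code is written once against the free functions, for example copy_to(add(copy_from(in), broadcast(42.0)), out), and advances through its arrays in steps of lanes().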
Proposal
Let us split Arbor SIMD into two libraries:
- arbor-simd-indirect. It will depend on std::experimental::simd and will provide a gather/scatter API in the form of free functions that accept std::experimental::simd parameters. Based on the compilation target, scalar type and width, it will dispatch to the proper intrinsic, using static_casts to do the SIMD wrapping/unwrapping. The library will not support SVE.
- arbor-simd-sve. It will depend on std::experimental::simd and arbor-simd-indirect and will provide a SIMD API adapted to SVE. This API will consist of makers that return vectors (like arbsve::broadcast(42) or arbsve::copy_from(ptr)) and functions that accept vectors. If the compilation target has SVE intrinsics, the implementation will forward directly to them; otherwise it will fall back to std::experimental::simd + arbor-simd-indirect.
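To make the dispatch idea for arbor-simd-indirect concrete, here is a minimal sketch, not the proposed implementation. The type vec is a hypothetical stand-in for std::experimental::simd so that the example compiles without <experimental/simd>; the namespace and the function names gather and scatter are also assumptions. In the proposal, unwrapping to the intrinsic type would be a static_cast rather than the load/store used here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

#if defined(__AVX2__)
#include <immintrin.h>
#endif

namespace simd_indirect_sketch {  // hypothetical namespace, for illustration only

// Hypothetical stand-in for std::experimental::simd<T, Abi>:
// a fixed-width value type with per-lane access.
template <typename T, std::size_t N>
struct vec {
    std::array<T, N> lane{};
    static constexpr std::size_t size() { return N; }
    T& operator[](std::size_t i) { return lane[i]; }
    const T& operator[](std::size_t i) const { return lane[i]; }
};

// Generic fallback: gather lane by lane, out[i] = base[index[i]].
template <typename T, std::size_t N>
vec<T, N> gather(const T* base, const vec<int, N>& index) {
    vec<T, N> out;
    for (std::size_t i = 0; i < N; ++i) out[i] = base[index[i]];
    return out;
}

// Generic fallback: scatter lane by lane, base[index[i]] = value[i].
// With duplicate indices the highest lane wins.
template <typename T, std::size_t N>
void scatter(const vec<T, N>& value, T* base, const vec<int, N>& index) {
    for (std::size_t i = 0; i < N; ++i) base[index[i]] = value[i];
}

#if defined(__AVX2__)
// Overload selected for 4 x double when the compilation target has AVX2.
// Overload resolution prefers this non-template function over the template.
inline vec<double, 4> gather(const double* base, const vec<int, 4>& index) {
    // Unwrap the index vector (a static_cast in the proposal).
    __m128i idx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(index.lane.data()));
    // Scale 8: indices count doubles, the instruction wants byte offsets.
    __m256d v = _mm256_i32gather_pd(base, idx, 8);
    // Wrap the result again.
    vec<double, 4> out;
    _mm256_storeu_pd(out.lane.data(), v);
    return out;
}
#endif

}  // namespace simd_indirect_sketch
```

Callers see one gather/scatter interface; which body runs is decided at compile time from the target, scalar type and width, as described in the first bullet.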
Thanks for the proposal @antonf.
I feel that refactoring to use std::experimental::simd is impractical while it is not part of the standard.
- it is available only in GCC 11, while the minimum version required by Arbor is GCC 8 (and Clang)
- we need to understand the performance tradeoffs, and check support for features such as AVX512 in the std::experimental implementation. For this we would have to conduct performance benchmarks.
As a rule in Arbor, we implement future standard library features internally and use those implementations until they can be replaced by mature ones available in our minimum compiler versions. Given this, I think it is too early to refactor the SIMD library to be based around std::experimental.
I certainly like the idea of splitting out the SVE side; it's really incompatible with the rest of the API.
Regarding std::experimental::simd:
- We could still factor our SIMD library into something that conforms to the std::experimental::simd interface, and an additional component that supports the gather/scatter/constraint semantics, with a view to swapping over to the standard implementation in the future.
- N4808, §9.7.7 provides cmath overloads for SIMD values; we can provide our own implementations with consistent numerics across back-ends under e.g. arb::math, both for SIMD and scalar values. Our optimized implementations, though, use low-level intrinsics rather than just the arithmetic operations provided by std::experimental::simd.
For our implementations of e.g. expm1, exprelr etc., which rely upon decomposition of the mantissa and exponent and such, we could implement a set of architecture-specific low-level operations which are then used within our generic implementations, or stick to writing things in terms of standard decomposition functions and arithmetic. The former would allow us to maintain (mostly) the performance; the latter could well be slower, but might make it easier to arrive at a robust implementation (proper support for subnormal numbers, etc.).
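As a rough illustration of the second option, here is a scalar exp written only with arithmetic and the standard scaling function std::ldexp. This is a sketch, not Arbor's implementation: the name exp_generic, the namespace, the degree-13 Taylor polynomial and the single-constant range reduction are all assumptions chosen for brevity, and a production version would use a tuned polynomial and a more careful reduction.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

namespace math_sketch {  // hypothetical namespace, for illustration only

// exp(x) = 2^n * exp(r), with n = round(x / ln 2) and r = x - n * ln 2,
// so that |r| is at most about ln(2)/2.
inline double exp_generic(double x) {
    if (std::isnan(x)) return x;
    if (x > 709.8) return std::numeric_limits<double>::infinity();  // overflows double
    if (x < -745.2) return 0.0;                                     // below the smallest subnormal

    const double ln2 = 0.6931471805599453;
    const double n = std::nearbyint(x / ln2);
    const double r = x - n * ln2;

    // Degree-13 Taylor polynomial of exp(r), evaluated in nested (Horner-like) form:
    // 1 + r/1 * (1 + r/2 * (1 + ... (1 + r/13))).
    double p = 1.0;
    for (int k = 13; k >= 1; --k) p = 1.0 + p * r / k;

    // std::ldexp applies the power of two and rounds correctly into the
    // subnormal range, which manual exponent-bit manipulation would not.
    return std::ldexp(p, static_cast<int>(n));
}

}  // namespace math_sketch
```

Because the scaling goes through std::ldexp, results in the subnormal range come out as gradual underflow rather than garbage, which is the kind of robustness the paragraph above refers to; the price is a library call where an intrinsic-based version would adjust the exponent bits directly.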
Hi, just happened across this issue while searching. Have you seen https://github.com/google/highway ? It's a C++ wrapper over intrinsics that supports SVE, RISC-V, AVX-512 and others. Would be happy to discuss if you're interested.
Hi @jan-wassenberg,
thanks for the suggestion. Highway looks pretty interesting, but it's unlikely we'll change our SIMD backend soon without a pressing need (RISC-V might pose such a need in the future). Just out of curiosity, how does highway compare to VC2 (https://github.com/vectorclass/version2)?
Just to note our requirements (mostly in terms of performant operations, since this is the motivator), which apply not only to highway but to any other choice as well:
- scatter store/gather load
- fast approximate mathematical functions: exp, pow, sqrt, log
- to a lesser degree: sin, cos, ...
Hi @thorstenhater, got it. Yes, RISC-V looks to be gathering momentum.
> how does highway compare to VC2
I very much respect Agner's work but he is clear that no instruction sets other than x86 will be supported.
> Just to note our requirements
Good to know. We have all of those except pow, and can help add that or other math functions if required. (For pow it really depends how much accuracy you want. A simple version can use log+exp already.)
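For reference, the simple log+exp formulation of pow looks like the scalar sketch below. The name pow_simple and its namespace are hypothetical, it is restricted to positive bases, and it uses the standard library's exp and log rather than any particular SIMD library's functions.

```cpp
#include <cassert>
#include <cmath>

namespace pow_sketch {  // hypothetical namespace, for illustration only

// pow(x, y) = exp(y * log(x)), valid for x > 0.
// The rounding error of y * log(x) is amplified by exp, so the relative
// error of the result grows roughly with |y * log(x)|.
inline double pow_simple(double x, double y) {
    assert(x > 0.0);  // zero and negative bases need separate handling
    return std::exp(y * std::log(x));
}

}  // namespace pow_sketch
```

A vectorized version would have the same shape with SIMD exp and log in place of the scalar calls; whether its accuracy is acceptable depends on the range of exponents in use.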