xsimd
[DISCUSSION] How to support Arm SVE
I'm investigating how to support Arm SVE/SVE2 [1] in xsimd.
SVE vectors are size agnostic. The register size (number of lanes) is determined at run time, and the corresponding C/C++ types are sizeless (incomplete types). E.g., SVE svint8_t maps to NEON int8x16_t but without a size. sizeof(svint8_t) does not compile, and it's illegal to declare a member of type svint8_t inside a struct.
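For illustration, a minimal sketch (not from the original report; assumes an SVE-enabled compiler, e.g. -march=armv8-a+sve) of what the sizeless types reject:
#include <arm_sve.h>
#include <cstdint>

// Sizeless SVE values can be passed around and returned by value...
svint8_t make_vec(int8_t x) { return svdup_n_s8(x); }

// ...but they have no compile-time size:
// sizeof(svint8_t);                   // error: 'sizeof' applied to a sizeless type
// struct wrapper { svint8_t data; };  // error: member 'data' has sizeless type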
For this reason, it looks like the xsimd core data structures xsimd_register and batch (derived from xsimd_register) do not support SVE, as the xsimd_register struct contains a data member of the SIMD register type [2]. Code snippet pasted below.
struct simd_register<SCALAR_TYPE, ISA> \
{ \
    using register_type = VECTOR_TYPE; \
    register_type data; \
    operator register_type() const noexcept { return data; } \
}; \
I studied Google Highway, which supports size-agnostic vectors (Arm SVE, RISC-V RVV). AFAIK, Highway only handles the SIMD register type, without holding a register value [3]. The API often requires an explicit register type argument. Below is a simplified Highway example that adds two arrays vertically.
#include <hwy/highway.h>
using namespace hwy::HWY_NAMESPACE;

void vsum_hwy(const int* x, const int* y, size_t size, int* z) {
    // generate an explicit vector_type (a "tag") from the element type
    const ScalableTag<int> vector_type;
    // Lanes(vector_type) generates runtime code for SVE/RVV, constexpr otherwise
    for (size_t i = 0; i < size; i += Lanes(vector_type)) {
        // Load/Store require the explicit vector_type
        auto vx = Load(vector_type, x + i);
        auto vy = Load(vector_type, y + i);
        auto vz = Add(vx, vy);
        Store(vz, vector_type, z + i);
    }
}
I would like to hear ideas on how to support the SVE sizeless types. The Highway approach may be inspiring, but I don't think xsimd should follow it.
A simpler but not ideal way is to only support fixed-size SVE, e.g., SVE-128, SVE-256, ..., SVE-2048. The user must recompile the code to match the vector size of the target machine.
[1] https://developer.arm.com/documentation/102476/0100
[2] https://github.com/xtensor-stack/xsimd/blob/8.1.0/include/xsimd/types/xsimd_register.hpp#L40
[3] https://github.com/google/highway/blob/master/g3doc/impl_details.md#vectors-vs-tags
cc @JohanMabille, @serge-sans-paille, @pitrou, @guyuqi
Ouch. Perhaps xsimd_register, in the case of SVE, can instead be a (pointer, length) pair or similar?
I don't think there's a hard requirement on what xsimd_register should alias to. Theoretically, it could even be an std::vector :-)
The only problem I foresee is that xsimd::batch has size as a constexpr member, and I'm pretty sure that's incompatible with SVE. We can turn that into a method call, but I guess we would need it to no longer be constexpr, which probably has plenty of impact...
Another way around would be to make SVE a parametric Arch, template <size_t LaneCount> struct sve, as we already do for fma3. That way we could have different SVE lane counts coexisting in the same code, but they would be hard-coded.
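A rough sketch of that idea (hypothetical, not existing xsimd code):
#include <cstddef>

// Hypothetical: SVE as a family of archs parameterized by register width,
// analogous to the existing fma3<...> wrapper archs; batch<float, sve<256>>
// would then have a fixed, compile-time number of lanes (256 / 32 = 8).
template <std::size_t Width>
struct sve
{
    static constexpr bool supported() noexcept { return true; }
    static constexpr bool available() noexcept { return true; }
    static constexpr std::size_t alignment() noexcept { return 16; }
    static constexpr char const* name() noexcept { return "sve"; }
};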
The only problem I foresee is that xsimd::batch has size as a constexpr member, and I'm pretty sure that's incompatible with SVE. We can turn that into a method call, but I guess we would need it to no longer be constexpr, which probably has plenty of impact...
It could be static constexpr for fixed-size ISAs and non-constexpr for runtime-sized ISAs?
Unfortunately not, because that would make some kernels fail to compile on some archs while compiling successfully on others. Think:
constexpr auto n = b.size;
for (size_t i = 0; i < 100; i += n)
    b *= b;
This (relatively idiomatic) code would be valid on AVX2 and not on SVE, and that's something we want to avoid.
Perhaps xsimd_register, in the case of SVE, can instead be a (pointer, length) pair or similar?
Not sure if the compiler can generate optimal code in this situation. I think SIMD data is supposed to live directly in hardware registers, not in stack or heap memory. I tested aliasing an SVE register over a dummy buffer and it looks fragile. https://godbolt.org/z/jxTerbhvq
Looks like fixed-size SVE is the only option. The compiler option -msve-vector-bits can be used to set the SVE register size (requires gcc-10+ and clang-12+). https://godbolt.org/z/89dT361hn
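A minimal sketch (assuming gcc-10+/clang-12+ and -msve-vector-bits=256) of what the fixed-size attribute buys us:
// Compile with: -march=armv8-a+sve -msve-vector-bits=256
#include <arm_sve.h>

typedef svint32_t fixed_svint32_t __attribute__((arm_sve_vector_bits(256)));

// The fixed-size alias has a size and can be a struct member,
// which the plain sizeless svint32_t cannot.
struct sve_register
{
    fixed_svint32_t data;
};
static_assert(sizeof(fixed_svint32_t) == 256 / 8, "32 bytes per 256-bit register");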
Given the complexity of supporting variable-length vectors (which may not be a very common use case in practice), fixed-size SVE looks like a reasonable compromise to me.
As it's non-trivial work, I would like to hear from the community whether we think fixed-size SVE is useful to xsimd. If so, I will propose a draft for review soon.
What would be the constraints exactly? If running on a 512-bit SVE CPU, would fixed-size 256-bit SVE be able to execute?
Also, in your example, can we not pass the bitsize directly instead of __ARM_FEATURE_SVE_BITS?
What would be the constraints exactly? If running on a 512-bit SVE CPU, would fixed-size 256-bit SVE be able to execute?
No. If the SVE hardware is configured as 512 bits (normally set by firmware or the OS at boot time), all SVE instructions will think the register size is 512 bits. For example, memory loads/stores are 512 bits wide. So the loop increment step of fixed-size 256-bit code will be wrong, and memory accesses may overflow the buffer.
Uh, so let's hope most implementations choose the same register size then :-(
Also, in your example, can we not pass the bitsize directly instead of __ARM_FEATURE_SVE_BITS?
Not sure I understand the question.
We can set the bit size directly; __ARM_FEATURE_SVE_BITS is just a number: 128, 256, ....
But again it must match the hardware.
On Mon, Jul 11, 2022 at 07:56:50AM -0700, Yibo Cai wrote:
What would be the constraints exactly? If running on a 512-bit SVE CPU, would fixed-size 256-bit SVE be able to execute?
No. If the SVE hardware is configured as 512 bits (normally set by firmware or the OS at boot time), all SVE instructions will think the register size is 512 bits. For example, memory loads/stores are 512 bits wide. So the loop increment step of fixed-size 256-bit code will be wrong, and memory accesses may overflow the buffer.
Can we deduce the register lane count at compile time, or is it purely a run-time value?
It's a pure runtime value, though it can be detected by running some tiny code.
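For example, a tiny probe like this (a sketch; built with SVE enabled but without -msve-vector-bits) reports the vector length the hardware/OS has configured:
#include <arm_sve.h>
#include <cstdio>

int main()
{
    // svcntb() returns the number of bytes per SVE vector at run time,
    // as configured by the hardware/firmware/OS.
    std::printf("SVE vector length: %llu bits\n",
                (unsigned long long)(svcntb() * 8));
    return 0;
}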
My question was whether one could write __attribute__((arm_sve_vector_bits(128))) directly without a specific compiler option. But apparently that doesn't work.
<source>:4:50: error: 'arm_sve_vector_bits' is only supported when '-msve-vector-bits=<bits>' is specified with a value of 128, 256, 512, 1024 or 2048.
typedef svint32_t svfixed_int32_t __attribute__((arm_sve_vector_bits(128)));
Also one can't set any other value than the one given on the command line. Does arm_sve_vector_bits then only serve as a compile-time check that the right SVE width was enabled?
<source>:4:50: error: invalid SVE vector size '128', must match value set by '-msve-vector-bits' ('256')
typedef svint32_t svfixed_int32_t __attribute__((arm_sve_vector_bits(128)));
I found that clang has a nice document about arm_sve_vector_bits.
https://github.com/llvm/llvm-project/blob/a45dd3d8140eab78a4554484c2b0435582ee262a/clang/include/clang/Basic/AttrDocs.td#L6204-L6227
Another issue is how to set up CI jobs to verify the SVE implementation. As we are already running aarch64 tests on x86 machines with qemu-user, it can also be used to verify SVE.
# cross build SVE enabled aarch64 binary
$ aarch64-linux-gnu-g++-10 -march=armv8-a+sve test.c
# test on simulated aarch64 machine with SVE vector size = 256 bits
$ qemu-aarch64 -cpu max,sve256=on -L /usr/aarch64-linux-gnu/ ./a.out
# test on simulated aarch64 machine with SVE vector size = 512 bits
$ qemu-aarch64 -cpu max,sve512=on -L /usr/aarch64-linux-gnu/ ./a.out
NOTES: SVE requires qemu-3+, but the current CI job runs on Ubuntu 18.04, which ships qemu-2.11. The SVE job may need Ubuntu 20.04 (qemu-4.2). For toolchains, gcc-10+ and clang-12+ are necessary to build SVE code. SVE2 requires qemu-6.0, but that can be left for the future.
Some updates. I've implemented the SVE functions. All unit tests passed. Verified both on real hardware (Neoverse N2) and the qemu emulator. Need to refine cmake and add a CI job.
Code at https://github.com/cyb70289/xsimd/commit/84e28d8d7f3297101eafcc2ab02a515e22164979
@cyb70289 Sounds interesting! AFAIU, your approach requires setting the bit width at compile time?
Yes, one must pass -msve-vector-bits=xxx, otherwise the SVE code is skipped.
I tested sve128/256/512.
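For reference, a hedged sketch of such a compile-time guard (macro names are illustrative, not necessarily the ones used in the branch):
// __ARM_FEATURE_SVE_BITS is defined (to 128, 256, ...) only when building with
// -msve-vector-bits=<bits>; without it, the SVE backend stays disabled.
#if defined(__ARM_FEATURE_SVE) && defined(__ARM_FEATURE_SVE_BITS) && __ARM_FEATURE_SVE_BITS > 0
#define XSIMD_WITH_SVE 1
#define XSIMD_SVE_BITS __ARM_FEATURE_SVE_BITS
#else
#define XSIMD_WITH_SVE 0
#define XSIMD_SVE_BITS 0
#endif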
I encountered a similar scenario while doing SIMD research work not long ago and want to share the following observations. I apologize beforehand for the verbose context.
The SVE spec supports vector operations for vector sizes ranging from 128 to 2048 bits in 128-bit increments (128 is the lowest power of 2 that is not a primitive type width on today's common 64-bit archs). SVE was made to support different archs with different vector lengths without requiring changes to your code, given that there is no right answer to the main question: "What is the right vector length for HPC and chip designs?".
Currently, I do not know of any processor that truly supports length-agnostic vector extensions without vector-length constraints. Also, not all Arm processors support the full SVE spec, so it will always boil down to the capabilities provided by the micro-architecture. The change that xsimd made to parameterize batch<T, A> by arch instead of vector length provides the flexibility to support SVE vector-length variants that are not possible with other vector extensions. For example, consider a vector length of 640 (= 512 + 128).
To answer the question at hand: how to support Arm SVE? The sensible approach would be exactly what @cyb70289 implemented in his branch: specializing implementations for the different SVE lengths of interest. This would keep xsimd's API consistent. The SVE length needs to be set at compile time and detected via macros. A shortcoming of this is that a specialization would need to be made for all 128:128:2048 vector lengths, but for now choosing the common vector lengths should suffice.
Other modern SIMD libraries also follow this approach: e.g., EVE.
I need to study xsimd code more carefully to understand better how it could support SVE (and similar) at runtime or without recompilation.
A quick glance at Google Highway, and if I am reading the code correctly, only SVE-128/256 have a constexpr Lanes(); for the other vector lengths it is not. This implies that, for those widths, no code that invokes Lanes() can be constexpr.
Also, it seems their implementation (by design) only supports vector lengths that are powers of 2.