sve: Add ARM SVE compile support
This commit adds code to support SVE+SVE2. However, since I don't have any real hardware available, it is mostly guesswork.
Is it really sensible to merge this then? No support would seem to be better than broken support; or am I wrong about that?
I wanted to look into support for these extensions because I expect them to be available in all ARMv9 CPUs. I converted the PR to draft. Maybe someone wants to pick up the draft and is able to test it. This might be a good start. It's also the reason why I shared the code in this state.
I have an SVE-capable ARM server available. I just did some simple testing on @jdemel's branch as of commit 63ca7096affce2cac815ec1c229d74f21de35e35.
This is the build output:
-- Available machines: generic;neon;neonv8;sve;sve2
-- BUILD TYPE = RELEASE
-- Base cflags = -O3 -DNDEBUG -fcx-limited-range -Wall -Werror=incompatible-pointer-types -Werror=pointer-sign
-- BUILD INFO ::: generic ::: GNU ::: -O3 -DNDEBUG -fcx-limited-range -Wall -Werror=incompatible-pointer-types -Werror=pointer-sign
-- BUILD INFO ::: neon ::: GNU ::: -O3 -DNDEBUG -fcx-limited-range -Wall -Werror=incompatible-pointer-types -Werror=pointer-sign -funsafe-math-optimizations
-- BUILD INFO ::: neonv8 ::: GNU ::: -O3 -DNDEBUG -fcx-limited-range -Wall -Werror=incompatible-pointer-types -Werror=pointer-sign -funsafe-math-optimizations -funsafe-math-optimizations
-- BUILD INFO ::: sve ::: GNU ::: -O3 -DNDEBUG -fcx-limited-range -Wall -Werror=incompatible-pointer-types -Werror=pointer-sign -funsafe-math-optimizations -funsafe-math-optimizations -march=armv8-a+sve
-- BUILD INFO ::: sve2 ::: GNU ::: -O3 -DNDEBUG -fcx-limited-range -Wall -Werror=incompatible-pointer-types -Werror=pointer-sign -funsafe-math-optimizations -funsafe-math-optimizations -march=armv8-a+sve -march=armv8-a+sve2
The compiler did some autovectorization: I found SVE instructions in /build/lib/libvolk.so.3.1.2, inside the generic kernels. Some snippets:
e2280: d37ff862 lsl x2, x3, #1
e2284: a422c1c0 ld2b {z0.b-z1.b}, p0/z, [x14, x2]
e2288: e40341e0 st1b {z0.b}, p0, [x15, x3]
e228c: 0430e3e3 incb x3
e2290: 25260c60 whilelo p0.b, w3, w6
e2294: 54ffff61 b.ne e2280 <volk_32f_8u_polarbutterflypuppet_32f_generic+0x1a90>
b9b40: 6594a800 scvtf z0.s, p2/m, z0.s
b9b44: 65b50482 fmla z2.s, p1/m, z4.s, z21.s
b9b48: 65b20001 fmla z1.s, p0/m, z0.s, z18.s
b9b4c: 65b30002 fmla z2.s, p0/m, z0.s, z19.s
b9b50: 8b070042 add x2, x2, x7
b9b54: 25631ca0 whilelo p0.h, x5, x3
b9b58: 54fffe01 b.ne b9b18 <volk_16i_32fc_dot_prod_32fc_generic+0x318>
make test reports: 100% tests passed, 0 tests failed out of 148
I have access to an ARM server with SVE support (AWS Graviton3), and have been searching for a suitable project for this ARM Developer lab project.
I would like to take a stab at adding support for SVE. I'm looking to use this to learn about both ARM SVE and performance engineering.
Given that it is SIMD just like ARM NEON, I can use reference commit 789fb4d800c1ca738bfcb5a2e76ff4b963df6e49 and this paper written by Nathan West to see how I can add support for SVE. There's also a learning path by ARM on how to port ARM NEON to SVE.
May I have your support to try? As I'm somewhat inexperienced, I would like feedback and guidance from you along the way, but I believe I can work mostly independently.
@wjsota your comments slipped through. Thank you very much. If you're still interested, I'd suggest using the code here, adding an implementation for something simple, like a multiplication, and opening a new PR. That'd be something we can discuss.
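To make the "something simple, like a multiplication" suggestion concrete, here is a minimal sketch of an element-wise float multiply written with the ACLE SVE intrinsics from arm_sve.h. The function name multiply_32f_sketch is hypothetical and does not follow VOLK's kernel naming or dispatch conventions; the scalar fallback is only there so the snippet also compiles on machines without SVE. It uses the same predicated-loop pattern (whilelo plus predicated loads and stores) that shows up in the autovectorized disassembly above.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper, not part of VOLK: c[i] = a[i] * b[i] for i in [0, n). */
void multiply_32f_sketch(float* c, const float* a, const float* b, size_t n)
{
#if defined(__ARM_FEATURE_SVE)
    /* svcntw() is the number of 32-bit lanes per vector. It is only known
     * at run time, because SVE does not fix the vector length. */
    for (uint64_t i = 0; i < (uint64_t)n; i += svcntw()) {
        /* Lanes with i + lane < n are active, so the final partial vector
         * needs no separate scalar tail loop. */
        svbool_t pg = svwhilelt_b32_u64(i, (uint64_t)n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, c + i, svmul_f32_x(pg, va, vb));
    }
#else
    /* Scalar fallback so the sketch builds without SVE support. */
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] * b[i];
    }
#endif
}
```

On an SVE machine this would presumably be compiled with something like the -march=armv8-a+sve flag shown in the build output above; without that flag the scalar branch is used.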