Support for ARM SVE2.
Heavily based on Steve Suzuki's work here: https://github.com/halide/Halide/pull/6781. Hopefully this version is easier to merge, with less effect on existing ARM support and fewer constraints on CodeGen_LLVM.
I can't seem to name Steve Suzuki in the reviewers list. Apologies if I'm missing something.
Looks like a legitimate failure in correctness_vector_reductions on arm64.
Passing all tests. How close are we to landing?
Why is the IR sharing issue only affecting this SIMD op check test, not others?
The other simd_op_check tests just use Exprs and some dummy calls to functions that get substituted out with per-test-case local ImageParams, so there's no shared mutable IR. This test defines Funcs that are used by multiple test cases, and it's the sharing of those Funcs that caused the problem.
Though now that you say that, I guess we should move the deep-copying code I added into the base class so that this isn't an issue elsewhere later.
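For concreteness, here's a minimal sketch of the hazard; the Func `shared` is a hypothetical stand-in for the Funcs the SVE2 test defines once and reuses across test cases:

#include "Halide.h"

using namespace Halide;

int main() {
    // Stand-in for a Func the test harness defines once and reuses.
    Func shared("shared");
    Var x("x");
    shared(x) = x * 2;
    shared.compute_root();

    // Test case 1 consumes the shared Func and schedules it.
    Func g1("g1");
    g1(x) = shared(x) + 1;
    shared.vectorize(x, 4);
    g1.realize({64});

    // Test case 2 reuses the same Func object, so it inherits the schedule
    // test case 1 left behind and stacks another split on top of it, instead
    // of starting from pristine IR.
    Func g2("g2");
    g2(x) = shared(x) + 2;
    shared.vectorize(x, 8);
    g2.realize({64});

    return 0;
}

Deep-copying the shared Funcs at the start of each test case (or doing that in the base class, per the above) keeps each case working on pristine IR.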
Is this ready to land? Are there remaining issues? Should I run a global presubmit inside Google first?
It should be ready to land, pending validation that it doesn't break anything in existing ARM support, so please run a global presubmit. We can discuss correctness testing one-on-one. My first attempt to test this under QEMU failed completely, because FreeBSD on QEMU on macOS (M3) is an exercise in disk corruption.
Looks to be clean inside Google. Are we ready to land?
I still need to benchmark it on our workloads.
On average it makes no difference to our Neon workloads, though Neon codegen definitely changed in a bunch of places.
I did find one minor inefficiency while eyeballing the codegen, and it turned out to expose a pre-existing inefficiency on main. Consider this pipeline:
#include "Halide.h"

using namespace Halide;

int main() {
    Func f, g;
    Var x;

    f(x) = x / 17.0f;
    f.compute_root();

    // Round f up and down, then convert each result to int.
    g(x) = {cast<int>(ceil(f(x))), cast<int>(floor(f(x)))};
    g.vectorize(x, 4);

    g.compile_to_assembly("/dev/stdout", {}, "",
                          Target{"arm-64-linux-no_asserts-no_runtime-no_bounds_query"});
    return 0;
}
On both main and the branch, this generates the following inner loop:
ldr q0, [x9], #16
subs x8, x8, #1
frintp v1.4s, v0.4s
frintm v0.4s, v0.4s
fcvtzs v1.4s, v1.4s
fcvtzs v0.4s, v0.4s
str q1, [x10], #16
str q0, [x11], #16
It rounds up and rounds down as a float, then converts to an int. It should instead fold the rounding mode into the int conversions: fcvtps and fcvtms do the rounding and the conversion in a single instruction, making the frintp/frintm pair unnecessary.
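In case it's useful, here's a small standalone C++ sanity check (no Halide; the test values are arbitrary) of the equivalence those instructions rely on: a conversion performed with round-toward-minus-infinity matches floor-then-truncate for in-range values, which is the fcvtms case; the ceil/fcvtps case is symmetric.

#include <cassert>
#include <cfenv>
#include <cmath>

int main() {
    const float xs[] = {-2.5f, -1.0f, -0.5f, 0.0f, 0.5f, 17.3f};
    for (float x : xs) {
        // What the current codegen does: round to a float, then truncate.
        int via_floor = static_cast<int>(std::floor(x));
        // What fcvtms does: one conversion that rounds toward -infinity.
        std::fesetround(FE_DOWNWARD);
        int direct = static_cast<int>(std::nearbyint(x));
        std::fesetround(FE_TONEAREST);
        assert(via_floor == direct);
    }
    return 0;
}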
If you remove the .vectorize call, then on main we get:
ldr s0, [x8], #4
subs x19, x19, #1
fcvtps w9, s0
fcvtms w10, s0
str w9, [x20], #4
str w10, [x21], #4
and on the branch we get:
ldr s0, [x8], #4
subs x19, x19, #1
frintp s1, s0
frintm s0, s0
fcvtzs v1.2s, v1.2s
fcvtzs v0.2s, v0.2s
st1 { v1.s }[0], [x20], #4
st1 { v0.s }[0], [x21], #4
It looks like the branch goes down the vector path even in the scalar case, and hits the issue we already have with vector floor/ceil.
For the vector case, it looks like the LLVM IR is:
%22 = tail call <4 x float> @llvm.floor.v4f32(<4 x float> %18)
%23 = fptosi <4 x float> %22 to <4 x i32>
but we should instead be calling the intrinsic llvm.aarch64.neon.fcvtms.v4i32.v4f32 directly on the float input (and its fcvtps counterpart for the ceil case), collapsing the floor and the conversion into one instruction.
I'll open a PR onto this branch.
Oops, looks like I broke something. Will fix.
I can't seem to repro locally, and there have been reverts of AArch64 work in the LLVM commit log, so I'll just merge with main and see what happens.