Support for ARM SVE2.
Heavily based on Steve Suzuki's work here: https://github.com/halide/Halide/pull/6781. Hopefully this version is easier to merge, with less effect on existing ARM support and fewer constraints on CodeGen_LLVM.
I can't seem to name Steve Suzuki in the reviewers list. Apologies if I'm missing something.
Looks like a legitimate failure in correctness_vector_reductions on arm64.
Passing all tests. How close are we to landing?
Why is the IR sharing issue only affecting this SIMD op check test, not others?
The other simd_op_check tests just use Exprs and some dummy calls to functions that get substituted out with per-test-case local ImageParams, so there's no shared mutable IR. This test defines Funcs that are used by multiple test cases, and it's the sharing of those Funcs that caused the problem.
Though now that you say that, I guess we should move the deep-copying code I added into the base class so that this isn't an issue elsewhere later.
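For concreteness, here's a minimal sketch of the hazard; the Func `shared` is a hypothetical stand-in for the Funcs the SVE2 test defines once and reuses across test cases:

#include "Halide.h"

using namespace Halide;

int main() {
    // Stand-in for a Func the test harness defines once and reuses.
    Func shared("shared");
    Var x("x");
    shared(x) = x * 2;
    shared.compute_root();

    // Test case 1 consumes the shared Func and schedules it.
    Func g1("g1");
    g1(x) = shared(x) + 1;
    shared.vectorize(x, 4);
    g1.realize({64});

    // Test case 2 reuses the same Func object, so it inherits the schedule
    // test case 1 left behind and stacks another split on top of it, instead
    // of starting from pristine IR.
    Func g2("g2");
    g2(x) = shared(x) + 2;
    shared.vectorize(x, 8);
    g2.realize({64});

    return 0;
}

Deep-copying the shared Funcs at the start of each test case (or doing that in the base class, per the above) keeps each case working on pristine IR.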
Is this ready to land? Are there remaining issues? Should I run a global presubmit inside Google first?
It should be ready to land, pending validation that it doesn't break anything in existing ARM support, so please run a global presubmit. We can discuss correctness testing one-on-one. My first attempt to test this under QEMU failed completely, because FreeBSD on QEMU on macOS (M3) is an exercise in disk corruption.
Looks to be clean inside Google. Are we ready to land?
I still need to benchmark it on our workloads.
On average it makes no difference to our Neon workloads, though Neon codegen definitely changed in a bunch of places.
I did find one minor inefficiency while eyeballing the codegen, and it turned out to expose a pre-existing inefficiency on main. Consider this pipeline:
#include "Halide.h"

using namespace Halide;

int main() {
    Func f, g;
    Var x;

    f(x) = x / 17.0f;
    f.compute_root();

    // Round f up and down, then convert each result to int.
    g(x) = {cast<int>(ceil(f(x))), cast<int>(floor(f(x)))};
    g.vectorize(x, 4);

    g.compile_to_assembly("/dev/stdout", {}, "",
                          Target{"arm-64-linux-no_asserts-no_runtime-no_bounds_query"});
    return 0;
}
On both main and the branch, this generates the following inner loop:
ldr q0, [x9], #16
subs x8, x8, #1
frintp v1.4s, v0.4s
frintm v0.4s, v0.4s
fcvtzs v1.4s, v1.4s
fcvtzs v0.4s, v0.4s
str q1, [x10], #16
str q0, [x11], #16
It rounds up and rounds down as a float, then converts to an int. It should instead fold the rounding mode into the int conversions: fcvtps and fcvtms do the rounding and the conversion in a single instruction, making the frintp/frintm pair unnecessary.
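In case it's useful, here's a small standalone C++ sanity check (no Halide; the test values are arbitrary) of the equivalence those instructions rely on: a conversion performed with round-toward-minus-infinity matches floor-then-truncate for in-range values, which is the fcvtms case; the ceil/fcvtps case is symmetric.

#include <cassert>
#include <cfenv>
#include <cmath>

int main() {
    const float xs[] = {-2.5f, -1.0f, -0.5f, 0.0f, 0.5f, 17.3f};
    for (float x : xs) {
        // What the current codegen does: round to a float, then truncate.
        int via_floor = static_cast<int>(std::floor(x));
        // What fcvtms does: one conversion that rounds toward -infinity.
        std::fesetround(FE_DOWNWARD);
        int direct = static_cast<int>(std::nearbyint(x));
        std::fesetround(FE_TONEAREST);
        assert(via_floor == direct);
    }
    return 0;
}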
If you remove the .vectorize call, then on main we get:
ldr s0, [x8], #4
subs x19, x19, #1
fcvtps w9, s0
fcvtms w10, s0
str w9, [x20], #4
str w10, [x21], #4
and on the branch we get:
ldr s0, [x8], #4
subs x19, x19, #1
frintp s1, s0
frintm s0, s0
fcvtzs v1.2s, v1.2s
fcvtzs v0.2s, v0.2s
st1 { v1.s }[0], [x20], #4
st1 { v0.s }[0], [x21], #4
It looks like the branch goes down the vector path even in the scalar case, and hits the issue we already have with vector floor/ceil.
For the vector case, it looks like the LLVM IR is:
%22 = tail call <4 x float> @llvm.floor.v4f32(<4 x float> %18)
%23 = fptosi <4 x float> %22 to <4 x i32>
but we should instead be calling the intrinsic llvm.aarch64.neon.fcvtms.v4i32.v4f32 directly on the float input (and its fcvtps counterpart for the ceil case), collapsing the floor and the conversion into one instruction.
I'll open a PR onto this branch.
Oops, looks like I broke something. Will fix.
I can't seem to repro locally, and there have been reverts of AArch64 work in the LLVM commit log, so I'll just merge with main and see what happens.