stdarch NEON intrinsics are broken on big-endian

These are currently broken because the order of elements inside vectors is reversed on big-endian systems: the ARM ABI requires that element 0 is located at the highest address of the vector type. However LLVM intrinsics expect element 0 to be located at the lowest address.

See https://llvm.org/docs/BigEndianNEON.html and arm_neon.h in Clang for more details.

Oct 19 '23 15:10 Amanieu

the ARM ABI requires that element 0 is located at the highest address of the vector type. However LLVM intrinsics expect element 0 to be located at the lowest address.

What exactly does this mean? Is there a bug in LLVM? If so, where is it tracked?

Or is the problem that Rust stdarch wants to expose the intrinsics the way they work on hardware, but LLVM doesn't provide those semantics? If so, could that be fixed by doing appropriate translation of indices before calling the intrinsics?

Feb 13 '24 06:02 RalfJung

The short answer is that, on big-endian, LLVM portable vectors have a different element ordering than the one in the vector types used by the NEON intrinsics.

The C intrinsics work around this by reversing the element ordering in vectors before & after each intrinsic. We need to do the same in stdarch.

Feb 14 '24 01:02 Amanieu

Oh I see, so this is a mismatch about the simd_x intrinsics vs vendor-specific intrinsics? Okay makes sense.

OTOH this is good news for portable-simd, seems like there we'll be getting consistent behavior across platforms without extra work then.

Feb 14 '24 06:02 RalfJung

@he32

I currently appear unable to find the actual connecting tissue between library/stdarch/crates/core_arch/src/aarch64/neon/mod.rs and LLVM

The actual place that intrinsics themselves are handled is in two places: if it's an architecture-specific intrinsic, it uses link_llvm_intrinsics, which effectively specifies a lowering directly to LLVM textual IR. Otherwise, if it's one of rustc's "portable" intrinsics (simd_add and the like), the primary definition is in rustc_codegen_llvm: https://github.com/rust-lang/rust/blob/master/compiler/rustc_codegen_llvm/src/intrinsic.rs

Oct 07 '24 20:10 workingjubilee

As stated elsewhere, I reported the issue https://github.com/rust-lang/rust/issues/129819 earlier, and now have workarounds in place so that I'm able to produce a working rust compiler on big-endian NetBSD/aarch64 by avoiding attempts to use the NEON extensions in that mode.

However, since those extensions are available in the CPU, a better solution would be to fix this issue and then probably to revert the workarounds.

Since I'm a relative rust newbie, I have been thinking about what it would take to get some forward motion on that underlying issue. I have so far come to the conclusion that it would be helpful to have a test program (in rust, of course), which excercises / validates all the NEON SIMD extensions, runnable on 64-bit little-endian aarch64 system. Perhaps such a program already exists, and it's just a matter of pointing to it? My newbie status would make it difficult for me to come up with such a program, and I am hoping that it would be helpful in exploring a fix to this underlying issue. Since I do this in my copious spare time (as the expression goes), I can't make any firm commitments, but I think this will be a useful starting point for anyone wanting to tackle this issue properly.

The second worry I have is whether adding swizzling / byte-swapping of arguments and results before/after using NEON intrinsics will tend to negate the gains otherwise achieved by the NEON extensions compared to little-endian mode. I don't have a good intuition for that -- anyone have a better suggestion for that? Ideally, the test program could also do some measurement / validation of that? (Or would that be asking too much? It would not be required to act as the initial stepping stone, at least.)

Nov 10 '24 06:11 he32

The second worry I have is whether adding swizzling / byte-swapping of arguments and results before/after using NEON intrinsics will tend to negate the gains otherwise achieved by the NEON extensions compared to little-endian mode.

I assume LLVM has to insert its own byte swapping when lowering the portable SIMD operations on big-endian NEON... so hopefully the codegen backend is good enough to realize that the two swaps cancel each other out, and remove both of them? If not, that seems worth reporting as an LLVM bug.

have a test program (in rust, of course), which excercises / validates all the NEON SIMD extensions, runnable on 64-bit little-endian aarch64 system

The stdarch test suite should be able to serve that purpose. The tricky part probably is that we don't have a way to run it on CI. Miri can run some of it (when it only needs generic SIMD intrinsics), not sure if that is good enough to gain confidence for re-landing the intrinsics.

Nov 10 '24 09:11 RalfJung

I also found this gem in the LLVM docs mentioned above:

Make sure appropriate bitconverts are created so that vector values get passed over call boundaries as 1-element vectors (which is the same as if they were loaded with LDR).

Is that something the frontend has to do? That would mean we need a special case in our ABI handling code to use PassMode::Cast with a 1-element vector for all by-val vector passing on these targets. @workingjubilee is going to love this. ;)

Nov 10 '24 09:11 RalfJung

...whaaa? so a <16 x i8> becomes <1 x i128>?

Nov 10 '24 10:11 workingjubilee

That's how I understand this, yes.

Nov 10 '24 11:11 RalfJung

have a test program (in rust, of course), which excercises / validates all the NEON SIMD extensions, runnable on 64-bit little-endian aarch64 system

The stdarch test suite should be able to serve that purpose.

Hmm, then I need to go look there (pointer to directory?), and see if the NEON stuff is easily identifiable / isolateable.

The tricky part probably is that we don't have a way to run it on CI. Miri can run some of it (when it only needs generic SIMD intrinsics), not sure if that is good enough to gain confidence for re-landing the intrinsics.

I am "old school", so was thinking foremost of doing the development without any CI support. Getting CI in place for this would then be a separate issue to be tackled separately.

Nov 10 '24 13:11 he32

We shouldn't land anything without CI support, but ofc you can develop in whatever order suits you best. :)

Nov 10 '24 13:11 RalfJung

We shouldn't land anything without CI support, but ofc you can develop in whatever order suits you best. :)

I completely understand, and probably agree, and recall having seen hints which might help in that direction. We'll see. First things first.

Nov 10 '24 13:11 he32

The easiest way to support big-endian would be to migrate all NEON intrinsics to use the stdarch-gen code generation framework and then have that automatically insert the needed swizzles on big-endian. However this is a huge amount of work and not a trivial undertaking.

Another approach would be to adapt the new code generator used for SVE intrinsics in #1509 to also generate NEON intriniscs, which may be easier to work with than the current code generator. cc @JamieCunliffe

Nov 10 '24 13:11 Amanieu

The second worry I have is whether adding swizzling / byte-swapping of arguments and results before/after using NEON intrinsics will tend to negate the gains otherwise achieved by the NEON extensions compared to little-endian mode.

I assume LLVM has to insert its own byte swapping when lowering the portable SIMD operations on big-endian NEON... so hopefully the codegen backend is good enough to realize that the two swaps cancel each other out, and remove both of them? If not, that seems worth reporting as an LLVM bug.

Then I do not understand. As I understand it, doing swaps both pre- and post-SIMD operations are a necessary part of making the SIMD operations work as intended. They are therefore not "cancellable".

have a test program (in rust, of course), which excercises / validates all the NEON SIMD extensions, runnable on 64-bit little-endian aarch64 system

The stdarch test suite should be able to serve that purpose.

Sadly, this fails the "here is a concrete set of operations to test & validate" test. It's like saying to me "it is in there, somewhere, in the rust compiler sources -- you go figure out where and what by yourself".

Nov 10 '24 14:11 he32

Yeah the test suite is complicated I am afraid -- I was just giving you some pointers that hopefully lead into the right direction. That's all I can offer, sorry. I don't know how the stdarch test suite is set up. I'm afraid I don't think there is such a thing as a simple test program; stdarch offers many thousand operations and has some fancy setup to test them all.

Nov 10 '24 15:11 RalfJung

You can look at the intrinsic-test crate in this repo which checks that the intrinsics match the behavior of those same intrinsics in Clang. See our CI script for an example of how to run it.

Nov 10 '24 17:11 Amanieu

Yeah the test suite is complicated I am afraid -- I was just giving you some pointers that hopefully lead into the right direction. That's all I can offer, sorry. I don't know how the stdarch test suite is set up. I'm afraid I don't think there is such a thing as a simple test program; stdarch offers many thousand operations and has some fancy setup to test them all.

OK, I understand. I have found at least some of what needs to be looked at, and I'm trying to follow the various suggestions here. Next will need to be some experimentation etc. We'll see how that goes. Thanks anyway!

Nov 10 '24 19:11 he32

Did something recently happen in this space? Zerocopy's nightly CI started failing this past weekend due to SIMD intrinsic name resolution errors on aarch64_be-unknown-linux-gnu while building memchr. It doesn't look like memchr has changed; was rustc changed?

Nov 12 '24 16:11 jswrenn

@jswrenn That is because https://github.com/rust-lang/rust/pull/132714 bumped memchr without considering aarch64_be is not supported in the latest memchr (https://github.com/BurntSushi/memchr/pull/162).

(According to the author of that rust-lang/rust PR, the memchr bump might be able to be reverted: https://github.com/taiki-e/atomic-maybe-uninit/commit/6ea62ccd2664b8feb0200f6c4004d47d0e8446f3#commitcomment-148891566)

Nov 12 '24 16:11 taiki-e

FWIW I was going to look at getting this working once the new ARM generator PR has gone in

Dec 18 '24 15:12 Jamesbarford

@Jamesbarford Are you still intending to do this work? Just trying to get a sense of timeline so we know whether to wait for the fix or work around it on our end (in zerocopy - see e.g. this build failure).

Jan 27 '25 22:01 joshlf

@joshlf - this, amongst other things, is something I am currently working on. Timeline wise - I would hesitate to commit to anything concrete optimistically I hope fairly soon!

Jan 28 '25 13:01 Jamesbarford