
NEON backend for aarch64

Open rubdos opened this issue 3 years ago • 21 comments

As promised in #449! This is joint work with @Tarinn. In fact, @Tarinn did most of the work :-)

Fixes #147 for Aarch64, and you'll see an ARMv7 PR coming in after this.

I'm working on benchmarks across several devices now (including an unpublished, very hacky ARMv7 version) and will update here when they become available. We're seeing speedups of 20-30% in the relevant benchmarks. This is the same code as in https://github.com/zkcrypto/curve25519-dalek-ng/pull/19.

I'll be cleaning up the commits in the next few minutes (trailing white space fixes, mostly), and I'll run a comparison benchmark for 343be3a, because that was only introduced for future ARMv7 support, and has not been tested under load.

TODO:

  • [x] Cleanup whitespace and commits
  • [x] Test shuffle! call vs vqtbx1q_u8 performance
  • [x] Add Aarch64 cross build to CI
  • [ ] Figure out how-the-💥 we get rid of the ARMv8 TBL/TBX issue
  • [ ] Add NEON documentation to the README; explain that ARMv7 is not yet supported

rubdos avatar Dec 08 '22 12:12 rubdos

The shuffle! macro use in 26c494f seems to win 1.5% to 3% over the use of the vqtbx1q_u8 intrinsic on a Raspberry Pi 4. Seems like the LLVM compiler people know what they're doing.
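
For reference, a rough sketch of the two approaches being compared, on a fixed lane permutation (hypothetical names and index pattern, nightly core::simd, not the PR's code):

#![feature(portable_simd)]
use core::arch::aarch64::{
    uint32x4_t, vld1q_u8, vqtbx1q_u8, vreinterpretq_u32_u8, vreinterpretq_u8_u32,
};
use core::simd::{simd_swizzle, u32x4};

// Compile-time swizzle: LLVM is free to pick the cheapest instruction sequence.
fn swap_pairs_swizzle(v: u32x4) -> u32x4 {
    simd_swizzle!(v, [1, 0, 3, 2])
}

// Explicit table lookup via vqtbx1q_u8 (TBL/TBX), indexing the bytes of the same input.
unsafe fn swap_pairs_tbl(v: uint32x4_t) -> uint32x4_t {
    let idx = [4u8, 5, 6, 7, 0, 1, 2, 3, 12, 13, 14, 15, 8, 9, 10, 11];
    let bytes = vreinterpretq_u8_u32(v);
    // All indices are in range, so the fallback operand of TBX is never selected.
    vreinterpretq_u32_u8(vqtbx1q_u8(bytes, bytes, vld1q_u8(idx.as_ptr())))
}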

I'm testing another point where we can use the shuffle! macro instead of some manual lane swapping, will report back.

rubdos avatar Dec 08 '22 14:12 rubdos

2d343e9 does not have an influence on performance, but the code is quite a bit cleaner, so let's leave it in.

rubdos avatar Dec 08 '22 15:12 rubdos

@rubdos we're working off the releases/4.0 branch for development, so you'll definitely want to rebase this

tarcieri avatar Dec 08 '22 19:12 tarcieri

Don't forget to make sure it's still constant-time! I don't actually know what that entails; it is not something I've looked into before. But a good amount of effort was put into the initial version of this crate using constant-time operations, and new backends should adhere to that too.

jrose-signal avatar Dec 12 '22 18:12 jrose-signal

@jrose-signal Indeed, that's pretty important. As far as @Tarinn and I know, this code should be constant time. It is a direct port of the AVX2 work, and there should be no calls with secret-dependent timing in there. The sketchiest call I could see is shuffle!, but it does not take secret-dependent indices.
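
To illustrate the principle (a generic sketch, not this crate's code): "constant time" here means selection and data movement are done with data-independent masks or fixed shuffle patterns rather than secret-dependent branches or indices. For example:

fn select_branchy(choice: bool, a: u32, b: u32) -> u32 {
    // May compile to a branch; not something to rely on when `choice` is secret.
    if choice { a } else { b }
}

fn select_masked(choice: bool, a: u32, b: u32) -> u32 {
    // Mask-based select: the same instructions run regardless of `choice`.
    let mask = (choice as u32).wrapping_neg(); // 0xFFFF_FFFF if true, 0 if false
    (a & mask) | (b & !mask)
}

fn main() {
    assert_eq!(select_masked(true, 1, 2), 1);
    assert_eq!(select_masked(false, 1, 2), 2);
}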

I'm not sure how this is generally done. Should we ask some independent reviewer to have a look at the code?

rubdos avatar Dec 13 '22 14:12 rubdos

fyi - we're continuing all development in main -

Could this be rebased to main, which is basically from release/4.0 - Thanks!

Also, do we have any benchmarks comparing the u64, neon and fiat backends? E.g. a table like this would be great: https://github.com/pinkforest/ed25519-dalek/tree/docs-docsrs

For constant time: e.g. subtle's CtOption etc.
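
For reference, a minimal example of the subtle primitives in question, Choice and ConditionallySelectable (CtOption layers the same idea over optional values). Illustrative only, not code from this PR:

use subtle::{Choice, ConditionallySelectable};

// Constant-time selection: the same code path runs whether `c` is 0 or 1.
fn pick(a: u64, b: u64, c: Choice) -> u64 {
    u64::conditional_select(&a, &b, c)
}

fn main() {
    assert_eq!(pick(1, 2, Choice::from(0)), 1);
    assert_eq!(pick(1, 2, Choice::from(1)), 2);
}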

pinkforest avatar Dec 13 '22 14:12 pinkforest

fyi - we're continuing all development in main -

Could this be rebased to main, which is basically from release/4.0 - Thanks!

Of course - Just to make sure: is that going to remain? I did the rebase four days ago on release/4.0.

Do we have any benchmarks comparing the u64, neon and fiat backends? E.g. a table like this would be great: https://github.com/pinkforest/ed25519-dalek/tree/docs-docsrs

I'm working on this across a bunch of devices (including armv7). I'd like these results to appear in a small publication, but I'll certainly dump a summary in the README when I have them. I should have something by the end of the week.

rubdos avatar Dec 13 '22 14:12 rubdos

Of course - Just to make sure: is that going to remain? I did the rebase four days ago on release/4.0.

Yeah this is going to stay in main

  • #466

I'm working on this across a bunch of devices

Awesome! :partying_face: Would love to see a link to the write-up in This-Week-in-Rust (TWiR) if you can share it there too: https://github.com/rust-lang/this-week-in-rust/

pinkforest avatar Dec 13 '22 14:12 pinkforest

Awesome! :partying_face: Would love to see a link to the write-up in This-Week-in-Rust (TWiR) if you can share it there too: https://github.com/rust-lang/this-week-in-rust/

After publication, I'll make sure to make a huge amount of noise about it. That should happen on pretty short notice; I have some very strict deadlines for it now. Just to clarify: the deadline exists because this work is complete; there was no pressure on completing the work that could have affected its quality.

rubdos avatar Dec 13 '22 14:12 rubdos

Instead of flooding the thread with changes, here are some .patches:

src/backend/vector/neon/field.rs

diff --git a/src/backend/vector/neon/field.rs b/src/backend/vector/neon/field.rs
index 5cf57ee..85630cd 100644
--- a/src/backend/vector/neon/field.rs
+++ b/src/backend/vector/neon/field.rs
@@ -21,7 +21,7 @@
 //! arm instructions.
 
 use core::ops::{Add, Mul, Neg};
-use packed_simd::{u32x4, u32x2, i32x4, u8x16, u64x4, u64x2, IntoBits};
+use packed_simd::{u32x4, u32x2, i32x4, u64x4, u64x2, IntoBits};
 
 use crate::backend::vector::neon::constants::{P_TIMES_16_HI, P_TIMES_16_LO, P_TIMES_2_HI, P_TIMES_2_LO};
 use crate::backend::serial::u64::field::FieldElement51;
@@ -141,7 +141,7 @@ impl ConditionallySelectable for FieldElement2625x4 {
 impl FieldElement2625x4 {
 
     pub fn split(&self) -> [FieldElement51; 4] {
-        let mut out = [FieldElement51::zero(); 4];
+        let mut out = [FieldElement51::ZERO; 4];
         for i in 0..5 {
             let a_2i   = self.0[i].0.extract(0) as u64;
             let b_2i   = self.0[i].0.extract(1) as u64;
@@ -367,7 +367,9 @@ impl FieldElement2625x4 {
     #[inline]
     fn reduce64(mut z: [(u64x2, u64x2); 10]) -> FieldElement2625x4 {
 
+        #[allow(non_snake_case)]
         let LOW_25_BITS: u64x2 = u64x2::splat((1 << 25) - 1);
+        #[allow(non_snake_case)]
         let LOW_26_BITS: u64x2 = u64x2::splat((1 << 26) - 1);
 
         let carry = |z: &mut [(u64x2, u64x2); 10], i: usize| {
@@ -419,6 +421,7 @@ impl FieldElement2625x4 {
         ])
     }
 
+    #[allow(non_snake_case)]
     pub fn square_and_negate_D(&self) -> FieldElement2625x4 {
         #[inline(always)]
         fn m(x: (u32x2, u32x2), y: (u32x2, u32x2)) -> u64x4 {
@@ -463,16 +466,16 @@ impl FieldElement2625x4 {
         let x8_19  = m_lo(v19, x8);
         let x9_19  = m_lo(v19, x9);
 
-        let mut z0 = m(x0,  x0) + m(x2_2,x8_19) + m(x4_2,x6_19) + ((m(x1_2,x9_19) +  m(x3_2,x7_19) +    m(x5,x5_19)) << 1);
-        let mut z1 = m(x0_2,x1) + m(x3_2,x8_19) + m(x5_2,x6_19) +                  ((m(x2,x9_19)   +    m(x4,x7_19)) << 1);
-        let mut z2 = m(x0_2,x2) + m(x1_2,x1)    + m(x4_2,x8_19) + m(x6,x6_19)    + ((m(x3_2,x9_19) +  m(x5_2,x7_19)) << 1);
-        let mut z3 = m(x0_2,x3) + m(x1_2,x2)    + m(x5_2,x8_19) +                  ((m(x4,x9_19)   +    m(x6,x7_19)) << 1);
-        let mut z4 = m(x0_2,x4) + m(x1_2,x3_2)  + m(x2,  x2)    + m(x6_2,x8_19)  + ((m(x5_2,x9_19) +    m(x7,x7_19)) << 1);
-        let mut z5 = m(x0_2,x5) + m(x1_2,x4)    + m(x2_2,x3)    + m(x7_2,x8_19)                    +  ((m(x6,x9_19)) << 1);
-        let mut z6 = m(x0_2,x6) + m(x1_2,x5_2)  + m(x2_2,x4)    + m(x3_2,x3) + m(x8,x8_19)        + ((m(x7_2,x9_19)) << 1);
-        let mut z7 = m(x0_2,x7) + m(x1_2,x6)    + m(x2_2,x5)    + m(x3_2,x4)                      +   ((m(x8,x9_19)) << 1);
-        let mut z8 = m(x0_2,x8) + m(x1_2,x7_2)  + m(x2_2,x6)    + m(x3_2,x5_2) + m(x4,x4)         +   ((m(x9,x9_19)) << 1);
-        let mut z9 = m(x0_2,x9) + m(x1_2,x8)    + m(x2_2,x7)    + m(x3_2,x6) + m(x4_2,x5);
+        let z0 = m(x0,  x0) + m(x2_2,x8_19) + m(x4_2,x6_19) + ((m(x1_2,x9_19) +  m(x3_2,x7_19) +    m(x5,x5_19)) << 1);
+        let z1 = m(x0_2,x1) + m(x3_2,x8_19) + m(x5_2,x6_19) +                  ((m(x2,x9_19)   +    m(x4,x7_19)) << 1);
+        let z2 = m(x0_2,x2) + m(x1_2,x1)    + m(x4_2,x8_19) + m(x6,x6_19)    + ((m(x3_2,x9_19) +  m(x5_2,x7_19)) << 1);
+        let z3 = m(x0_2,x3) + m(x1_2,x2)    + m(x5_2,x8_19) +                  ((m(x4,x9_19)   +    m(x6,x7_19)) << 1);
+        let z4 = m(x0_2,x4) + m(x1_2,x3_2)  + m(x2,  x2)    + m(x6_2,x8_19)  + ((m(x5_2,x9_19) +    m(x7,x7_19)) << 1);
+        let z5 = m(x0_2,x5) + m(x1_2,x4)    + m(x2_2,x3)    + m(x7_2,x8_19)                    +  ((m(x6,x9_19)) << 1);
+        let z6 = m(x0_2,x6) + m(x1_2,x5_2)  + m(x2_2,x4)    + m(x3_2,x3) + m(x8,x8_19)        + ((m(x7_2,x9_19)) << 1);
+        let z7 = m(x0_2,x7) + m(x1_2,x6)    + m(x2_2,x5)    + m(x3_2,x4)                      +   ((m(x8,x9_19)) << 1);
+        let z8 = m(x0_2,x8) + m(x1_2,x7_2)  + m(x2_2,x6)    + m(x3_2,x5_2) + m(x4,x4)         +   ((m(x9,x9_19)) << 1);
+        let z9 = m(x0_2,x9) + m(x1_2,x8)    + m(x2_2,x7)    + m(x3_2,x6) + m(x4_2,x5);
 
 
         let low__p37 = u64x4::splat(0x3ffffed << 37);
@@ -675,7 +678,7 @@ mod test {
 
     #[test]
     fn scale_by_curve_constants() {
-        let mut x = FieldElement2625x4::splat(&FieldElement51::one());
+        let mut x = FieldElement2625x4::splat(&FieldElement51::ONE);
 
         x = x * (121666, 121666, 2*121666, 2*121665);
 

src/backend/vector/neon/edwards.rs

diff --git a/src/backend/vector/neon/edwards.rs b/src/backend/vector/neon/edwards.rs
index 8073c3a..357d4e3 100644
--- a/src/backend/vector/neon/edwards.rs
+++ b/src/backend/vector/neon/edwards.rs
@@ -321,14 +321,14 @@ mod test {
     use super::*;
 
     fn serial_add(P: edwards::EdwardsPoint, Q: edwards::EdwardsPoint) -> edwards::EdwardsPoint {
-        use backend::serial::u64::field::FieldElement51;
+        use crate::backend::serial::u64::field::FieldElement51;
 
         let (X1, Y1, Z1, T1) = (P.X, P.Y, P.Z, P.T);
         let (X2, Y2, Z2, T2) = (Q.X, Q.Y, Q.Z, Q.T);
 
         macro_rules! print_var {
             ($x:ident) => {
-                println!("{} = {:?}", stringify!($x), $x.to_bytes());
+                println!("{} = {:?}", stringify!($x), $x.as_bytes());
             };
         }
 
@@ -411,8 +411,8 @@ mod test {
 
     #[test]
     fn vector_addition_vs_serial_addition_vs_edwards_extendedpoint() {
-        use constants;
-        use scalar::Scalar;
+        use crate::constants;
+        use crate::scalar::Scalar;
 
         println!("Testing id +- id");
         let P = edwards::EdwardsPoint::identity();
@@ -440,7 +440,7 @@ mod test {
 
         macro_rules! print_var {
             ($x:ident) => {
-                println!("{} = {:?}", stringify!($x), $x.to_bytes());
+                println!("{} = {:?}", stringify!($x), $x.as_bytes());
             };
         }
 
@@ -498,8 +498,8 @@ mod test {
 
     #[test]
     fn vector_doubling_vs_serial_doubling_vs_edwards_extendedpoint() {
-        use constants;
-        use scalar::Scalar;
+        use crate::constants;
+        use crate::scalar::Scalar;
 
         println!("Testing [2]id");
         let P = edwards::EdwardsPoint::identity();
@@ -516,8 +516,8 @@ mod test {
 
     #[test]
     fn basepoint_odd_lookup_table_verify() {
-        use constants;
-        use backend::vector::neon::constants::{BASEPOINT_ODD_LOOKUP_TABLE};
+        use crate::constants;
+        use crate::backend::vector::neon::constants::{BASEPOINT_ODD_LOOKUP_TABLE};
 
         let basepoint_odd_table = NafLookupTable8::<CachedPoint>::from(&constants::ED25519_BASEPOINT_POINT);
         println!("Testing basepoint table");

clippy is still whining but I'll send a .patch later - we are using the 2021 edition

$ env RUSTFLAGS='-C target_feature=+neon --cfg curve25519_dalek_backend="simd"' cargo +nightly build --features rand_core

$ env RUSTFLAGS='-C target_feature=+neon --cfg curve25519_dalek_backend="simd"' cargo +nightly clippy --features rand_core

$ env RUSTFLAGS='-C target_feature=+neon --cfg curve25519_dalek_backend="simd"' cargo +nightly bench --features rand_core

pinkforest avatar Dec 13 '22 18:12 pinkforest

I've done the rebase together with your patches, and squashed the NEON work into the first commit. I had not tested the rebase on 4.0, sorry about that. I'll do that now.

rubdos avatar Dec 14 '22 09:12 rubdos

Working on the rustfmt attributes now.

rubdos avatar Dec 14 '22 09:12 rubdos

started a benchmark repo here -

I ran some benchmarks earlier (will re-run now) on a Mac M1 Max (aarch64), comparing between backends, FWIW

I'm going to re-organise it to cover a wider range of benchmarks across various implementations, not just dalek's

pinkforest avatar Dec 14 '22 09:12 pinkforest

I'll see whether I'll have all the data in that format. I currently only retain the raw outputs from Criterion for further automatic processing for our paper.

rubdos avatar Dec 14 '22 10:12 rubdos

This took a bit longer than anticipated. We got a publication out, so academia-wise we're all happy now. Quite some time passed between writing that paper and getting it published somewhere, but I'll try to sum up the current remaining issues with this branch in this post.

Results are especially spectacular on ARMv7/32-bit ARM CPUs (~50% speedups). Results are still great on ARMv8/aarch64 (~20%) up until roughly ARMv8.2: from there on, when running in A64 mode, we see slow-downs relative to the serial baseline. This is because ARM decided to nerf the TBL/TBX instructions relative to previous versions: the throughput remained the same, but the total latency of the instruction is now a function of the input size.

[image: excerpt from the A64FX microarchitecture manual showing TBL/TBX latency as a function of input size]

(shamelessly stolen from https://www.stonybrook.edu/commcms/ookami/support/_docs/A64FX_Microarchitecture_Manual_en_1.3.pdf)

We'll need to dig into the generated assembly a bit to figure out a way around this. The TBL/TBX instructions come from the various vqtbx1q_u8 calls in FieldElement2625x4::blend. @Tarinn made a note that we can probably use a combination of vset/vget to make blend faster, but we don't currently have the hardware to reproduce this.
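
As a very rough sketch of that idea (illustrative only, not this branch's code; the real blend works on pairs of interleaved u32x4 vectors, so the actual pattern is more involved): a blend that keeps the low half of one register and takes the high half of another can be expressed with MOV/DUP-class moves instead of a byte-indexed table lookup:

use core::arch::aarch64::{uint32x4_t, vcombine_u32, vget_high_u32, vget_low_u32};

// Keep the low two u32 lanes of `a` and take the high two from `b`,
// without going through TBL/TBX.
unsafe fn blend_low_a_high_b(a: uint32x4_t, b: uint32x4_t) -> uint32x4_t {
    vcombine_u32(vget_low_u32(a), vget_high_u32(b))
}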

We sent some executables to @direc85 over in Finland to get the benchmark run on an Xperia 10 III, which is how we noticed the issue. That's not really a tight debuggable loop. I should soon™ have an Xperia 10 IV to close the loop a bit.

rubdos avatar Feb 26 '24 10:02 rubdos

(that's a dysfunctional rebase; we still have to reintegrate with the new backend selection)

rubdos avatar Feb 28 '24 12:02 rubdos

Awesome -

For CI there is already ARM coverage via cross - .github/workflows/cross.yml#L19

We also have simd/x86_64-AVX related tests in .github/workflows/curve25519-dalek.yml#L118

But I think we could split the backend tests into a separate file, backends.yml, for the sake of clarity and discoverability - qemu should support NEON at this level.

I'm re-working the build script, so I could include a placeholder for this separately there so that it doesn't cause a conflict.

  • https://github.com/dalek-cryptography/curve25519-dalek/pull/636

Also nightly CI is currently failing - I've fixed it here:

  • https://github.com/dalek-cryptography/curve25519-dalek/pull/638

pinkforest avatar Feb 28 '24 12:02 pinkforest

Today I idly tested this on my M1 Mac and the current state of the branch is a regression from the u64 backend. Changing the implementation of blend to use core::simd::Mask::select (with constant masks) got it back to about par with the u64 backend. I wasn't being super scientific with this (I was doing other things while the benchmark ran, for example, though nothing CPU-intensive), but it was a big enough gap to be indicative. There's probably something better than Mask, but just confirming that the TBL slowdown is serious. :sob:
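
Roughly what such a change could look like (a hypothetical sketch on nightly core::simd with a made-up constant mask; not the exact patch that was benchmarked):

#![feature(portable_simd)]
use core::simd::{u32x4, Mask};

// Take lanes 0 and 1 from `a`, lanes 2 and 3 from `b`. The mask is a compile-time
// constant and carries no secret data, so this stays constant time.
fn blend_ab(a: u32x4, b: u32x4) -> u32x4 {
    let keep_a = Mask::<i32, 4>::from_array([true, true, false, false]);
    keep_a.select(a, b)
}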

jrose-signal avatar Mar 22 '24 19:03 jrose-signal

There's probably something better than Mask, but just confirming that the TBL slowdown is serious. 😭

Yep, it's quite devastating indeed... If we've got news on that front, we'll post our findings as soon as we have them. Thanks for testing it out! :heart:

rubdos avatar Mar 23 '24 07:03 rubdos

@Tarinn made some progress by now on the aarch64 problem, but we're not there yet.

We thought we would have some speedup by working around llvm/llvm-project/issues/58323 (some vget intrinsics don't get correctly lowered to DUP/MOV instructions), but manually throwing in the assembly didn't give a measurable improvement on our Odroid N2+.

The manual injection of assembly for `vget_high_u32`/`vget_low_u32`
diff --git a/curve25519-dalek/src/backend/vector/neon/field.rs b/curve25519-dalek/src/backend/vector/neon/field.rs
index 2f8d42c..cccde8c 100644
--- a/curve25519-dalek/src/backend/vector/neon/field.rs
+++ b/curve25519-dalek/src/backend/vector/neon/field.rs
@@ -23,20 +23,57 @@
 use core::arch::aarch64::{self, vuzp1_u32};
 use core::ops::{Add, Mul, Neg};
 
-use super::packed_simd::{u32x2, u32x4, i32x4, u64x2, u64x4};
+use super::packed_simd::{i32x4, u32x2, u32x4, u64x2, u64x4};
 use crate::backend::serial::u64::field::FieldElement51;
 use crate::backend::vector::neon::constants::{
     P_TIMES_16_HI, P_TIMES_16_LO, P_TIMES_2_HI, P_TIMES_2_LO,
 };
 
+#[cfg(all(target_arch = "aarch64"))]
+#[inline(always)]
+fn vget_high_u32(v: core::arch::aarch64::uint32x4_t) -> core::arch::aarch64::uint32x2_t {
+    use core::arch::asm;
+    let o;
+    unsafe {
+        asm! (
+            "DUP {o:d}, {v}.D[1]",
+            v = in(vreg) v,
+            o = out(vreg) o,
+        )
+    }
+    o
+}
+
+#[cfg(all(target_arch = "aarch64"))]
+#[inline(always)]
+fn vget_low_u32(v: core::arch::aarch64::uint32x4_t) -> core::arch::aarch64::uint32x2_t {
+    use core::arch::asm;
+    let o;
+    unsafe {
+        asm! (
+            "DUP {o:d}, {v}.D[0]",
+            v = in(vreg) v,
+            o = out(vreg) o,
+        )
+    }
+    o
+}
+#[cfg(not(target_arch = "aarch64"))]
+use core::arch::aarch64::vget_high_u32;
+#[cfg(not(target_arch = "aarch64"))]
+use core::arch::aarch64::vget_low_u32;
+
 macro_rules! shuffle {
     ($vec0:expr, $vec1:expr, $index:expr) => {
         unsafe {
             core::mem::transmute::<[u32; 4], u32x4>(
                 *core::simd::simd_swizzle!(
-                    core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec0)), 
-                    core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec1)), 
-                    $index).as_array())
+                    core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec0)),
+                    core::simd::Simd::from_array(core::mem::transmute::<u32x4, [u32; 4]>($vec1)),
+                    $index
+                )
+                .as_array(),
+            )
         }
     };
 }
@@ -53,8 +90,6 @@ fn unpack_pair(src: (u32x4, u32x4)) -> ((u32x2, u32x2), (u32x2, u32x2)) {
     let b0: u32x2;
     let b1: u32x2;
     unsafe {
-        use core::arch::aarch64::vget_high_u32;
-        use core::arch::aarch64::vget_low_u32;
         a0 = vget_low_u32(src.0.into()).into();
         a1 = vget_low_u32(src.1.into()).into();
         b0 = vget_high_u32(src.0.into()).into();
@@ -72,7 +107,6 @@ fn unpack_pair(src: (u32x4, u32x4)) -> ((u32x2, u32x2), (u32x2, u32x2)) {
 fn repack_pair(x: (u32x4, u32x4), y: (u32x4, u32x4)) -> (u32x4, u32x4) {
     unsafe {
         use core::arch::aarch64::vcombine_u32;
-        use core::arch::aarch64::vget_low_u32;
         use core::arch::aarch64::vgetq_lane_u32;
         use core::arch::aarch64::vset_lane_u32;
 
@@ -223,7 +257,6 @@ impl FieldElement2625x4 {
         self.shuffle(Shuffle::BACD)
     }
 
-
     // Can probably be sped up using multiple vset/vget instead of table
     #[inline]
     pub fn blend(&self, other: FieldElement2625x4, control: Lanes) -> FieldElement2625x4 {
@@ -318,8 +351,6 @@ impl FieldElement2625x4 {
         let rotated_carryout = |v: (u32x4, u32x4)| -> (u32x4, u32x4) {
             unsafe {
                 use core::arch::aarch64::vcombine_u32;
-                use core::arch::aarch64::vget_high_u32;
-                use core::arch::aarch64::vget_low_u32;
                 use core::arch::aarch64::vqshlq_u32;
 
                 let c: (u32x4, u32x4) = (
@@ -327,16 +358,8 @@ impl FieldElement2625x4 {
                     vqshlq_u32(v.1.into(), shifts.1.into()).into(),
                 );
                 (
-                    vcombine_u32(
-                        vget_high_u32(c.0.into()),
-                        vget_low_u32(c.0.into()),
-                    )
-                    .into(),
-                    vcombine_u32(
-                        vget_high_u32(c.1.into()),
-                        vget_low_u32(c.1.into()),
-                    )
-                    .into(),
+                    vcombine_u32(vget_high_u32(c.0.into()), vget_low_u32(c.0.into())).into(),
+                    vcombine_u32(vget_high_u32(c.1.into()), vget_low_u32(c.1.into())).into(),
                 )
             }
         };
@@ -344,19 +367,9 @@ impl FieldElement2625x4 {
         let combine = |v_lo: (u32x4, u32x4), v_hi: (u32x4, u32x4)| -> (u32x4, u32x4) {
             unsafe {
                 use core::arch::aarch64::vcombine_u32;
-                use core::arch::aarch64::vget_high_u32;
-                use core::arch::aarch64::vget_low_u32;
                 (
-                    vcombine_u32(
-                        vget_low_u32(v_lo.0.into()),
-                        vget_high_u32(v_hi.0.into()),
-                    )
-                    .into(),
-                    vcombine_u32(
-                        vget_low_u32(v_lo.1.into()),
-                        vget_high_u32(v_hi.1.into()),
-                    )
-                    .into(),
+                    vcombine_u32(vget_low_u32(v_lo.0.into()), vget_high_u32(v_hi.0.into())).into(),
+                    vcombine_u32(vget_low_u32(v_lo.1.into()), vget_high_u32(v_hi.1.into())).into(),
                 )
             }
         };
@@ -386,7 +399,6 @@ impl FieldElement2625x4 {
         #[rustfmt::skip] // Retain formatting of return tuple
         let c9_19: (u32x4, u32x4) = unsafe {
             use core::arch::aarch64::vcombine_u32;
-            use core::arch::aarch64::vget_low_u32;
             use core::arch::aarch64::vmulq_n_u32;
 
             let c9_19_spread: (u32x4, u32x4) = (
@@ -475,8 +487,6 @@ impl FieldElement2625x4 {
         fn m_lo(x: (u32x2, u32x2), y: (u32x2, u32x2)) -> (u32x2, u32x2) {
             use core::arch::aarch64::vmull_u32;
             use core::arch::aarch64::vuzp1_u32;
-            use core::arch::aarch64::vget_low_u32;
-            use core::arch::aarch64::vget_high_u32;
             unsafe {
                 let x: (u32x4, u32x4) = (
                     vmull_u32(x.0.into(), y.0.into()).into(),
@@ -530,8 +540,6 @@ impl FieldElement2625x4 {
 
         let negate_D = |x_01: u64x4, p_01: u64x4| -> (u64x2, u64x2) {
             unsafe {
-                use core::arch::aarch64::vget_low_u32;
-                use core::arch::aarch64::vget_high_u32;
                 use core::arch::aarch64::vcombine_u32;
 
                 let x = x_01.0;
@@ -640,8 +648,6 @@ impl<'a, 'b> Mul<&'b FieldElement2625x4> for &'a FieldElement2625x4 {
         fn m_lo(x: (u32x2, u32x2), y: (u32x2, u32x2)) -> (u32x2, u32x2) {
             use core::arch::aarch64::vmull_u32;
             use core::arch::aarch64::vuzp1_u32;
-            use core::arch::aarch64::vget_low_u32;
-            use core::arch::aarch64::vget_high_u32;
             unsafe {
                 let x: (u32x4, u32x4) = (
                     vmull_u32(x.0.into(), y.0.into()).into(),
@@ -836,5 +842,3 @@ mod test {
         assert_eq!(x3, splits[3]);
     }
 }
-
-

rubdos avatar May 22 '24 14:05 rubdos

The current state of this pull request gives a speed-up on the previously problematic CPUs. It is less than the speed-up on older architectures, but still a minimum of ~5%, and more often at least 10 to 12%, on benchmarks that use the SIMD NEON backend, compared to the serial backend. The latest commits to this branch have mainly focused on refactoring to use the internal ARM types (such as uint64x2x2_t) and on avoiding the problematic LLVM instructions. I believe this makes the current branch usable, but more testing on different devices would be welcome. Maybe a patch of the vget_high intrinsic lowering in LLVM will result in even more speed-up, so that is something I will look into next.
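
As a toy illustration of what "using the internal ARM types" means (not the branch's actual refactor): uint64x2x2_t is just a pair of uint64x2_t registers, so two 64-bit vectors can be carried around and operated on together without repacking:

use core::arch::aarch64::{uint64x2x2_t, vaddq_u64};

// Lane-wise addition of two register pairs, staying in NEON registers throughout.
unsafe fn add_pairs(x: uint64x2x2_t, y: uint64x2x2_t) -> uint64x2x2_t {
    uint64x2x2_t(vaddq_u64(x.0, y.0), vaddq_u64(x.1, y.1))
}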

Results of the benchmarks: target_a55.tar.gz

(merge conflicts should be resolved shortly)

Tarinn avatar Aug 21 '24 11:08 Tarinn