asuswrt-merlin.ng
asuswrt-merlin.ng copied to clipboard
A53 systems: enable FP, Neon and crypto in userspace
Apart from a couple of components, user space is not using Neon, and nothing is using hardware floating point.
Add the necessary switches to activate FP, Neon and crypto to EXTRACFLAGS
(which most components already use), or more specifically CPUCFLAGS
.
Generally this will mean the compiler will use hardware floating point instructions instead of calling library routines anywhere float
and double
are used. It will also sometimes make use of the Neon to do 64-bit integer arithmetic.
The compiler is unlikely to actually do any significant vectorisation using the Neon, and won't directly use the crypto.
However, by setting these flags, components with custom Neon/crypto support will be able to detect their presence.
As an example, Nettle can detect Neon from the flags. However that detection wasn't working as intended, so it's been corrected. It will now use its arm/v6 and arm/neon routines.
ffmpeg
's autodetection gets easily confused - as evidenced by the EXTRACFLAGSx
that nobbled an attempt to pass it in. Explicitly specify the cpu
parameter to help it out. It can make use of the floating point.
One cheat is to define __ARM_ARCH_7A__
manually. (The compiler will be defining __ARM_ARCH_8A__
). This is to support older code that might have a series of if/else tests but is not aware of __ARM_ARCH_8A__
. It's possible some component might be ARMv8-A aware and testing from oldest first, in which case that would be tripped up. But that doesn't leave them any worse off than the previous situation, where we had the tools set to ARMv7-A anyway. I haven't confirmed if this cheat is really necessary. Would need an extensive code audit.
Tested on RT-AX88U, although I've not exercised many of the components. It builds and the router works, at least.
The platform.mak stuff is a mess, but I think it's a bit less of a mess now than when I started - I've tidied up some unnecessary conditions. But maybe that causes too much divergence from upstream. Feedback welcome.
nettle building is broken.
make[7]: Entering directory '/home/merlin/amng.ax86/release/src/router/nettle'
siv-cmac-aes256.o.d:168: *** missing separator. Stop.
Does the ax88u build work for you, at least? That was the one tested - I admit to not compiling every other model.
(So did I mess up the PR preparation, or is this a model variant problem?)
Actually, error coming from a "o.d" file is weird. Result of an incremental change? Try just a make nettle-clean
?
Clean build of rt-ax86u in src-rt-5.02p1axhnd.675x of the tip of the PR branch works for me - at least it's got as far as generating libnettle.a, and is still running.
Clean build of rt-ax86u in src-rt-5.02p1axhnd.675x of the tip of the PR branch works for me - at least it's got as far as generating libnettle.a, and is still running.
Might have been a fluke caused by the parallel building of that component as it builds fine tonight, again doing with the same clean rsync from repo. If it happens again I will disable PARALLEL_BUILD for nettle.
Have you been able to measure any performance improvement from these changes to confirm these are actually doing anything? Keep in mind that userspace is compiled as 32-bit armv7, only the kernel is compiled as aarch64, and also only on the bcm490x. bcm675x like the RT-AX58U also use a 32-bit armv7 kernel.
A quick test on openssl (which is probably the main place where performance improvements would be really beneficial) showed no difference in performance.
The platform.mak stuff is a mess, but I think it's a bit less of a mess now than when I started - I've tidied up some unnecessary conditions. But maybe that causes too much divergence from upstream. Feedback welcome.
It's best not to do any cleanup to that file, because indeed it makes merging upstream changes quite difficult. I actually reverted a few of my own changes a few months ago as merging 45581 proved to be very difficult with all my previous changes in place.
Have you been able to measure any performance improvement from these changes to confirm these are actually doing anything? Keep in mind that userspace is compiled as 32-bit armv7, only the kernel is compiled as aarch64, and also only on the bcm490x. bcm675x like the RT-AX58U also use a 32-bit armv7 kernel.
User space is AArch32, but compiling it as "ARMv7A-with-no-floating-point-or-neon" is suboptimal when you're really "ARMv8A-with-floating-point-and-neon".
(Be careful with terminology - ARMv7-A is 32-bit only, but ARMv8-A can be 32-bit or 64-bit. Everywhere you said "armv7" there I think you meant "AArch32".)
I've not measured anything, but I've inspected code generation of various components. Anything using floats+doubles is a lot happier, and it's clear the compiler knows how to work on uint64_t
s with the Neon rather than 32 bits at a time with the standard instructions.
Unlike x86, there really isn't a night-and-day difference on base speed on general code between 32 and 64 bit. The 32-bit instruction set is already good. And it will give you access to the VFP, Neon and Crypto, as long as you tell people it's there.
Only significant limitation is that full use of Neon and Crypto needs hand-crafted assembler and people are increasingly not bothering to make 32-bit versions. (Part of the issue is that with Neon you've got the instructions, but only half the register file, which might mean a significantly different approach).
A quick test on openssl (which is probably the main place where performance improvements would be really beneficial) showed no difference in performance.
openssl was maybe the only userspace component that was already aware it had Neon - detected at runtime for assembly selection via HWCAPs. So it at least managed to run its AES-accelerated crypto, which is one of its biggest jobs. As most of its speed-critical work is handcrafted assembly selected via HWCAPs, the compiler flags won't affect it much.
Openssl does partially suffer from the "only handcrafted for 64-bit" problem. aesv8-armx.pl
can generate Crypto Extension code for either AArch32 or AArch64, but sha512-armv4.pl
has only Neon but not Crypto for AArch32.
I'd only expect you to see a crypto boost in something using nettle, which had failed to activate any of its armv6 or neon assembly.
Otherwise, this is a general hygiene patch - there's megabytes of general code being spat out, and it makes my teeth itch seeing it making library calls for floating point and doing 64-bit ops the hard way. (I was started on this by #803 - some curiosity made me look at the output for the prints, and I saw all the library calls.) I was hoping for a bit of a size reduction, but it doesn't seem to have been significant.
Oh, one question - this is what I pieced together for the CPU possibilities there are. Have I got this right?
RT-AC68U: 4709 (6.x.4708) 2x A9
---
RT-AC88U: 4709 (7.14.114.x) 2x A9
---
RT-AC86U: 4906 (5.02hnd) 2x A53
---
RT-AX56U: 6755 (5.02axhnd.675x) 4x A7 BRCM_CHIP=47622
RT-AX58U: 6750 (5.02axhnd.675x) 3x A7 BRCM_CHIP=63178n
v2: 6756
---
RT-AX88U: 49408 (5.02axhnd) 4x A53 BRCM_CHIP=4908
RT-AX86U: 4908 (5.02p1axhnd.675x) 4x A53 BRCM_CHIP=4908 (62118?) (also 86S - 2x A53 - same firmware)
RT-AX68U: 4906 (5.02p1axhnd.675x) 2x A53 BRCM_CHIP=4908 (62118?)
And I believe the A7 and A9 were base - no FP or Neon, while the A53s have all the trimmings.
And all the A53-based ones run a 64-bit kernel.
At a quick glance it looks about right. Earlier RT-AC68U used a 800 MHz bcm4708. Later revisions went 4709 (and then 4709c0).
One thing to note about the 490x, this is actually a B53. No idea what Broadcom changed over the original A53 design, but we can assume that the two are very similar.
Openssl does partially suffer from the "only handcrafted for 64-bit" problem. aesv8-armx.pl can generate Crypto Extension code for either AArch32 or AArch64, but sha512-armv4.pl has only Neon but not Crypto for AArch32.
is there something that could be optimized there? I only had a quick look at it over the weekend, but had trouble figuring out how the build system worked for openssl (last time I spent time on it, it was back in the 1.0.x days when I did a LOT of optimization there, gaining well over 10% in performance when running OpenVPN on an RT-N66U MIPS CPU).
Also did a good bit of work when trying to bring 1.1.1 back to the performance levels of 1.0.2, but with limited success (even started bisecting to figure out where 1.1.1 lost a lot of its performance). One thing I discovered then was for instance OpenSSL ran faster with -O2 than -O3 on my test router.
is there something that could be optimized there?
Surely, yes, but only if you're in the business of hand-coding ARM/Neon assembly. It's the upstream package that lacks aarch32+SHA256 crypto acceleration assembly.
I've done lots of ARM optimisation in my time, but Neon's new to me. Maybe if I get bored... Haven't figured out if the half-size Neon register file in 32-bit mode matters for SHA256. I suspect it may do. AES doesn't really care...
OpenSSL does at least have non-crypto-extension Neon SHA256. But that will probably be slower than the crypto extension AES-CBC, so it will possibly dominate overall crypto time.
If the OpenSSL manages to avoid that 300Mbit/s non-accelerated bottleneck, it's worth worrying about? Otherwise it's all faster than that anyway.
I can't see anything from a Makefile configuration point of view - I think OpenSSL is already basically optimally configured w.r.t the assembly it does have.
Compiler options in this patch shouldn't hurt, and maybe make -O3 worth revisiting? -O3 has a better chance of being beneficial if the compiler has a better idea of what the CPU is - and it can activate some automatic vectorisation with the Neon, in theory.
Oops, correction time. There is actually an aarch32 crypto extension sha256 implementation in there, I'd just missed it. It's tucked at the bottom of sha256-armv4.pl
...
I can't see anything from a Makefile configuration point of view - I think OpenSSL is already basically optimally configured w.r.t the assembly it does have.
Ok, that's what I wanted to know. I have to trust in the OpenSSL devs to know what they are doing at the asm level, I was just wondering if there were any performance left on the table due to us not using the correct platform profile.
Compiler options in this patch shouldn't hurt, and maybe make -O3 worth revisiting? -O3 has a better chance of being beneficial if the compiler has a better idea of what the CPU is - and it can activate some automatic vectorisation with the Neon, in theory.
My gut feeling tells me that the -O3 performance loss were due to unwrapped loops no longer fitting within the CPU cache.. Might be worth retesting at some point, however it's currently not a priority for me as I have other things that need my attention development-wise.
bcm6750 dose not support hwcrypto bcm6755/6756 bcm4906/4908/4912/4915 support hwcrypto all of them support fpu(neon/vfp3)
bcm6750 dose not support hwcrypto bcm6755/6756 bcm4906/4908/4912/4915 support hwcrypto all of them support fpu(neon/vfp3)
The 675x are all Cortex-A7 based, as far as I can see. I don't see how any of them can have ARM crypto extensions. (Although everything has Broadcom's own SPU accelerator).
It's possible some 675x have Neon - if so, I've not found it configured anywhere, and it would be worthwhile getting that done like this patch.
I've failed to find any /proc/cpuinfo
output via Google showing any FP/Neon on any Asus router with a 32-bit ARM myself. The A9-based devices did not have FP or Neon for sure. Can you show the output from a device that does? (It should show hardware capabilities regardless of what's enabled in the software).
https://github.com/SWRT-dev/softcenter_tools/blob/52ff614504526fb6d828fc3f4ddf6a40145e1faa/sc.sh#L22
Just build anything and try to run it
Asus prebuilt files are built with flags softfp or soft
, i don't think it's good idea to modify the build environment.
Thanks - the code there seems to suggest all 675x have Neon, which would make sense. And looking more closely at the Merlin kernel config, I can see it enabled in the 63178 kernel for the AX58U. I'd missed that.
So I think this patch can be extended to cover those systems - the benefit would be basically the same as for the A53 ones.
Changing the flags as per this PR won't affect binary compatibility - we stay using the soft/softfp ABI, so FP values are passed in integer registers to external functions. What wouldn't work in this system would be using -mfloat-abi=hard
as in your link.
What do you think is the crypto difference between 6750 and 6755? I'm pretty certain it can't be the ARM crypto extensions, unless we're wrong about those chips being Cortex-A7s. Could maybe be a Cortex-A32?
Great work!