asuswrt-merlin.ng
arm64 kernels: add accelerated crc32 routines
Incorporate changes from Linux 4.20/4.21 to accelerate the kernel's crc32_le and __crc32c_le helpers.
Incorporates:
9784d82db ("make core crc32() routines weak so they can be overridden")
7481cddf2 ("arm64/lib: add accelerated crc32 routines")
efdb25efc ("arm64/lib: improve CRC32 performance for deep pipelines")
ff98e20ef ("lib/crc32.c: mark crc32_le_base/__crc32c_le_base alias as __pure")
This omits the runtime selection, which relies on machinery that differs significantly in Linux 4.1. We assume CRC support is always available.
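To illustrate what "assume CRC support is always available" means in practice, here is a minimal userspace sketch in C. It is not the kernel code from this change: the function names (crc32_le_ref, crc32_le_hw, crc32_le_sketch) are made up for the example, the hardware path uses the ACLE intrinsics from <arm_acle.h> rather than the assembly routines, and it assumes a little-endian target. The path is chosen at compile time via __ARM_FEATURE_CRC32, with no runtime feature check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable bit-at-a-time reference (reflected polynomial 0xEDB88320).
 * Like the kernel's crc32_le(), it applies no pre- or post-inversion. */
uint32_t crc32_le_ref(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
    return crc;
}

#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>

/* Hardware path: selected at compile time only (e.g. -march=armv8-a+crc).
 * Assumes a little-endian target, so a plain 8-byte load feeds __crc32d. */
uint32_t crc32_le_hw(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len >= 8) {
        uint64_t v;
        memcpy(&v, p, sizeof(v));   /* alignment-safe load */
        crc = __crc32d(crc, v);
        p += 8;
        len -= 8;
    }
    while (len--)
        crc = __crc32b(crc, *p++);
    return crc;
}
#define crc32_le_sketch crc32_le_hw
#else
#define crc32_le_sketch crc32_le_ref
#endif

/* Sanity check against the standard CRC-32 check value for "123456789". */
void crc32_le_selftest(void)
{
    static const unsigned char msg[] = "123456789";
    uint32_t crc = crc32_le_sketch(0xFFFFFFFFu, msg, 9) ^ 0xFFFFFFFFu;
    assert(crc == 0xCBF43926u);
}
```

Calling crc32_le_selftest() exercises whichever path the compiler flags selected. On a build without the CRC extension the macro falls back to the bitwise reference, which is far slower than the table-driven slice-by-8 code it stands in for here.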
I'm also preparing a patch to accelerate crc32_be. With them all done, we can get rid of about 32K of code and tables for the slice-by-8 software solution, which should more than pay for the size of enabling the arm-ce crypto.
This version is also an earlier, less-pipelined version of the upstream code. Commit logs suggested that it would be slightly faster on A53 than the latest version, but it seems that may not be the case. I might update it after more tests.
Speed-up is nearly 8x for the LE ops, and over 5x for the BE ops.
Updated to the Linux 4.21 version. It is about 35% faster on my RT-AX88U, despite the upstream changelog suggesting it would be slightly slower on A53.
Original test time: 170 µs
4.20 version time: 29 µs
4.21 version time: 21 µs
I tested the changes; they run faster on my RT-AX88U.
Is there a particular reason why the code is not making use of carry-less multiplication (using the pmull instruction)? On my RT-AC86U, /proc/cpuinfo does advertise that feature. Could some tricks from MariaDB/server#1652 be adopted? Obviously, we would want compile-time detection instead of runtime detection here.
Note: I am not too familiar with ARMv8 implementations or router SoCs. It might be that pmull is not supported by some ARMv8 SoCs that this code base is targeting.
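For reference, a rough userspace sketch in C of what compile-time detection of pmull could look like. This is only an assumption about the approach, not code from this PR or from MariaDB: the names clmul128, clmul64_ref, clmul64_hw and clmul64_sketch are hypothetical, the hardware path relies on the ACLE feature macros __ARM_FEATURE_AES / __ARM_FEATURE_CRYPTO and the vmull_p64 intrinsic, and it shows only the 64x64 carry-less multiply primitive, not a full folding CRC.

```c
#include <stdint.h>

/* 128-bit carry-less product, split into two 64-bit halves. */
struct clmul128 {
    uint64_t lo;
    uint64_t hi;
};

/* Portable reference: XOR together shifted copies of a, one per set bit of b. */
struct clmul128 clmul64_ref(uint64_t a, uint64_t b)
{
    struct clmul128 r = { 0, 0 };
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i)                      /* avoid an undefined shift by 64 */
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}

#if defined(__ARM_FEATURE_AES) || defined(__ARM_FEATURE_CRYPTO)
#include <arm_neon.h>

/* Hardware path: chosen at compile time (e.g. -march=armv8-a+crypto),
 * so the binary simply assumes pmull is present on the target SoC. */
struct clmul128 clmul64_hw(uint64_t a, uint64_t b)
{
    uint64x2_t v = vreinterpretq_u64_p128(vmull_p64((poly64_t)a, (poly64_t)b));
    struct clmul128 r = { vgetq_lane_u64(v, 0), vgetq_lane_u64(v, 1) };
    return r;
}
#define clmul64_sketch clmul64_hw
#else
#define clmul64_sketch clmul64_ref
#endif
```

For example, clmul64_sketch(3, 3) yields lo = 5, since (x + 1)^2 = x^2 + 1 over GF(2). Whether a pmull-based fold would beat the dedicated CRC32 instructions on these cores is an open question that would need measuring.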