mbedtls
Bad SHA3 performance with GCC
It seems that GCC is pretty bad at optimizing our SHA3 code: it's between 2× and 3× slower than Clang. Here's programs/test/benchmark sha3_256 built with gcc-11 -O3 on my Linux/x86_64 computer:
SHA3-256 : 118404 KiB/s, 22 cycles/byte
Compare with clang-14 -O3 on the same machine:
SHA3-256 : 241472 KiB/s, 10 cycles/byte
There's a similar ratio with -m32. @dave-rodgman reports a similar difference on armv8 M1. Dave looked into it a bit and found that Clang doesn't autovectorize keccak_f1600 whereas GCC does a bit, but turning off auto-vectorization only improves GCC's result by 10%, so that's not the main cause.
With other algorithms, Clang and GCC are within 5% or so of each other. So there seems to be something specific in our SHA3 that GCC doesn't optimize well.
gcc generates worse code for a 64-bit rotate left (ROT64), with an extra and instruction to mask the rotate amount. Compare clang:
0: 4b0103e8 neg w8, w1
4: 9ac82c00 ror x0, x0, x8
vs gcc:
3a0: 12001c21 and w1, w1, #0xff
3a4: 4b0103e1 neg w1, w1
3a8: 9ac12c00 ror x0, x0, x1
This is an easy fix if we convert rho and do a right-rotate instead:
static const uint8_t rho[24] = {
63, 2, 36, 37, 28, 20, 58, 9, 44, 61, 54, 21, 39, 25, 23, 19, 49, 43, 56, 46, 62, 3, 8, 50
};
which then results in just a single ror instruction for both gcc and clang.
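For anyone wanting to sanity-check the converted table, here's a minimal sketch. It assumes the converted entries are simply 64 minus the left-rotate amounts, so rotating right by the new value should match rotating left by the old one. The helper names (rotl64, rotr64, rho_right, rho_conversion_ok) are made up for illustration and are not the identifiers used in the library.

```c
#include <stdint.h>

/* Hypothetical helpers, not the library's ROT64 macro.
 * Both are only valid for 0 < n < 64, which holds for every entry below. */
static inline uint64_t rotl64(uint64_t x, unsigned n)
{
    return (x << n) | (x >> (64 - n));
}

static inline uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));
}

/* Right-rotate amounts, copied from the table proposed above. */
static const uint8_t rho_right[24] = {
    63, 2, 36, 37, 28, 20, 58, 9, 44, 61, 54, 21,
    39, 25, 23, 19, 49, 43, 56, 46, 62, 3, 8, 50
};

/* Returns 1 if a right rotate by rho_right[i] equals a left rotate by
 * (64 - rho_right[i]) for every entry, 0 otherwise. */
int rho_conversion_ok(void)
{
    const uint64_t x = 0x0123456789abcdefULL;
    for (int i = 0; i < 24; i++) {
        unsigned left = 64u - rho_right[i];   /* assumed original left amount */
        if (rotl64(x, left) != rotr64(x, rho_right[i])) {
            return 0;
        }
    }
    return 1;
}
```

This only checks that the two forms compute the same value; whether the compiler emits a bare ror still needs to be confirmed by looking at the disassembly.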
In short: I think the difference is mostly due to clang being more aggressive about unrolling loops. Manually unrolling some of the loops tends to help gcc close the gap (although both benefit).
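To illustrate what "manually unrolling" means here, a rough sketch on the column-parity part of the theta step. This is not the actual keccak_f1600 code; the function names and the flat 25-lane layout (index x + 5*y) are assumptions for the example.

```c
#include <stdint.h>

/* Rolled version: XOR the five lanes of each column into c[x]. */
static void theta_parity_rolled(const uint64_t s[25], uint64_t c[5])
{
    for (int x = 0; x < 5; x++) {
        c[x] = 0;
        for (int y = 0; y < 5; y++) {
            c[x] ^= s[x + 5 * y];
        }
    }
}

/* Manually unrolled version: same result, no loop-carried index math. */
static void theta_parity_unrolled(const uint64_t s[25], uint64_t c[5])
{
    c[0] = s[0] ^ s[5] ^ s[10] ^ s[15] ^ s[20];
    c[1] = s[1] ^ s[6] ^ s[11] ^ s[16] ^ s[21];
    c[2] = s[2] ^ s[7] ^ s[12] ^ s[17] ^ s[22];
    c[3] = s[3] ^ s[8] ^ s[13] ^ s[18] ^ s[23];
    c[4] = s[4] ^ s[9] ^ s[14] ^ s[19] ^ s[24];
}

/* Returns 1 if both variants produce identical parities on a test state. */
int theta_variants_agree(void)
{
    uint64_t s[25], a[5], b[5];
    for (int i = 0; i < 25; i++) {
        /* Arbitrary distinct lane values; unsigned overflow wraps by design. */
        s[i] = 0x9e3779b97f4a7c15ULL * (uint64_t) (i + 1);
    }
    theta_parity_rolled(s, a);
    theta_parity_unrolled(s, b);
    for (int x = 0; x < 5; x++) {
        if (a[x] != b[x]) {
            return 0;
        }
    }
    return 1;
}
```

The two variants are functionally identical, so any speed difference comes down to what each compiler does with the loop. That would need to be measured with the benchmark program on each compiler rather than assumed.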