mbedtls Bad SHA3 performance with GCC

Open gilles-peskine-arm opened this issue 1 year ago • 2 comments

It seems that GCC is pretty bad at optimizing our SHA3 code: it's between 2× and 3× slower than Clang. Here's programs/test/benchmark sha3_256 built with gcc-11 -O3 on my Linux/x86_64 computer:

  SHA3-256                 :     118404 KiB/s,         22 cycles/byte

Compare with clang-14 -O3 on the same machine:

  SHA3-256                 :     241472 KiB/s,         10 cycles/byte

There's a similar ratio with -m32. @dave-rodgman reports a similar difference on armv8 M1. Dave looked into it a bit and found that Clang doesn't auto-vectorize keccak_f1600 whereas GCC does a bit, but turning off auto-vectorization only improves GCC's result by about 10%, so that's not the main cause.

With other algorithms, Clang and GCC are within 5% or so of each other. So there seems to be something specific in our SHA3 that GCC doesn't optimize well.

Feb 13 '24 16:02 gilles-peskine-arm

GCC generates poor code for a 64-bit rotate left (ROT64) on AArch64. Compare Clang's output:

       0:	4b0103e8 	neg	w8, w1
       4:	9ac82c00 	ror	x0, x0, x8

vs GCC, which emits an extra and to mask the rotate count:

     3a0:	12001c21 	and	w1, w1, #0xff
     3a4:	4b0103e1 	neg	w1, w1
     3a8:	9ac12c00 	ror	x0, x0, x1

This is easy to fix if we convert the rho table to right-rotate amounts (64 minus each left-rotate amount) and do a right-rotate instead:

static const uint8_t rho[24] = {
   63, 2, 36, 37, 28, 20, 58, 9, 44, 61, 54, 21, 39, 25, 23, 19, 49, 43, 56, 46, 62, 3, 8, 50
};

This then results in a single ror instruction for both GCC and Clang.
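As a rough sanity check of the table conversion, here is a minimal sketch showing that right-rotating by the amounts above matches left-rotating by 64 minus each amount. The names `rho_left`, `rho_right`, `rotl64`, `rotr64` and `rho_tables_agree` are hypothetical and not taken from the mbedtls source; `rho_left` is simply computed from the table above, on the assumption that it corresponds to the existing left-rotate table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical left-rotate table: 64 minus each right-rotate amount below. */
static const uint8_t rho_left[24] = {
    1, 62, 28, 27, 36, 44, 6, 55, 20, 3, 10, 43,
    25, 39, 41, 45, 15, 21, 8, 18, 2, 61, 56, 14
};

/* Right-rotate amounts, copied from the table in the comment above. */
static const uint8_t rho_right[24] = {
    63, 2, 36, 37, 28, 20, 58, 9, 44, 61, 54, 21,
    39, 25, 23, 19, 49, 43, 56, 46, 62, 3, 8, 50
};

/* Rotate left by n bits; n must be in 1..63 so neither shift is by 64. */
static inline uint64_t rotl64(uint64_t x, uint8_t n)
{
    assert(n > 0 && n < 64);
    return (x << n) | (x >> (64 - n));
}

/* Rotate right by n bits; same precondition on n. */
static inline uint64_t rotr64(uint64_t x, uint8_t n)
{
    assert(n > 0 && n < 64);
    return (x >> n) | (x << (64 - n));
}

/* Returns 1 if every left/right pair sums to 64 and rotates x identically. */
static int rho_tables_agree(uint64_t x)
{
    for (size_t i = 0; i < 24; i++) {
        if (rho_left[i] + rho_right[i] != 64) {
            return 0;
        }
        if (rotl64(x, rho_left[i]) != rotr64(x, rho_right[i])) {
            return 0;
        }
    }
    return 1;
}
```

Calling `rho_tables_agree()` with any test value should return 1 if the two tables describe the same rotations. This only checks equivalence of the results; it says nothing about which instructions a given compiler emits.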

Feb 13 '24 16:02 daverodgman

In short: I think the difference is mostly due to clang being more aggressive about unrolling loops. Manually unrolling some of the loops tends to help gcc close the gap (although both benefit).
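To illustrate what manual unrolling means here, the following is a minimal sketch using the column-parity computation from Keccak's theta step as a generic example. It is not the actual mbedtls code, the function names are hypothetical, and it makes no claim about how much either compiler gains from it.

```c
#include <stdint.h>

/* Loop form: XOR together the five lanes of each column of the state. */
static void column_parity_loop(const uint64_t s[25], uint64_t c[5])
{
    for (int x = 0; x < 5; x++) {
        c[x] = 0;
        for (int y = 0; y < 5; y++) {
            c[x] ^= s[x + 5 * y];
        }
    }
}

/* Manually unrolled form: same result, with no loop for the compiler to unroll. */
static void column_parity_unrolled(const uint64_t s[25], uint64_t c[5])
{
    c[0] = s[0] ^ s[5] ^ s[10] ^ s[15] ^ s[20];
    c[1] = s[1] ^ s[6] ^ s[11] ^ s[16] ^ s[21];
    c[2] = s[2] ^ s[7] ^ s[12] ^ s[17] ^ s[22];
    c[3] = s[3] ^ s[8] ^ s[13] ^ s[18] ^ s[23];
    c[4] = s[4] ^ s[9] ^ s[14] ^ s[19] ^ s[24];
}

/* Returns 1 if both variants produce identical parities on a sample state. */
static int parity_variants_agree(void)
{
    uint64_t s[25], a[5], b[5];
    for (int i = 0; i < 25; i++) {
        /* Arbitrary test pattern; unsigned multiplication wraps modulo 2^64. */
        s[i] = 0x9e3779b97f4a7c15ULL * (uint64_t) (i + 1);
    }
    column_parity_loop(s, a);
    column_parity_unrolled(s, b);
    for (int i = 0; i < 5; i++) {
        if (a[i] != b[i]) {
            return 0;
        }
    }
    return 1;
}
```

Comparing the two variants in a benchmark built with each compiler would be one way to check whether unrolling a particular loop is worth the extra code size.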

Feb 14 '24 14:02 daverodgman
