dropbear icon indicating copy to clipboard operation
dropbear copied to clipboard

RSA key generation time seems to have doubled

Open davidbernard04 opened this issue 3 years ago • 7 comments
trafficstars

Hi! We are using dropbearkey in embedded systems, and noticed that the average delay for generating a RSA key of 2048 bits has doubled with 2022.82 version, compared to our previous 2018.76 version.

For example on these two low-power platforms:

  • MIPS @ 150MHz: Now takes around 80 sec instead of 40 sec
  • ARMv5 @ 456MHz: Now takes around 40 sec instead of 20 sec

It is of course an average, because the generation time varies greatly. Note that above numbers are based on 40 keys generated (20 keys for each version, alternately generated). I know it's not a tremendous amount of tests, but seems enough to see a trend.

Were you aware of such impact?

davidbernard04 avatar Jun 15 '22 02:06 davidbernard04

I wasn't aware of it, though haven't benchmarked. libtommath was updated in 2020.79 which seems the most likely change. Update LibTomMath to 1.2.0

update ltm to 1.1.0 and enable FIPS 186.4 compliant key-generation (The "FIPS 186.4 compliant" change should have actually made 2048 RSA faster, 6 rounds rather than 8)

There are also a few changes to dbrandom.c though those seem less likely to change speed.

mkj avatar Jun 15 '22 04:06 mkj

About FIPS compliancy, it is actually 12 rounds rather than 8, since primes size for a 2048-bit RSA key are 1024 bits. But I've made a few test loops by hardcoding 8 rounds on the 2022.82 version, and it has no significant impact (some loops are actually worst, lol).

I've made another run of test with 80 keys generated on our ARMv5 platform, and that's in line with previous results:

  • 2018.76: 15.9 sec average for 80 keys
  • 2022.82: 38.1 sec average for 40 keys (with default 12 rounds)
  • 2022.82: 33.6 sec average for 40 keys (with hardcoded 8 rounds)

So I guess that's really the LibTomMath upgrade that affected the generation time.

Any idea what we can do to improve? Otherwise I suppose we'll have to increase our generation timeout on our low-power platforms. Well, that's the cost of progress. ;)

davidbernard04 avatar Jun 15 '22 14:06 davidbernard04

Ah, the CFLAGS order changed in libtommath's own makefile. Previously its own -O3 came after Dropbear's default -Os, in 2022.82 it builds with -Os. Could you try the patch below? I see 30kB extra binary size on x64 though, will have to have a look at that.

--- a/libtommath/makefile_include.mk
+++ b/libtommath/makefile_include.mk
@@ -104,7 +104,7 @@ LIBTOOLFLAGS += -no-undefined
 endif
 
 # add in the standard FLAGS
-LTM_CFLAGS += $(CFLAGS)
+LTM_CFLAGS := $(CFLAGS) $(LTM_CFLAGS)
 LTM_LFLAGS += $(LFLAGS)
 LTM_LDFLAGS += $(LDFLAGS)
 LTM_LIBTOOLFLAGS += $(LIBTOOLFLAGS)

For reference 2018.76 (from ps aux so may be a bit different) -D _FORTIFY_SOURCE=2 -D DROPBEAR_SERVER -D DROPBEAR_CLIENT ../../libtommath/bn_mp_fwrite.c -quiet -dumpbase bn_mp_fwrite.c -dumpbase-ext .c -mtune=generic -march=x86-64 -g -Os -O3 -Wno-pointer-sign -Wall -Wsign-compare -Wextra -Wshadow -Wsystem-headers -Wdeclaration-after-statement -Wbad-function-cast -Wcast-align -Wstrict-prototypes -Wpointer-arith -fno-strict-overflow -fPIE -fstack-protector-strong -funroll-loops -fomit-frame-pointer -fasynchronous-unwind-tables -fstack-protector-strong -Wformat-security -fstack-clash-protection -fcf-protection

2022.82 -Wall -Wsign-compare -Wextra -Wshadow -Wdeclaration-after-statement -Wbad-function-cast -Wcast-align -Wstrict-prototypes -Wpointer-arith -Wsystem-headers -O3 -funroll-loops -fomit-frame-pointer -Os -W -Wall -Wno-pointer-sign -fno-strict-overflow -fPIE -fstack-protector-strong -D_FORTIFY_SOURCE=2 -Werror -ggdb -I../../libtommath -I../libtomcrypt/src/headers/ -I../../libtommath/../libtomcrypt/src/headers/ -I../ -I../../libtommath/../ -Wno-deprecated -I../libtomcrypt/src/headers/ -DLOCALOPTIONS_H_EXISTS -I. -I.. -DDROPBEAR_SERVER -DDROPBEAR_CLIENT ../../libtommath/bn_mp_div_2.c -o bn_mp_div_2.o

mkj avatar Jun 15 '22 15:06 mkj

Nice catch! I've tried with your patch and there is an improvement:

  • 2022.82: 35 sec average (-Os)
  • 2022.82: 25 sec average (-O3)

Although still not as fast as the 2018 version:

  • 2018.76: 16 sec average (-O3)

About binary size, there is an extra of ~20 KB on each dropbear and dropbearkey executables between -Os and -O3 for the 2022.82 version (ARM platform).

FYI on our side, we might choose to stick with the -Os flag to save some space (we have some platforms in dire need of free space, lol).

davidbernard04 avatar Jun 16 '22 16:06 davidbernard04

OK right. When I get a chance I'll see if I can compare perf of the releases (and rummage for some older boards to try). On x64 here the make timing; ./timing benchmark of standalone libtommath is the same between the versions.

mkj avatar Jun 18 '22 05:06 mkj

@mkj On second thought, I personally prefer the makefile_include.mk before your patch above. This way, we can control the optimization flag for everything (dropbear, libtomcrypto and libtommath) without having one of the lib that forces a -O3 regardless of the CFLAGS we specified.

davidbernard04 avatar Jul 14 '22 17:07 davidbernard04

I think having a default of -O3 for the crypto/maths libraries makes sense (that's where performance matters), but -Os is a better default for non-performance code. I guess there could be defaults for each of them which can be overridden with environment variables - LIBTOM_CFLAGS=-O3 from Dropbear's configure script perhaps.

mkj avatar Jul 18 '22 03:07 mkj

@davidbernard04 that's caused by internal changes of libtommath where previously the prime verification simply wasn't as good as it is now... and this comes at a speed penalty.

tl;dr - if you are only using RSA you can set the compile-time flag LTM_USE_ONLY_MR of libtommath (I hope you use the bundled version :) this makes life here easier for once), which will disable the LR test and you should be fine to go. Please let me know how your performance numbers evolve, since there were some other changes in this area.


longer story...

2018.76 still used ltm 1.0.1 which always only did t deterministic Miller-Rabin Tests for primality testing. (That's definitively not what you want). Since ltm 1.1.0 we run t rounds of the MR test against random numbers + a Lucas-Selfridge test. That's how FIPS 186-4 expects how primality testing is done ideally. Before it was simply not according to "the standard"^TM.

After having a quick look into FIPS 186-4 I just realized that it states in Appendix C.3 that for RSA keys there's no requirement for a LR test, only for DSA keys.

... maybe we overshot a bit on the paranoia scale? Maybe, but I prefer to offer a secure default and users can then still decide whether performance is more important to them.


@mkj:

please feel free to tag me in such issues, I don't watch dropbear that closely and only stumbled over this by chance.

I'll take this back to the project and check whether we can find a compatible way where LR tests can be enabled at compile time, but disabled at runtime. Currently that's not possible.

I'd say you can safely turn on the LTM_USE_ONLY_MR switch if DSA key generation is disabled. For configurations that require both, RSA and DSA key generation at runtime, I don't have a good solution ready. Do you even still have DSA key generation enabled per default?

One possibility would be to always enable LTM_USE_ONLY_MR and use the number of MR iterations of column 'M-R Tests Only' of Table C.1 in FIPS 186-4 for DSA and the numbers returned by mp_prime_rabin_miller_trials() for RSA...

sjaeckel avatar Oct 04 '22 18:10 sjaeckel

@sjaeckel Thanks a lot for your feedback! I've made some tests on the 2022.82 version with the LTM_USE_ONLY_MR define (on our ARMv5 platform), with 80 keys generated for each test below.

For -Os size-optimized build:

  • 36 sec (no define)
  • 26 sec (with the MR define)

For -O3 speed-optimized build:

  • 24 sec (no define)
  • 20 sec (with the MR define)

So it does improve performance for RSA keys generation!

But our app do generate both RSA and ECDSA keys, so unless you or @mkj implement a way for LR tests to be dynamic at runtime (i.e. enabled only for DSA keys), it is preferable that we don't use the LTM_USE_ONLY_MR define.

davidbernard04 avatar Oct 12 '22 15:10 davidbernard04

I've now added a LTM_CFLAGS argument to configure - it defaults to -O3 -funroll-loops like before, but you can set ./configure LTM_CFLAGS=-Os https://github.com/mkj/dropbear/commit/54a90ddac59ec1a1b453fb31c5aca1c96061035b

I'm also setting LTM_USE_ONLY_MR only when DROPBEAR_DSS is enabled (and have set DROPBEAR_DSS disabled by default, it's on the way out). https://github.com/mkj/dropbear/commit/0e70732e1ec4db875c93f2b5a6eeb4185384f163

@sjaeckel just to confirm, ecdsa doesn't seem to need primality testing at all does it?

mkj avatar Nov 10 '22 10:11 mkj

That should read "setting LTM_USE_ONLY_MR only when DROPBEAR_DSS is disabled". And I've noticed the check was wrong, fixed https://github.com/mkj/dropbear/commit/960d374e657602d1c6e09080cd4896b242c3eb75

mkj avatar Nov 10 '22 10:11 mkj

@sjaeckel just to confirm, ecdsa doesn't seem to need primality testing at all does it?

Correct, primality testing is only required for RSA and DSA.

sjaeckel avatar Nov 10 '22 11:11 sjaeckel