
3.0 performance degraded due to locking

richsalz opened this issue 2 years ago • 183 comments

Using some internal QUIC tests, we see this timing in 1.1:

$ USE_GQUIC_VERSIONS=1 perf record -F 999 -g ./quic_lib_tests  --gtest_repeat=100  --gtest_filter=*ZeroRttDisabled*
$ perf report

+   31.73%    31.55%  quic_lib_tests  quic_lib_tests       [.] bn_sqr8x_internal
+    9.28%     9.28%  quic_lib_tests  quic_lib_tests       [.] mul4x_internal
+    4.91%     4.91%  quic_lib_tests  quic_lib_tests       [.] sha256_block_data_order_avx

In 3.0 we see this:


$ USE_GQUIC_VERSIONS=1 perf record -F 999 -g ./quic_lib_tests  --gtest_repeat=100  --gtest_filter=*ZeroRttDisabled*
$ perf report

+   11.02%    10.99%  quic_lib_tests  quic_lib_tests       [.] bn_sqr8x_internal
+    8.38%     8.08%  quic_lib_tests  libpthread-2.31.so   [.] __pthread_rwlock_rdlock
+    7.65%     7.51%  quic_lib_tests  libpthread-2.31.so   [.] __pthread_rwlock_unlock
+    4.98%     4.78%  quic_lib_tests  quic_lib_tests       [.] getrn
+    4.14%     4.11%  quic_lib_tests  quic_lib_tests       [.] mul4x_internal
+    3.37%     2.57%  quic_lib_tests  quic_lib_tests       [.] ossl_tolower
     3.30%     3.30%  quic_lib_tests  quic_lib_tests       [.] ossl_lh_strcasehash
+    2.72%     2.13%  quic_lib_tests  quic_lib_tests       [.] OPENSSL_strcasecmp
+    2.29%     2.05%  quic_lib_tests  quic_lib_tests       [.] ossl_lib_ctx_get_data
+    1.93%     1.93%  quic_lib_tests  quic_lib_tests       [.] sha256_block_data_order_avx

This seems to be part of OSSL_DECODER_CTX_new_for_pkey. 16% of the time is spent in locking, in a single-threaded binary, and 10% is in a string hashtable lookup.

If anyone on the project is going to look at this, I will try to get a small reproducer. But the time for our QUIC tests is doubling.

richsalz avatar Feb 14 '23 17:02 richsalz

Yes, a small reproducer would be great. I'll try to look at it.

beldmit avatar Feb 14 '23 20:02 beldmit

Here is a small reproducer in C++, written by an Akamai colleague.

The code is at https://gist.github.com/richsalz/31a98a3095fa36ab6a66082a0c028a6f
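For context, the hot loop is essentially of this shape (a sketch only, assuming repeated X25519 key generation and shared-secret derivation; the gist has the actual code, which may differ):

// Sketch of the kind of loop such a reproducer uses -- see the gist for the real code.
// Each iteration goes through the EVP fetch/namemap machinery on 3.0.
#include <openssl/evp.h>

static EVP_PKEY *gen_x25519() {
    EVP_PKEY_CTX *ctx = EVP_PKEY_CTX_new_id(EVP_PKEY_X25519, nullptr);
    EVP_PKEY *pkey = nullptr;
    EVP_PKEY_keygen_init(ctx);
    EVP_PKEY_keygen(ctx, &pkey);
    EVP_PKEY_CTX_free(ctx);
    return pkey;
}

int main() {
    for (int i = 0; i < 100000; i++) {
        EVP_PKEY *a = gen_x25519();
        EVP_PKEY *b = gen_x25519();
        // Derive a shared secret between the two fresh keys.
        EVP_PKEY_CTX *dctx = EVP_PKEY_CTX_new(a, nullptr);
        unsigned char secret[32];
        size_t len = sizeof(secret);
        EVP_PKEY_derive_init(dctx);
        EVP_PKEY_derive_set_peer(dctx, b);
        EVP_PKEY_derive(dctx, secret, &len);
        EVP_PKEY_CTX_free(dctx);
        EVP_PKEY_free(a);
        EVP_PKEY_free(b);
    }
    return 0;
}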

Here are the compile commands and timing. This small reproducer is 17 times slower. When used in our QUIC implementation, small transfers take on average twice as long. Microsoft's MSQUIC (their production-quality open source stack on Windows and Azure) avoids 3.0 for this exact reason; they use the 1.1 branch of @quictls/openssl.

$ g++ -O2 -Wall -std=c++17 test_25519.cc -I openssl1/common/include/ openssl1/common/lib/libcrypto.a -lpthread -ldl -o bin.11
$ g++ -O2 -Wall -std=c++17 test_25519.cc -I openssl3/common/include/ openssl3/common/lib/libcrypto.a -lpthread -ldl -o bin.30

$ time ./bin.11

real	0m3.484s
user	0m3.481s
sys	0m0.002s

$ time ./bin.30 

real	0m53.588s
user	0m53.566s
sys	0m0.020s

richsalz avatar Feb 16 '23 15:02 richsalz

I have some good news. Current master (and 3.1, I presume) is 4 times faster than 3.0.

I will still look at whether there is something obvious to fix in 3.1.

beldmit avatar Feb 16 '23 20:02 beldmit

ossl_namemap_name2num_n with its descendants (including locks/unlocks) looks like a hotspot to me - 45% according to callgrind in master. @paulidale @hlandau - could you please take a look?

beldmit avatar Feb 16 '23 20:02 beldmit

E.g., could we avoid the CRYPTO_strndup/OPENSSL_free calls here (~10% in this synthetic example)?

beldmit avatar Feb 16 '23 21:02 beldmit

NOTE: @beldmit If you are looking at timing, do the calls twice and measure starting from the second one. The first will cache the fetches.

slontis avatar Feb 17 '23 02:02 slontis

I am mostly looking at the amount of calls.

beldmit avatar Feb 17 '23 07:02 beldmit

Commenting out the CRYPTO_strndup/OPENSSL_free calls causes a gazillion tests to fail and a 3x speedup. Weird.

beldmit avatar Feb 17 '23 19:02 beldmit

Nevermind, looked into the code :)

beldmit avatar Feb 17 '23 19:02 beldmit

This is on my list, but I'm pretty busy following last week's face-to-face and I have a huge pile of other stuff to look at.

@beldmit, I have difficulty believing that strndup/free is the culprit. Locking is more likely as @richsalz reported. 3.1 should have improved things in this regard by avoiding the more egregious locking problems. I'm hopeful but I cannot say if any others remain at this point. Time will tell.

However, I'm not certain that locking fixes alone will be sufficient. @richsalz's code directly encounters one of the known problems: decoders are slower in 3.0 due to the provider architecture. We need to be able to support a provider loading a key in a format that libcrypto doesn't know about and this means multiple parsing attempts. There simply isn't any other option. That doesn't mean we can't streamline things further -- we intend to look more deeply into this but contractor time is a limited resource and there are many other priorities. Rest assured that performance is definitely on the backlog and is important to the project.
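That said, callers that already know their input format can narrow the search themselves by passing explicit hints to OSSL_DECODER_CTX_new_for_pkey(). A sketch (hypothetical helper, not code from this thread):

#include <openssl/decoder.h>
#include <openssl/evp.h>

// The caller already knows it has a DER-encoded SubjectPublicKeyInfo for
// X25519, so the hints restrict which decoders are even tried.
EVP_PKEY *load_known_x25519_pub(const unsigned char *der, size_t der_len) {
    EVP_PKEY *pkey = nullptr;
    OSSL_DECODER_CTX *dctx = OSSL_DECODER_CTX_new_for_pkey(
        &pkey, "DER", "SubjectPublicKeyInfo", "X25519",
        EVP_PKEY_PUBLIC_KEY, nullptr, nullptr);
    if (dctx == nullptr)
        return nullptr;
    if (!OSSL_DECODER_from_data(dctx, &der, &der_len))
        pkey = nullptr;              // decode failed, pkey was never set
    OSSL_DECODER_CTX_free(dctx);
    return pkey;
}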

As for MSQUIC, there is a quick (sic) win available by refactoring the QUIC code to not create a new SSL_CTX on every connection. Doing this in 1.1.1 is a moderately heavy operation, doing it in 3.0 is quite a bit heavier. I suspect that avoiding this will improve things significantly, both for 1.1.1 and 3.0/3.1. Unfortunately, I do not see myself getting time to contribute such a patch to MSQUIC any time soon.
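Something along these lines (an illustrative sketch with hypothetical function names; MSQUIC's internals obviously differ):

#include <openssl/ssl.h>

// Created once, at listener/registration setup time -- not per connection.
SSL_CTX *make_shared_server_ctx(const char *cert_file, const char *key_file) {
    SSL_CTX *ctx = SSL_CTX_new(TLS_method());
    if (ctx == nullptr)
        return nullptr;
    SSL_CTX_set_min_proto_version(ctx, TLS1_3_VERSION);
    if (SSL_CTX_use_certificate_chain_file(ctx, cert_file) <= 0
        || SSL_CTX_use_PrivateKey_file(ctx, key_file, SSL_FILETYPE_PEM) <= 0) {
        SSL_CTX_free(ctx);
        return nullptr;
    }
    return ctx;
}

// Per connection, only the (much cheaper) SSL object is created.
SSL *new_connection_ssl(SSL_CTX *shared_ctx) {
    return SSL_new(shared_ctx);
}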

Volunteers who can contribute such changes are welcomed of course.

paulidale avatar Feb 20 '23 05:02 paulidale

Yes, a longer run shows that strdup/free are not the culprit. I'll attach the callgrind results later.

beldmit avatar Feb 20 '23 07:02 beldmit

callgrind.out.655458.txt

What looks strange to me is the 153 million calls to ossl_decoder_fast_is_a, which take 20% of overall time (with descendants).

beldmit avatar Feb 20 '23 08:02 beldmit

We need to be able to support a provider loading a key in a format that libcrypto doesn't know about and this means multiple parsing attempts.

The project went too far. In order to support things that aren't currently available, the project is hurting current users -- a lot. That is the wrong trade-off. (This kind of "make it flexible, we can't see future needs" has been an endemic problem in OpenSSL code since the beginning, going back to SSLeay.)

@nibanks you might want to take a look at Pauli's suggestions above. My feeling, however, is that "quite a bit heavier" for flexibility that nobody can currently use is a mistake.

richsalz avatar Feb 20 '23 13:02 richsalz

cc @wfurt to possibly help here

Perf with 3.0 is really a huge issue with MsQuic and I'm not sure we can really adopt it seriously as is. We can't just artificially work around things just for the sake of making a perf test look good. We need to make sure to exercise real world scenarios.

nibanks avatar Feb 20 '23 13:02 nibanks

Dear @nibanks I'd split things apart.

  1. The redesign around providers causes a significant slowdown in some specific areas. It's a problem to be solved at the OpenSSL level, whether for corner cases or in general.

  2. The architectural difference between the 1.1.1 and 3.0 series is significant in itself, so the approaches that worked well for 1.1.1 should be reconsidered for 3.0. @paulidale provided a totally relevant recommendation, and ignoring it looks like a suboptimal decision to me.

beldmit avatar Feb 20 '23 15:02 beldmit

Is there a list of known deficiencies @beldmit? I looked at the linked https://github.com/microsoft/msquic/pull/2588 and it makes sense. We could look for patterns like that if it is clear what the caveats are. Aside from QUIC, I also work on SslStream, which powers all TLS operations in .NET. While we did the work to support OpenSSL 3, we did not run any benchmarks AFAIK, as at that time no distro we support was using it. I'm going to put it on my TODO list as well.

wfurt avatar Feb 20 '23 18:02 wfurt

@wfurt #17627 contains some of these issues. Loading keys/certificates is currently much slower than in 1.1.1, but IRL it's usually a one-time operation.

beldmit avatar Feb 20 '23 18:02 beldmit

On our side, we spent some time trying to use a simple HTTP client / load generator on Ubuntu 22 and were stumped to see it saturate at around 400 connections per second for 48 threads while on other machines it worked fine. We noticed that there was something completely odd regarding locking as it was impossible for the process to use much more than one CPU and that most of the time was spent in futex_wake() and friends. After trying many things we finally locally built 1.1.1 and linked against it and suddenly the performance reached 74000 cps, or almost 200 times more.

Additionally, on the server side we already noticed an extreme use of locking, with chains starting like this one and involving pthread_rwlocks:

  92.80%  [kernel]                  [k] native_queued_spin_lock_slowpath.part.0

I instrumented the code in the hope of finding "the" culprit and found something like 80 calls to any flavor of pthread_rwlock_*() in a single accept/get/response/close cycle! 1.1.1 already had quite a bunch of them (I don't remember how many) but significantly less overhead, as if they were used on shorter chains. In the end we reimplemented our own locks (same principle as the lock callbacks that were unfortunately dropped from 1.1) and could more than double the server-side performance thanks to using much lighter locks.

So yes, there definitely is a big locking problem there. The rule of thumb for a call to any pthread_* function should normally be "am I really willing to give up the CPU and switch to another task, possibly thrashing all the L1 cache here?". Of course that's not exact but in terms of cost and impacts it's extremely close to reality. pthread_* functions should only be for high-level code (browsers etc) and rare cases. It's way too expensive for low-level or library code.

I failed to find a single occurrence of an atomic operation in the whole codebase, which makes me think that all of them are instead implemented by hand around a lock. That's a dead end when you consider that scalable code normally tries hard to avoid atomic ops because they slow things down, and that pthread_* functions involve atomic ops plus, in the contended case, a syscall that is at least 100 times slower than the atomic op itself...

For now we're condemned to stick to 1.1.1, as 3.0 is not just "degraded"; for us it's simply not realistically usable at all for anything beyond the openssl command-line utility. This will pose security issues in the field over time once 1.1.1 stops being supported, as power users are only starting to discover the big regression now by progressively switching to new distros. But for now we have no other choice.

wtarreau avatar Feb 21 '23 17:02 wtarreau

@wtarreau, there was a bug in 3.0 where multiple threads ended up completely serialised which killed multiprocessor performance and, from your description, I think you hit this. This was fixed in #18151 which was merged to later 3.0 releases.

OpenSSL does use atomic operations for reference counting (mostly). It was quite a performance win when it was introduced.

@wfurt, the rule of thumb is pre-load what you're going to use. Fetch is an additional cost over 1.1.1, so pre-fetch. Certificate/key loading is slower, so pre-load where possible. Avoid the convenience calls like EVP_sha256() because these do a (late) fetch internally.
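For example (a sketch with hypothetical function names, not library code), the pre-fetch pattern looks like this:

#include <openssl/evp.h>

// Fetched once at startup and reused; freed at shutdown with EVP_MD_free().
static EVP_MD *md_sha256;

bool app_crypto_init() {                       // hypothetical init hook
    md_sha256 = EVP_MD_fetch(nullptr, "SHA2-256", nullptr);
    return md_sha256 != nullptr;
}

bool app_sha256(const unsigned char *in, size_t inlen,
                unsigned char out[32]) {       // hypothetical helper
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    unsigned int outlen = 0;
    // Using the pre-fetched EVP_MD avoids the implicit late fetch that the
    // EVP_sha256() convenience call would otherwise trigger.
    bool ok = ctx != nullptr
           && EVP_DigestInit_ex(ctx, md_sha256, nullptr)
           && EVP_DigestUpdate(ctx, in, inlen)
           && EVP_DigestFinal_ex(ctx, out, &outlen);
    EVP_MD_CTX_free(ctx);
    return ok;
}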

@richsalz, the post quantum provider is using provider based decoders. I am glad that the flexibility was included to enable such.

paulidale avatar Feb 21 '23 21:02 paulidale

@paulidale, Yes, I know about OpenQuantumSafe. It could have been done by them creating EVP_PKEY's, etc., by hand. That might not be as clean, admittedly, but you could have added EX_DATA to those objects, for example. Again, this is hurting everyone for what is still a niche offering.

Or you could do the "freeze" concept I offered in #15600.

Or you could have an internal hard-wired path that avoids lookup for default-enabled keys.

Is "avoid convenience calls like EVP_sha256" documented anywhere?

Lots of ways to approach this, and the project doesn't seem to be spending much effort on it, unless I am missing something.

richsalz avatar Feb 21 '23 21:02 richsalz

@paulidale regarding #18151, I doubt it, given that it was fixed 9 months ago and the problem was still present in November or so (I don't remember what exact version was the last one by then, probably 3.0.7). And even the last tests we ran on 3.1-dev at the time were less bad than 3.0 but still far from 1.1.1. OK, interesting about the atomic ops; I'm just surprised not to have been able to spot them.

wtarreau avatar Feb 21 '23 21:02 wtarreau

Is "avoid convenience calls like EVP_sha256" documented anywhere?

It appears not. It should be, so here it is: https://github.com/openssl/openssl/pull/20354

slontis avatar Feb 22 '23 00:02 slontis

I instrumented the code in the hope of finding "the" culprit and found something like 80 calls to any flavor of pthread_rwlock_*() in a single accept/get/response/close cycle!

@wtarreau - I wonder if you have any documentation or notes from that time which might help us identify which specific locks were causing the problem? I'm assuming the "80 calls" were not from 80 different call sites, but certain calls being repeated multiple times?

mattcaswell avatar Feb 22 '23 09:02 mattcaswell

Hi Matt,

most of the notes I had were regarding 1.1.1, which we'll stick to, and for now I don't have access to this machine. But I think I will eventually have access again relatively soon, so as time permits I'll try to collect more info. The thing is that it's difficult to observe the call sites because the time wasted calling locks is not evenly spread, due to context switches and sleeps in the kernel. In my case it was by redefining my own pthread_rwlock_*() functions that I had the idea of just adding a call counter, and I was amazed by the number of calls per connection. I do have some variants of these functions for debugging purposes which disable exponential back-off and make it easier to collect call graphs using perf; they're just a bit less easy to use and less precise.
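For what it's worth, the counting wrapper amounts to something like this (simplified sketch, only rdlock shown; our real plock-based replacement is different and does much more):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for RTLD_NEXT
#endif
#include <pthread.h>
#include <dlfcn.h>
#include <atomic>

static std::atomic<unsigned long> rdlock_calls{0};

// Interposed in front of libpthread (e.g. via LD_PRELOAD or by linking this
// object first): count the call, then forward to the real implementation.
extern "C" int pthread_rwlock_rdlock(pthread_rwlock_t *rw) {
    static auto real = reinterpret_cast<int (*)(pthread_rwlock_t *)>(
        dlsym(RTLD_NEXT, "pthread_rwlock_rdlock"));
    rdlock_calls.fetch_add(1, std::memory_order_relaxed);
    return real(rw);
}

// Sampled before/after a batch of connections to get calls per connection.
extern "C" unsigned long rdlock_call_count() {
    return rdlock_calls.load(std::memory_order_relaxed);
}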

wtarreau avatar Feb 22 '23 21:02 wtarreau

But I think I will eventually have access again relatively soon, so as time permits I'll try to collect more info.

The other thing to bear in mind is that there are numerous performance improvements in 3.1. It's still not at 1.1.1 levels, but it's better than 3.0. If you do re-run the exercise, doing it against 3.1 would be most helpful for us to figure out what we still have left to do.

mattcaswell avatar Feb 23 '23 09:02 mattcaswell

Hi @mattcaswell

I could finally connect to the machine today and run the various tests for you. I compared openssl versions 1.1.1t, 3.0.8, 3.1.1-dev (commit 004bd8f97d), as well as wolfssl 5.5.4 (which uses very little locking). These were running on haproxy-2.8-dev8-c05d30, with regular pthread_rwlocks and with our own lighter rwlocks (plock). The machine is an intel Sapphire Rapids dual 8480+ (2 x 56 cores), hence 224 threads total. The tests only involved server mode (i.e. haproxy accepting connections) because the client mode in 3.0 is not even usable at all beyond just command-line tools.

The results are the following, in number of HTTPS connections per second, TLSv1.3, rsa2k cert:

  • openssl 1.1.1: pthread: 44k plock: 53k
  • openssl 3.0.8: pthread: 10-15k plock: 19k # 3.0 is unstable with pthread
  • openssl 3.1.1: pthread: 26k plock: 28k
  • wolfssl 5.5.4: pthread: 120-122k plock: 120-122k # reached load generator's limits.

On a single core, all of them show the same performance, so the only difference is locking. I re-enabled the lock counting in the plock version; since we redefine the pthread_rwlock functions, it's easy to do. These were measured over 1 million total connections. The results below are the number of calls to the lock functions per connection:

  • openssl 1.1.1: 120 rdlock, 276 wrlock, 396 unlock
  • openssl 3.0.8: 635 rdlock, 56 wrlock, 691 unlock
  • openssl 3.1.1: 164 rdlock, 57 wrlock, 220 unlock
  • wolfssl 5.5.4: 0 rdlock, 1 wrlock, 1 unlock

As you can see it is not complicated: the performance is inversely proportional to the number of lock calls. It's very likely that the extra rdlocks in 3.1 compared to 1.1.1 are still far too many, because they are taken around very thin operations. Also what is not seen here are the atomic ops; maybe 3.1.1 still has a lot more shared data that is also accessed via atomic ops. And based on the performance numbers above, even though 3.1 recovers from some of the massive defects of 3.0, it still requires twice as many servers as 1.1.1 to process the same workload.

As promised I got a call graph on 3.1. It's huge so I'm compressing it. I'm also attaching the flame graph, but it's not really exploitable beyond seeing what the main direct callers of the locks are. The pthread_rwlock_* functions here are ours. pl_wait_unlock_long is the relaxation function that waits for someone else to release the lock, which is why it appears a lot. I'm well aware that when you're dealing with 220 locks on a single connection, these won't be removed in the blink of an eye, as I'm certain it already took time to add them :-(

I really do think that one of the biggest mistakes in 1.1 was the removal of the locking callbacks, because in addition to adding an overdose of locks, newer versions don't even offer the possibility to switch to something lighter than pthread anymore, and that's extremely frustrating. I tend to think that if there is one low-hanging fruit for the short term, it's definitely re-enabling a locking API. And please, not one using a generic unlock, but separate read_unlock and write_unlock callbacks, which the user may both map to the same generic unlock function if they want. Generally speaking, having to inspect a lock just to figure out how to unlock it doubles the unlocking cost for purely API reasons, and that's exactly the type of thing that ends up with a heavy mechanism like the one in pthread.
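To illustrate what I mean (purely a sketch of the proposed shape, not an existing OpenSSL interface):

// Hypothetical callback table -- NOT an existing OpenSSL API. The point is
// the distinct read/write unlock entry points.
struct app_lock_callbacks {
    void *(*lock_new)(void);
    void  (*lock_free)(void *lock);
    void  (*read_lock)(void *lock);
    void  (*read_unlock)(void *lock);   // caller knows it held a read lock
    void  (*write_lock)(void *lock);
    void  (*write_unlock)(void *lock);  // caller knows it held a write lock
};
// An application that only has a generic unlock can register the same function
// for both read_unlock and write_unlock; the reverse is not possible without
// inspecting the lock state at unlock time.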

Hoping this helps. ossl31-pthemu-perfreport.txt.gz ossl31-pthemu-perfflame

wtarreau avatar Apr 28 '23 17:04 wtarreau

I have worked on embedded systems with multiple levels of threading, and removal of the locking API in favor of pthread would have definitely impacted that (I left before 1.1.1 was released).

tmshort avatar May 03 '23 14:05 tmshort

Embedded/severely constrained systems are no longer a target for OpenSSL; that ship sailed a long time ago. That of course does not imply we should not try to resolve performance problems caused by locking. But IMO no, adding back custom locking callbacks is not the way to go.

t8m avatar May 03 '23 15:05 t8m

Perhaps "--threads=external" is the way to support external locking systems. Then you document what the various OSSL lock functions have to do.

richsalz avatar May 03 '23 15:05 richsalz

The thing is, you want to be able to support them with the pre-built versions found in distros. Pthread being the default can be OK, but it's really necessary to be able to switch it at boot time at least. And frankly, look above: we're not speaking about an embedded system but about a server, where poor locking divides the performance by 4. It's really important to let users avoid relying on a needlessly heavy API for locking.

wtarreau avatar May 03 '23 16:05 wtarreau