rally icon indicating copy to clipboard operation
rally copied to clipboard

[SSL: TOO_MANY_KEY_UPDATES] too many key updates During Bulk Indexing

Open inqueue opened this issue 3 years ago • 4 comments
trafficstars

Rally version (get with esrally --version): esrally 2.6.1.dev0 (git revision: 20f7600)

Invoked command:

esrally race --pipeline=benchmark-only --target-hosts="10.10.27.206:9200,10.10.28.186:9200,10.10.35.82:9200,10.10.38.34:9200" --track-repository="internal" --track="solutions/logs" --challenge="logging-indexing" --track-params=indexing-logging-with-snapshots-experiment-0-track-params.json --client-options=indexing-logging-with-snapshots-experiment-0-client-options.json --load-driver-hosts=$load_driver_hosts

OS version: Ubuntu 18.04.4 LTS

Description of the problem including expected versus actual behavior:

OpenSSL is returning an intermittent [SSL: TOO_MANY_KEY_UPDATES] too many key updates (_ssl.c:2635) error during bulk indexing which stalls connection to the target Elasticsearch node for 60s. This can negatively impact benchmark results. The error is entirely unexpected.

Steps to reproduce:

The behavior was logged 237 times in an indexing benchmark spanning multiple days. No attempt outside of a running benchmark has been made to reproduce it.

rally.log

2022-07-14 21:35:22,647 -not-actor-/PID:27448 esrally.rally INFO OS [uname_result(system='Linux', node='rally-0', release='5.3.0-1023-aws', version='#25~18.04.1-Ubuntu SMP Fri Jun 5 15:18:30 UTC 2020', machine='x86_64', processor='x86_64')]
2022-07-14 21:35:22,647 -not-actor-/PID:27448 esrally.rally INFO Python [namespace(_multiarch='x86_64-linux-gnu', cache_tag='cpython-38', hexversion=50859504, name='cpython', version=sys.version_info(major=3, minor=8, micro=13, releaselevel='final', serial=0))]
2022-07-14 21:35:22,652 -not-actor-/PID:27448 esrally.rally INFO Rally version [2.6.1.dev0 (git revision: 20f7600)]
2022-07-14 21:35:22,652 -not-actor-/PID:27448 esrally.utils.net INFO Connecting directly to the Internet (no proxy support
...
2022-07-20 14:03:29,526 ActorAddr-(T|:46637)/PID:25724 elasticsearch WARNING POST https://10.10.28.186:9200/_bulk [status:N/A request:0.007s]
Traceback (most recent call last):

  File "/home/esbench/.local/lib/python3.8/site-packages/elasticsearch/_async/http_aiohttp.py", line 291, in perform_request
    async with self.session.request(

  File "/home/esbench/.local/lib/python3.8/site-packages/aiohttp/client.py", line 1138, in __aenter__
    self._resp = await self._coro

  File "/home/esbench/.local/lib/python3.8/site-packages/aiohttp/client.py", line 559, in _request
    await resp.start(conn)

  File "/home/esbench/.local/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 898, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]

  File "/home/esbench/.local/lib/python3.8/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter

aiohttp.client_exceptions.ClientOSError: [Errno 1] [SSL: TOO_MANY_KEY_UPDATES] too many key updates (_ssl.c:2635)

2022-07-20 14:03:29,527 ActorAddr-(T|:46637)/PID:25724 elasticsearch WARNING Connection <AIOHttpConnection: https://10.10.28.186:9200> has failed for 1 times in a row, putting on 60 second timeout.

CPython here is receiving the error from libssl. Though the error was observed in Rally in the elasticsearch-py async client, it is difficult to tell if this behavior is the result of something Rally, ES async client, aiohttp, or OpenSSL itself is doing. It comes from OpenSSL TLSv1.3 KeyUpdate. KeyUpdate happens when the sender is updating its sending crypto keys, further described in RFC 8446, section 4.6.3. It reads like a protection mechanism and it feels like something isn't quite right with TLSv1.3 usage someplace. We might be able to uncover more with packet captures and process traces.

inqueue avatar Jul 20 '22 21:07 inqueue

Intriguing! This appears to be an OpenSSL bug that was fixed in 1.1.1b: https://github.com/openssl/openssl/pull/8299. I don't know what version was used here but it looks like 18.04 only includes 1.1.1 (without any letter ie. without that fix): https://packages.ubuntu.com/bionic/openssl

pquentin avatar Jul 20 '22 23:07 pquentin

Thank you @pquentin, that could be it! The OpenSSL version:

$ openssl version
OpenSSL 1.1.1  11 Sep 2018

inqueue avatar Jul 20 '22 23:07 inqueue

I'm afraid there isn't much we can do on our side to work around this issue. We could eg. close the connection every X requests but that feels like a hack when the real fix for long running benchmarks is to use a slightly newer version of OpenSSL. We can however warn when we detect OpenSSL<1.1.1b.

pquentin avatar Jul 21 '22 23:07 pquentin

Intriguing! This appears to be an OpenSSL bug that was fixed in 1.1.1b: https://github.com/openssl/openssl/pull/8299. I don't know what version was used here but it looks like 18.04 only includes 1.1.1 (without any letter ie. without that fix): https://packages.ubuntu.com/bionic/openssl

Ubuntu backports changes from public builds to their own build number. This is explained more eloquently here

DJRickyB avatar Aug 09 '22 21:08 DJRickyB