
[BUG] 0A00010F:SSL routines::bad length causing high latency and reconnects

Open towest-nr opened this issue 1 year ago • 4 comments

Describe the bug

When running a high-throughput Redis cluster (3 shards, 1 replica per shard), the read replicas hit a strange issue where we see very frequent reconnects and an inability to maintain a stable connection. Our cluster is set up with TLS (self-signed cert), and after enabling verbose logging I correlated this log line with the reconnects we're seeing in our application:

1:S 08 Oct 2024 00:41:39.521 - Error writing to client: error:0A00010F:SSL routines::bad length

To reproduce

I have attached two simple Kotlin apps that can reproduce this behavior, even against a local 3-shard cluster. This does not appear to be an io-threads issue, because I set io-threads: 1 to rule that out as part of the cause.

valkey-push.tar.gz valkey-range.tar.gz

Expected behavior

When running LPUSH on the primary nodes and LRANGE + UNLINK on the read replica, the replica should be able to handle the throughput without frequent reconnects and bad length errors in the verbose logs.

Additional information

I spent some time adding additional logging to try to figure out the source of the bug. So far I haven't found a conclusive cause, but I do have some hints. The error above happens when SSL_write returns the error code SSL_ERROR_WANT_WRITE and the next write passes a different length than OpenSSL expected (since SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER appears to be enabled, it doesn't need to be the exact same buffer).
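For reference, this is the retry contract the OpenSSL documentation describes: after SSL_ERROR_WANT_WRITE the write must be repeated, and while SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER allows the buffer address to move, the retried length must not shrink below what was previously passed. A minimal sketch of a retry loop that respects this (write_all is a hypothetical helper for illustration, not valkey code):

#include <openssl/ssl.h>

/* Hedged sketch of the OpenSSL write-retry contract: after
 * SSL_ERROR_WANT_WRITE the caller must retry the write for the same data.
 * With SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER the buffer address may move,
 * but retrying with a smaller length than the interrupted call is what
 * surfaces as "SSL routines::bad length". */
static int write_all(SSL *ssl, const char *buf, size_t len) {
    size_t off = 0;
    while (off < len) {
        int ret = SSL_write(ssl, buf + off, (int)(len - off));
        if (ret > 0) {
            off += (size_t)ret; /* advance only past bytes SSL confirmed */
            continue;
        }
        int err = SSL_get_error(ssl, ret);
        if (err == SSL_ERROR_WANT_READ || err == SSL_ERROR_WANT_WRITE) {
            /* Retry later with the same arguments (buf + off, len - off);
             * real code would wait for socket readiness instead of spinning. */
            continue;
        }
        return -1; /* fatal SSL error */
    }
    return 0;
}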

What I see happening locally is that we have some data to write (say, 81856 bytes). In the _writeToClient method the reply list is not empty, so we jump into writevToClient(). In this method the 81856 bytes get split up into 4 arrays and put into iov (with an iovcnt of 4); each array has 20464 bytes of data.

These arrays get passed into connTLSWritev() in tls.c. The total size of the arrays is larger than NET_MAX_WRITES_PER_EVENT (64k), so it invokes connTLSWrite() once for each array. The first call happens and tries to write 20464 bytes of data, but it's actually only able to write 16k (16384) bytes. We hit the following statement and break out of the loop:

if ((size_t)sent != iov[i].iov_len) break;

That sends us back to writevToClient() in networking.c. We increment totwritten and modify the underlying iov arrays to move past the 16k of data that we just wrote. After this, we call back into connTLSWritev() in tls.c. This time we have 81856 - 16384 = 65472 bytes to write. This is just under the 64k limit, so we copy the arrays into a single buffer and try to write it out. Again, SSL_write only writes 16kb of data (I think this is the max allowed per write?) and we repeat the process above until we've written all the data out.

This seems to work fine most of the time, but I'm frequently seeing cases where the c->reply list on the connection appears to have a different length between the time we write out the first 16kb of data and the time we re-enter writevToClient() and attempt to send the un-sent data (i.e. the exact payload that the SSL_write call is expecting to see). It's not clear to me whether it's expected that the c->reply list would change in the midst of writing a payload, but I assume it's just continually appended to? And perhaps the issue here is just that there's no handling of the case where an exact payload needs to be re-written?
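Condensed, the sequence above plus the suspected failure (byte counts from the local repro; the final step is the hypothesis, not a byte-for-byte capture):

c->reply pending        : 81856 bytes -> 4 iovecs of 20464 bytes each
first connTLSWrite()    : asks for 20464, SSL_write accepts 16384
retry pass              : 81856 - 16384 = 65472 bytes left (< 64k), copied
                          into a single buffer for one connTLSWrite()
suspected failure       : if c->reply changed in between, the retried length
                          can end up smaller than what OpenSSL already
                          processed internally -> "bad length"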

I still have a pretty weak grasp on a lot of this code so I'm really hoping that someone who knows a lot more than me would be able to tell what's going on. I'm more than happy to try out any code changes locally to validate a fix.

towest-nr avatar Oct 08 '24 01:10 towest-nr

Pretty anecdotal, but changing NET_MAX_WRITES_PER_EVENT from (1024 * 64) to (1024 * 16) either gets rid of the SSL errors or makes them infrequent enough that I can't repro it locally anymore. I have no idea what other effects this would have, though.
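For clarity, the workaround above is just this one-line change wherever the macro is defined (shown as a sketch of what was tested, not a proposed fix):

/* Workaround tested above, not a fix: lower the per-event write cap.
 * The original value in the source is (1024 * 64). */
#define NET_MAX_WRITES_PER_EVENT (1024 * 16)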

towest-nr avatar Oct 08 '24 02:10 towest-nr

OK, I found an even easier way to reproduce this issue. This might even be possible to repro on a single Redis node (not clustered, no replicas) but I'll test that out later today.

I am able to reliably reproduce the problem by simply running valkey-benchmark:

valkey-benchmark -a <auth_token> --tls --cacert /tls/ca.crt --cluster -d 32768 -n 100000 -r 100000 --threads 10 -t lrange

Depending on where in the valkey code it fails, the benchmark will return failure modes like:

Summary:
  throughput summary: 24026.91 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.751     0.040     0.607     2.095    54.207    76.799
Error: Server closed the connection
Error: Success
Error: Success

or

Summary:
  throughput summary: 19447.69 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.340     0.048     0.799     2.511    62.367    85.695
Error: Server closed the connection

or even this fun one:

Summary:
  throughput summary: 15130.88 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        3.034     0.088     1.751     6.879    35.231   108.287
Error: Server closed the connections=10.6 (overall: 277.8) avg_msec=9.739 (overall: 9.739)
Segmentation fault

If you have verbose logging turned on you can see the SSL bad length error in the logs on the node:

1:M 08 Oct 2024 17:59:33.081 - Error writing to client: error:0A00010F:SSL routines::bad length

towest-nr avatar Oct 08 '24 18:10 towest-nr

FWIW, I tried running this locally using a single valkey instance natively on my M1 and I'm struggling to reproduce this. I was able to repro it every time while running in a local k8s environment so I'm not yet sure what the key is to getting it to happen every time.

towest-nr avatar Oct 08 '24 19:10 towest-nr

@madolson Not sure if you had a chance to look into this yet or not but I have another datapoint and steps to reproduce with a single node running locally.

  1. make BUILD_TLS=yes
  2. ./utils/gen-test-certs.sh
  3. ./src/valkey-server --tls-port 6379 --port 0 --tls-cert-file ./tests/tls/valkey.crt --tls-key-file ./tests/tls/valkey.key --tls-ca-cert-file ./tests/tls/ca.crt --prefetch-batch-max-size 128 --max-new-tls-connections-per-cycle 5 --loglevel verbose

In another terminal:

  1. ./src/valkey-benchmark --tls --cert ./tests/tls/valkey.crt --key ./tests/tls/valkey.key --cacert ./tests/tls/ca.crt -d 32768 -n 100000 -r 100000 --threads 50 -t lrange

This reproduces nearly every time for me with this config. However, even if I revert these configs in our testing environment I still continue to see the bad length issues under enough load. So it looks like these configs might just be allowing the bug to surface under a lower level of load.

towest-nr avatar Oct 10 '24 18:10 towest-nr

We observed a similar issue in our deployment with valkey version 8.0.1 running on K8s. Clients are consistently getting "unexpected EOF" and valkey is showing the error below:

618:M 01 Feb 2025 08:49:18.784 . Total: 43 clients connected (0 replicas), 5434504 (5.18M) bytes in use
618:M 01 Feb 2025 08:49:18.846 - Error writing to client: error:1409E10F:SSL routines:ssl3_write_bytes:bad length

Retrying the request is not helping. The problem is consistent.

rkbk2000 avatar Feb 04 '25 08:02 rkbk2000

No, I originally wasn't able to reproduce it, and since nobody else brought it up I was just waiting around. It seems like someone else saw it, so I'm going to check again today.

madolson avatar Feb 04 '25 16:02 madolson

@towest-nr What version of OpenSSL are you using?

madolson avatar Feb 04 '25 19:02 madolson

@madolson When I reproduced this locally on my M1 I was running OpenSSL 3.4.0 22 Oct 2024 (Library: OpenSSL 3.4.0 22 Oct 2024), but I also reproduced it on AWS running OpenSSL 3.0.2 15 Mar 2022 (Library: OpenSSL 3.0.2 15 Mar 2022).

toddwest avatar Feb 04 '25 19:02 toddwest

OK, I was testing against OpenSSL 1.1; I'll try a later instance with 3.0.

madolson avatar Feb 04 '25 21:02 madolson

Hmm. On Ubuntu 24 LTS with OpenSSL 3.0.13 I am able to run the commands you sent without seeing the error. I have an M2 at home, so I'll try it again on that and see if I can reproduce it.

madolson avatar Feb 04 '25 21:02 madolson

I was able to reproduce this bug.

I got this issue to happen when running the tests from unit/other, specifically the "PIPELINING stresser (also a regression for the old epoll bug)" test, so I use ./runtest --single unit/other --only "PIPELINING stresser (also a regression for the old epoll bug)" --tls --clients 1 --dont-clean to run it. Unfortunately, in most cases it will not fail, but after I changed net.ipv4.tcp_wmem to 4096 4096 4096 using sudo sysctl -w net.ipv4.tcp_wmem="4096 4096 4096" (note: run sysctl net.ipv4.tcp_wmem first to record your default values before changing them), it fails every single time.

To fix it with a hack, I changed https://github.com/valkey-io/valkey/blob/4e0149a381ac5e030537712f046c235110168b60/src/networking.c#L2094 to the following (adding only *2 after NET_MAX_WRITES_PER_EVENT):

while ((next = listNext(&iter)) && iovcnt < iovmax && iov_bytes_len < NET_MAX_WRITES_PER_EVENT*2) {

Another (also hacky) way: I changed https://github.com/valkey-io/valkey/blob/4e0149a381ac5e030537712f046c235110168b60/src/tls.c#L930 to the following (adding only || 1 to make the condition always true):

if (iov_bytes_len > NET_MAX_WRITES_PER_EVENT || 1)

When I was looking into why error:0A00010F:SSL routines::bad length might happen, I found https://github.com/openssl/openssl/issues/21002#issuecomment-1559550080, which quotes the OpenSSL documentation: "When a write function call has to be repeated because SSL_get_error(3) (https://docs.openssl.org/master/man3/SSL_get_error/) returned SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE, it must be repeated with the same arguments. The data that was passed might have been partially processed." That matches what I see: there is an SSL_ERROR_WANT_WRITE, and the buffer on the next call to connTLSWrite is smaller than on the previous one:

[20628] 11 Feb 17:36:36.998 * connTLSWrite len: 49088 ret: -1
[20628] 11 Feb 17:36:36.999 * connTLSWrite len: 8160 ret: -1
[20628] 11 Feb 17:36:36.999 - Error writing to client: error:0A00010F:SSL routines::bad length

This is what I found so far.

zori-janea avatar Feb 11 '25 16:02 zori-janea

I tried some more things with this bug and found that the issue does not happen when the buffer is larger, so it doesn't look like it has to be exactly the same size; it has to be at least the same size or larger. So I created a change that will not call https://github.com/valkey-io/valkey/blob/4e0149a381ac5e030537712f046c235110168b60/src/tls.c#L933 but instead forces https://github.com/valkey-io/valkey/blob/4e0149a381ac5e030537712f046c235110168b60/src/tls.c#L951 with the largest possible buffer, but only if the first iov record is smaller than the previous failed write. Here is the diff:

diff --git a/src/tls.c b/src/tls.c
index 482ad03f4..09512994e 100644
--- a/src/tls.c
+++ b/src/tls.c
@@ -452,6 +452,7 @@ typedef struct tls_connection {
     SSL *ssl;
     char *ssl_error;
     listNode *pending_list_node;
+    size_t last_write_data_len;
 } tls_connection;

 static connection *createTLSConnection(int client_side) {
@@ -911,10 +912,12 @@ static int connTLSWrite(connection *conn_, const void *data, size_t data_len) {
     if (conn->c.state != CONN_STATE_CONNECTED) return -1;
     ERR_clear_error();
     ret = SSL_write(conn->ssl, data, data_len);
+    conn->last_write_data_len = ret < 0 ? data_len : 0;
     return updateStateAfterSSLIO(conn, ret, 1);
 }

 static int connTLSWritev(connection *conn_, const struct iovec *iov, int iovcnt) {
+    tls_connection *conn = (tls_connection *)conn_;
     if (iovcnt == 1) return connTLSWrite(conn_, iov[0].iov_base, iov[0].iov_len);

     /* Accumulate the amount of bytes of each buffer and check if it exceeds NET_MAX_WRITES_PER_EVENT. */
@@ -927,7 +930,7 @@ static int connTLSWritev(connection *conn_, const struct iovec *iov, int iovcnt)
     /* The amount of all buffers is greater than NET_MAX_WRITES_PER_EVENT,
      * which is not worth doing so much memory copying to reduce system calls,
      * therefore, invoke connTLSWrite() multiple times to avoid memory copies. */
-    if (iov_bytes_len > NET_MAX_WRITES_PER_EVENT) {
+    if (iov_bytes_len > NET_MAX_WRITES_PER_EVENT && iovcnt > 0 && iov[0].iov_len >= conn->last_write_data_len) {
         ssize_t tot_sent = 0;
         for (int i = 0; i < iovcnt; i++) {
             ssize_t sent = connTLSWrite(conn_, iov[i].iov_base, iov[i].iov_len);

@towest-nr if you have time, could you test it on your end?

I also found out that when I changed tcp_rmem to 4096 as well, the issue didn't appear anymore, so for testing I keep:

$ sudo sysctl net.ipv4.tcp_wmem net.ipv4.tcp_rmem
net.ipv4.tcp_wmem = 4096        4096    4096
net.ipv4.tcp_rmem = 4096        131072  6291456

zori-janea avatar Feb 12 '25 14:02 zori-janea

@madolson I think this is a known issue we once discussed. We know it's possible that we start writing a large buffer, on reattempts we slowly shrink it, and then fall back to using the static buf. I think we once discussed it with @uriyage and decided it wasn't critical since the replica will disconnect and immediately reconnect... but maybe my old man memory is misleading me...

BTW - I also think this explains failures like: https://github.com/valkey-io/valkey/actions/runs/13268563078/job/37042216535

ranshid avatar Feb 12 '25 16:02 ranshid

Yeah, I was finally able to get it reproduced locally. The issue you mentioned is specifically between the primary and replica, and a disconnect there isn't ideal either. In this case it seems like we are hitting the error while writing responses to clients, which is more concerning since it will have a more material impact on end users.

I also think Zori brought up something that at least I don't remember discussing, which is that with movable buffers the size must be the same, and we aren't preserving that. So at the very least we aren't doing that correctly. I'm going to see if I can tweak the test to get it to consistently fail in CI. The proposed fix seems broadly correct to me. @zori-janea Do you want to open a PR? Otherwise I can create it and credit you.

madolson avatar Feb 13 '25 22:02 madolson

Thanks for the feedback @madolson. I have created the PR.

zori-janea avatar Feb 14 '25 16:02 zori-janea

This triggers the test failure tracked in https://github.com/valkey-io/valkey/issues/1747

xbasel avatar Feb 18 '25 10:02 xbasel

I managed to reproduce it easily after changing:

✗ sudo sysctl net.ipv4.tcp_wmem net.ipv4.tcp_rmem
net.ipv4.tcp_wmem = 1024	4096	4096
net.ipv4.tcp_rmem = 1024	4096	4096

On both:

OpenSSL 3.0.13 30 Jan 2024 (Library: OpenSSL 3.0.13 30 Jan 2024)  
Distributor ID:	Ubuntu
Description:	Ubuntu 24.04.1 LTS
Release:	24.04
Codename:	noble

and

OpenSSL 3.0.2 15 Mar 2022 (Library: OpenSSL 3.0.2 15 Mar 2022)
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.5 LTS
Release:	22.04
Codename:	jammy

xbasel avatar Feb 18 '25 11:02 xbasel