
TLS performance regression with 4096 RSA key migrating from HAProxy 2.2 to 2.8

Open alebedev87 opened this issue 3 months ago • 66 comments

Detailed Description of the Problem

A significant TLS performance degradation is observed when migrating from OpenShift 4.12 to 4.17.

  • HAProxy versions involved:
    • OpenShift 4.12: HAProxy 2.2.24
    • OpenShift 4.17: HAProxy 2.8.10
  • TLS key: rsa:4096.
  • Performance testing tool: vegeta.
  • HAProxy access logs show: SSL handshake failure.
  • TLS termination occurs at the edge (frontend). No re-encryption between HAProxy and backend servers.

Vegeta runs:

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         303611, 1686.65, 970.91
Duration      [total, attack, wait]             3m3s, 3m0s, 3.383s
Latencies     [min, mean, 50, 90, 95, 99, max]  1.472ms, 2.898s, 4.13s, 5.226s, 5.403s, 5.843s, 21.245s
Bytes In      [total, mean]                     9615078, 31.67
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           58.65%
Status Codes  [code:count]                      0:125554  200:178057  
Error Set:
Get "https://hello-world.com/": EOF

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         209729, 1164.72, 908.89
Duration      [total, attack, wait]             3m3s, 3m0s, 2.976s
Latencies     [min, mean, 50, 90, 95, 99, max]  1.157ms, 4.27s, 5.086s, 6.367s, 6.787s, 7.464s, 23.773s
Bytes In      [total, mean]                     8983872, 42.84
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           79.33%
Status Codes  [code:count]                      0:43361  200:166368  
Error Set:
Get "https://hello-world.com/": EOF
Get "https://hello-world.com/": dial tcp 0.0.0.0:0->10.0.50.224:443: connect: connection refused

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         328537, 1825.15, 941.73
Duration      [total, attack, wait]             3m3s, 3m0s, 3.429s
Latencies     [min, mean, 50, 90, 95, 99, max]  1.23ms, 2.584s, 2.658s, 5.051s, 5.259s, 5.647s, 29.6s
Bytes In      [total, mean]                     9328284, 28.39
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           52.58%
Status Codes  [code:count]                      0:155791  200:172746  
Error Set:
Get "https://hello-world.com/": EOF
Get "https://hello-world.com/": dial tcp 0.0.0.0:0->10.0.50.224:443: connect: connection refused

HAProxy 2.8.10 access logs:

haproxy[31]: 10.0.11.73:59679 [17/Sep/2025:07:46:55.416] fe_sni~ be_edge_http:test:myedge/pod:web-server-6fd7fc6444-w9ln5:service-unsecure:http:10.128.2.17:8080 0/0/37/25/62 200 253 - - --NI 25842/24728/51/51/0 0/0 "GET / HTTP/1.1"
haproxy[31]: 10.0.11.73:36407 [17/Sep/2025:07:46:55.416] fe_sni~ be_edge_http:test:myedge/pod:web-server-6fd7fc6444-w9ln5:service-unsecure:http:10.128.2.17:8080 0/0/37/25/62 200 253 - - --NI 25841/24727/50/50/0 0/0 "GET / HTTP/1.1"
haproxy[31]: 10.0.11.73:60273 [17/Sep/2025:07:46:55.416] fe_sni~ be_edge_http:test:myedge/pod:web-server-6fd7fc6444-w9ln5:service-unsecure:http:10.128.2.17:8080 0/0/37/25/62 200 253 - - --NI 25841/24727/49/49/0 0/0 "GET / HTTP/1.1"
haproxy[31]: 10.0.11.73:43247 [17/Sep/2025:07:46:25.507] fe_sni/1: SSL handshake failure
haproxy[31]: 10.0.11.73:57489 [17/Sep/2025:07:46:25.499] fe_sni/1: SSL handshake failure
haproxy[31]: 10.0.11.73:59035 [17/Sep/2025:07:46:54.457] public_ssl be_sni/fe_sni 1/32/1047 0 sD 25853/1120/1063/1063/0 0/0
haproxy[31]: 10.0.11.73:40352 [17/Sep/2025:07:46:54.490] public_ssl be_sni/fe_sni 1/0/1014 0 sD 25852/1119/1062/1062/0 0/0
haproxy[31]: 10.0.11.73:38208 [17/Sep/2025:07:46:54.490] public_ssl be_sni/fe_sni 1/0/1014 0 sD 25851/1118/1061/1061/0 0/0

The HAProxy process shows excessive CPU usage (~350% on a 4-core node): Image

show tasks shows the run queue almost exclusively occupied by the ssl_sock_io_cb function:

Running tasks: 24406 (4 threads)
  function                     places     %    lat_tot   lat_avg
  ssl_sock_io_cb                24285   99.5      -         -   
  sc_conn_io_cb                    70    0.2      -         -   
  h1_io_cb                         39    0.1      -         -   
  process_stream                    8    0.0      -         -   
  h1_timeout_task                   4    0.0      -         -   

Running tasks: 24369 (4 threads)
  function                     places     %    lat_tot   lat_avg
  ssl_sock_io_cb                23941   98.2      -         -   
  sc_conn_io_cb                   224    0.9      -         -   
  process_stream                   88    0.3      -         -   
  h1_io_cb                         83    0.3      -         -   
  h1_timeout_task                  16    0.0      -         -   
  session_expire_embryonic          8    0.0      -         -   
  accept_queue_process              4    0.0      -         -   
  other                             3    0.0      -         -   
  xprt_handshake_io_cb              2    0.0      -         -  

Running tasks: 24185 (4 threads)
  function                     places     %    lat_tot   lat_avg
  ssl_sock_io_cb                23884   98.7      -         -   
  sc_conn_io_cb                   165    0.6      -         -   
  h1_io_cb                         69    0.2      -         -   
  process_stream                   62    0.2      -         -   
  h1_timeout_task                   4    0.0      -         -   
  xprt_handshake_io_cb              1    0.0      -         -  

We think that the following issues were showing similar problems:

  • https://github.com/haproxy/haproxy/issues/1914
  • https://github.com/haproxy/haproxy/issues/1988

As with https://github.com/haproxy/haproxy/issues/1914, we managed to narrow the problem down to commit 4c48edba4f45bb78f41af7d79d3c176710fe6a90, which first appeared in the 2.4-dev12 release. Reverting the commit results in:

  • 100% success rate (no SSL handshake failures)
  • Improved request rate (~11K vs ~2K)
  • Reduced CPU usage

Vegeta runs with 4c48edba4f45bb78f41af7d79d3c176710fe6a90 reverted:

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         2054995, 11414.31, 11405.44
Duration      [total, attack, wait]             3m0s, 3m0s, 140.122ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.898ms, 304.647ms, 284.05ms, 380.766ms, 411.503ms, 473.042ms, 12.395s
Bytes In      [total, mean]                     110969730, 54.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:2054995  
Error Set:

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         2047009, 11372.09, 11368.99
Duration      [total, attack, wait]             3m0s, 3m0s, 49.098ms
Latencies     [min, mean, 50, 90, 95, 99, max]  674.292µs, 287.529ms, 262.527ms, 366.743ms, 398.454ms, 475.019ms, 14.126s
Bytes In      [total, mean]                     110538486, 54.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:2047009  
Error Set:

CPU usage: Image

show tasks (note that the number of tasks is noticeably lower):

Running tasks: 5005 (4 threads)
  function                     places     %    lat_tot   lat_avg
  ssl_sock_io_cb                 4819   96.2      -         -   
  h1_io_cb                        186    3.7      -         -  

Running tasks: 0 (4 threads)
  function                     places     %    lat_tot   lat_avg

Running tasks: 1 (4 threads)
  function                     places     %    lat_tot   lat_avg                                                
  h1_io_cb                          1  100.0      -         -   

Running tasks: 1 (4 threads)
  function                     places     %    lat_tot   lat_avg                                                
  h1_io_cb                          1  100.0      -         -   

Running tasks: 2 (4 threads)
  function                     places     %    lat_tot   lat_avg
  ssl_sock_io_cb                    2  100.0      -         -   

Running tasks: 3 (4 threads)
  function                     places     %    lat_tot   lat_avg
  sc_conn_io_cb                     2   66.6      -         -   
  process_stream                    1   33.3      -         -   

Running tasks: 3 (4 threads)
  function                     places     %    lat_tot   lat_avg
  sc_conn_io_cb                     2   66.6      -         -   
  process_stream                    1   33.3      -         -   

Running tasks: 2 (4 threads)
  function                     places     %    lat_tot   lat_avg
  h1_io_cb                          1   50.0      -         -   
  ssl_sock_io_cb                    1   50.0      -         -  

Expected Behavior

No SSL handshake failures, and request and success rates comparable to HAProxy 2.2.z, when a 4096-bit RSA key is used after migrating to 2.4 or later.

Steps to Reproduce the Behavior

  1. Use HAProxy 2.8.10 or any version starting from 2.4.0.
  2. Use a 4096-bit RSA key for the server certificate (an example key generation is shown after this list).
  3. Terminate TLS on the frontend; do not use TLS between HAProxy and the backend server.
  4. Number of threads: 4.
  5. Use vegeta to test the performance (5000 workers, maximum hit rate).
  6. Check the success rate from vegeta and access logs in HAProxy.
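
For reference, a self-signed 4096-bit RSA certificate like the one used in step 2 can be generated with OpenSSL (the hostname is only an example; HAProxy expects the certificate and key concatenated into a single PEM file):

$ openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
    -subj "/CN=hello-world.com" -keyout tls.key -out tls.crt
$ cat tls.crt tls.key > default.pem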

Do you have any idea what may have caused this?

Reverting 4c48edba4f45bb78f41af7d79d3c176710fe6a90 on HAProxy 2.8.10 restores TLS performance similar to 2.2.
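
For reference, the reverted build can be reproduced from a source checkout of the 2.8 maintenance tree roughly like this (a sketch only; our actual build uses our own packaging, and the revert may need trivial conflict resolution):

$ git checkout v2.8.10
$ git revert 4c48edba4f45bb78f41af7d79d3c176710fe6a90
$ make -j"$(nproc)" TARGET=linux-glibc USE_OPENSSL=1 USE_ZLIB=1 USE_PCRE=1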

Do you have an idea how to solve the issue?

No response

What is your configuration?

global
  no strict-limits
  maxconn 50000
  nbthread 4                                                                                                    
  daemon
  log /var/lib/rsyslog/rsyslog.sock len 1024 local1 info
  log-send-hostname
  ca-base /etc/ssl
  crt-base /etc/ssl
  stats socket /var/lib/haproxy/run/haproxy.sock mode 600 level admin expose-fd listeners
  stats timeout 2m
  tune.maxrewrite 8192
  tune.bufsize 32768
  ssl-default-bind-options ssl-min-ver TLSv1.2
  tune.ssl.default-dh-param 2048
  ssl-default-bind-ciphers REDACTED                                                                                            
  ssl-default-bind-ciphersuites REDACTED

defaults
  maxconn 50000
  option httplog
  log global                                                                                                    
  errorfile 503 /var/lib/haproxy/conf/error-page-503.http
  errorfile 404 /var/lib/haproxy/conf/error-page-404.http
  timeout connect 5s
  timeout client 30s
  timeout client-fin 1s
  timeout server 30s
  timeout server-fin 1s
  timeout http-request 10s
  timeout http-keep-alive 300s
  timeout tunnel 1h

frontend public_ssl
  option tcplog    
  bind :443 accept-proxy
  tcp-request inspect-delay 5s
  tcp-request content accept if { req_ssl_hello_type 1 }
  acl sni req.ssl_sni -m found
  acl sni_passthrough req.ssl_sni,lower,map_reg(/var/lib/haproxy/conf/os_sni_passthrough.map) -m found
  use_backend %[req.ssl_sni,lower,map_reg(/var/lib/haproxy/conf/os_tcp_be.map)] if sni sni_passthrough
  use_backend be_sni if sni
  default_backend be_no_sni

backend be_sni
  server fe_sni unix@/var/lib/haproxy/run/haproxy-sni.sock weight 1 send-proxy

frontend fe_sni
  bind unix@/var/lib/haproxy/run/haproxy-sni.sock ssl crt /var/lib/haproxy/router/certs/default.pem crt-list /var/lib/haproxy/conf/cert_config.map accept-proxy no-alpn
  mode http
  option idle-close-on-response
  http-request deny if { hdr_len(content-length) 0 }
  http-request del-header Proxy
  http-request set-header Host %[req.hdr(Host),lower]
  use_backend %[base,map_reg(/var/lib/haproxy/conf/os_edge_reencrypt_be.map)]
  default_backend openshift_default

backend be_edge_http:test:myedge
  mode http
  option redispatch
  option forwardfor
  balance random
  timeout check 5000ms
  http-request add-header X-Forwarded-Host %[req.hdr(host)]
  http-request add-header X-Forwarded-Port %[dst_port]
  http-request add-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request add-header X-Forwarded-Proto https if { ssl_fc }
  http-request add-header X-Forwarded-Proto-Version h2 if { ssl_fc_alpn -i h2 }
  http-request add-header Forwarded for=%[src];host=%[req.hdr(host)];proto=%[req.hdr(X-Forwarded-Proto)]
  cookie 851b234ae52ef7709dfb5e5fadb97e8c insert indirect nocache httponly secure attr SameSite=None
  server pod:web-server-6fd7fc6444-w9ln5:service-unsecure:http:10.128.2.17:8080 10.128.2.17:8080 cookie d8f71950c0e59754be3f646cea5d4fce weight 1

Output of haproxy -vv

$ haproxy -vv
HAProxy version 2.8.10-f28885f 2024/06/14 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2028.
Known bugs: http://www.haproxy.org/bugs/bugs-2.8.10.html
Running on: Linux 5.14.0-427.86.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Aug 27 07:03:43 EDT 2025 x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment
  OPTIONS = USE_LINUX_TPROXY=1 USE_CRYPT_H=1 USE_GETADDRINFO=1 USE_OPENSSL=1 USE_ZLIB=1 USE_PCRE=1
  DEBUG   = -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY -LUA -MATH -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OPENSSL_WOLFSSL -OT +PCRE -PCRE2 -PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL -PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT +SHM_OPEN -SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 -SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL +ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=4).
Built with OpenSSL version : OpenSSL 3.0.7 1 Nov 2022
Running on OpenSSL version : OpenSSL 3.0.7 1 Nov 2022
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
OpenSSL providers loaded : default
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.44 2020-02-12
Running on PCRE version : 8.44 2020-02-12
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 11.4.1 20231218 (Red Hat 11.4.1-3)

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=

Available services : none

Available filters :
	[BWLIM] bwlim-in
	[BWLIM] bwlim-out
	[CACHE] cache
	[COMP] compression
	[FCGI] fcgi-app
	[SPOE] spoe
	[TRACE] trace

Last Outputs and Backtraces


Additional Information

vegeta tests with CPU monitoring were done against 2.3.21 and 2.4.0, both compiled with OpenSSL 1.1. 2.4.0 shows CPU trends similar to 2.8.10 and gets better TLS performance when 4c48edba4f45bb78f41af7d79d3c176710fe6a90 is reverted. 2.3.21 does not show TLS performance problems with a 4096-bit RSA key certificate. That's how we narrowed the problem down to 2.4-dev12.

alebedev87 avatar Sep 17 '25 11:09 alebedev87

any chance to test 3.2.latest?

git001 avatar Sep 17 '25 12:09 git001

any chance to test 3.2.latest?

Testing 3.2.latest might not be straightforward. HAProxy 3 is something we haven’t tried yet, so our build configuration may require significant changes. Additionally, even if the issue is fixed in HAProxy 3.2, it could take a while for OpenShift to adopt this version. Since the regression is occurring now, we’d like to focus on a solution that can address it sooner.

alebedev87 avatar Sep 17 '25 12:09 alebedev87

Thank you for your complete report, we're going to investigate this.

wlallemand avatar Sep 19 '25 07:09 wlallemand

any chance to test 3.2.latest?

We updated our build configuration and tested with HAProxy 3.2.4. While we may need additional testing, at first glance the CPU usage and success rate appear similar to what we observed with 2.8.10.

Vegeta runs:

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         349745, 1942.22, 997.56
Duration      [total, attack, wait]             3m4s, 3m0s, 3.731s
Latencies     [min, mean, 50, 90, 95, 99, max]  561.579µs, 2.475s, 3.097s, 4.851s, 5.17s, 6.472s, 29.146s
Bytes In      [total, mean]                     9901278, 28.31
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           52.43%
Status Codes  [code:count]                      0:166388  200:183357  
Error Set:
Get "https://hello-world.com/": EOF
Get "https://hello-world.com/": dial tcp 0.0.0.0:0->10.0.52.233:443: connect: connection reset by peer


$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         383030, 2128.03, 1064.25
Duration      [total, attack, wait]             3m4s, 3m0s, 3.519s
Latencies     [min, mean, 50, 90, 95, 99, max]  572.632µs, 2.18s, 2.427s, 4.254s, 4.515s, 5.08s, 28.945s
Bytes In      [total, mean]                     10546362, 27.53
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           50.99%
Status Codes  [code:count]                      0:187727  200:195303  
Error Set:
Get "https://hello-world.com/": EOF
Image

Access logs contained a new error though (EPIPE returned by OS):

haproxy[25]: 10.0.8.143:41463 [18/Sep/2025:20:29:55.228] fe_sni~ be_edge_http:test:myedge/pod:web-server-6499c75b57-wcw2h:service-unsecure:http:10.129.2.18:8080 0/0/38/18/56 200 253 - - --NI 24872/23625/19/19/0 0/0 "GET / HTTP/1.1"
haproxy[25]: 10.0.8.143:35871 [18/Sep/2025:20:29:55.228] fe_sni~ be_edge_http:test:myedge/pod:web-server-6499c75b57-wcw2h:service-unsecure:http:10.129.2.18:8080 0/0/38/18/56 200 253 - - --NI 24872/23625/18/18/0 0/0 "GET / HTTP/1.1"
haproxy[25]: 10.0.8.143:46318 [18/Sep/2025:20:29:54.289] public_ssl be_sni/fe_sni 1/0/1012 0 sD 24880/1249/1235/1235/0 0/0
haproxy[25]: 10.0.8.143:57061 [18/Sep/2025:20:29:25.301] fe_sni/1: EPIPE returned by OS
haproxy[25]: 10.0.8.143:46151 [18/Sep/2025:20:29:25.318] fe_sni/1: EPIPE returned by OS
. . .
haproxy[25]: 10.0.8.143:43075 [18/Sep/2025:20:29:25.301] fe_sni/1: Timeout during SSL handshake                                                                                
haproxy[25]: 10.0.8.143:49631 [18/Sep/2025:20:29:25.301] fe_sni/1: Timeout during SSL handshake
haproxy[25]: 10.0.8.143:43911 [18/Sep/2025:20:29:25.301] fe_sni/1: Timeout during SSL handshake

This may be something to be re-investigated on our side, since we used the same machines and configuration as in the 2.8.10 tests.

alebedev87 avatar Sep 19 '25 09:09 alebedev87

Thanks for your test. The patch you've found only affects scheduling priorities, not CPU usage. So there might possibly be two issues at once (maybe one uncovers the other). I suspect there is some late asymmetric crypto processing past the end of the handshake, and that this is causing the huge latencies, which would explain why reverting the patch and forcing the tasks to be serialized improves the situation. However, it should not change anything CPU-wise. Maybe, for an unknown reason, we're seeing extraneous wakeups where no work is possible, or worse, where crypto operations are done over and over; I don't know.

During your tests on 3.2, it could be helpful if you could issue this on the CLI:

  • set profiling tasks on # to enable task profiling
  • wait about 10 seconds
  • show profiling tasks

I'm interested in seeing the number of calls to the IO callbacks. I suspect some of them might be insanely looping. And at least we'll see the average CPU consumption per call, which should give us some hints as well.
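
If it helps, with the admin socket from your configuration this can be scripted like this (adjust the socket path if needed):

$ echo "set profiling tasks on" | socat stdio /var/lib/haproxy/run/haproxy.sock
$ sleep 10
$ echo "show profiling tasks" | socat stdio /var/lib/haproxy/run/haproxy.sock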

wtarreau avatar Sep 19 '25 09:09 wtarreau

Also, two series of show info;show activity 10 seconds apart during the test would be very helpful to spot problems. For example we'll be able to see if we're saturated doing new key signs (e.g. maybe the errors are upsetting vegeta which tries a new handshake and only counts successful ones, resulting in a lot of calculations on one side and a lower number on the other one).

wtarreau avatar Sep 19 '25 09:09 wtarreau

Are the tests only in TLS 1.3? I'm wondering whether there are differences with TLS 1.2 in post-handshake processing.

wlallemand avatar Sep 19 '25 09:09 wlallemand

Also you mentioned OpenSSL 1.1 but I'm seeing OpenSSL 3.0.7 1 Nov 2022 in your haproxy -vv which might be a good reason for your problems.

wlallemand avatar Sep 19 '25 09:09 wlallemand

Also you mentioned OpenSSL 1.1 but I'm seeing OpenSSL 3.0.7 1 Nov 2022 in your haproxy -vv which might be a good reason for your problems.

HAProxy 2.8.10 was running on OpenSSL3. But the 2.4.0 (and 2.4-dev12) tests we did were running on OpenSSL1:

Additional Information: vegeta tests with CPU monitoring were done against 2.3.21 and 2.4.0, both compiled with OpenSSL 1.1. 2.4.0 shows CPU trends similar to 2.8.10 and gets better TLS performance when 4c48edba4f45bb78f41af7d79d3c176710fe6a90 is reverted. 2.3.21 does not show TLS performance problems with a 4096-bit RSA key certificate. That's how we narrowed the problem down to 2.4-dev12.

We considered the OpenSSL difference at the very beginning, and we don’t have evidence that OpenSSL 3 is the root cause. However, if there is a good reason to believe it may be related to OpenSSL, we can try testing 2.8.10 with OpenSSL 1.

alebedev87 avatar Sep 19 '25 10:09 alebedev87

We wrote a long article about OpenSSL 3.0 problems: https://www.haproxy.com/blog/state-of-ssl-stacks. If your previous 2.2 deployment used OpenSSL 1.1.1 and you migrated to HAProxy 2.8 with OpenSSL 3.0, there's absolutely no chance that you will get the same level of performance.

wlallemand avatar Sep 19 '25 10:09 wlallemand

However if you made your 2.4 tests with OpenSSL 1.1.1, there's probably something else. But you should stick with 1.1.1 for your tests to avoid problems related to that.

wlallemand avatar Sep 19 '25 10:09 wlallemand

During your tests on 3.2, it could be helpful if you could issue this on the CLI:

  • set profiling tasks on # to enable task profiling
  • wait about 10 seconds
  • show profiling tasks

I'm interested in seeing the number of calls to the IO callbacks. I suspect some of them might be insanely looping. And at least we'll see the average CPU consumption per call, which should give us some hints as well.

We did several runs of echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head on HAProxy 3.2.4 with OpenSSL1:

Requests      [total, rate, throughput]         292721, 1626.18, 946.99
Success       [ratio]                           59.40%

Requests      [total, rate, throughput]         299801, 1663.30, 1020.26
Success       [ratio]                           62.54%

Requests      [total, rate, throughput]         248612, 1381.25, 1191.59
Success       [ratio]                           87.64%

Requests      [total, rate, throughput]         249379, 1385.37, 1176.66
Success       [ratio]                           86.30%
$ haproxy -vv
HAProxy version 3.2.4-98813a1 2025/08/13 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2030.
Known bugs: http://www.haproxy.org/bugs/bugs-3.2.4.html
Running on: Linux 5.14.0-427.74.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jun 13 12:07:50 EDT 2025 x86_64
Build options : 
  TARGET  = linux-glibc
  CC      = cc
  CFLAGS  = -O2 -g -fwrapv
  OPTIONS = USE_LINUX_TPROXY=1 USE_CRYPT_H=1 USE_GETADDRINFO=1 USE_OPENSSL=1 USE_ZLIB=1 USE_PCRE=1
  DEBUG   = 

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY -LUA -MATH -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OPENSSL_AWSLC -OPENSSL_WOLFSSL -OT +PCRE -PCRE2 -PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL -PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT -SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL +ZLIB +ACME

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_TGROUPS=32, MAX_THREADS=1024, default=4).
Built with SSL library version : OpenSSL 1.1.1k  FIPS 25 Mar 2021
Running on SSL library version : OpenSSL 1.1.1k  FIPS 25 Mar 2021
SSL library supports TLS extensions : yes
SSL library supports SNI : yes
SSL library FIPS mode : no
SSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.42 2018-03-20
Running on PCRE version : 8.42 2018-03-20
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 8.5.0 20210514 (Red Hat 8.5.0-10)

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=SPOP  side=BE     mux=SPOP  flags=HOL_RISK|NO_UPG
       spop : mode=SPOP  side=BE     mux=SPOP  flags=HOL_RISK|NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG

Available services : none

Available filters :
	[BWLIM] bwlim-in
	[BWLIM] bwlim-out
	[CACHE] cache
	[COMP] compression
	[FCGI] fcgi-app
	[SPOE] spoe
	[TRACE] trace

Results of show profiling tasks:

Tasks activity over 1306.546 sec till 0.000 sec ago:
  function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
  sc_conn_io_cb+0            10421808   16.62s    1.594us   16.31d    135.2ms <- sc_app_chk_rcv_conn@src/stconn.c:844 tasklet_wakeup
  sc_conn_io_cb+0             6207583   1.555m    15.03us   10.13d    141.0ms <- sock_conn_iocb@src/sock.c:1098 tasklet_wakeup
  process_stream+0            4929446   1.063m    12.94us   9.177h    6.702ms <- sc_notify@src/stconn.c:1206 task_wakeup
  process_stream+0            3435716   2.332m    40.72us   3.616h    3.789ms <- stream_new@src/stream.c:583 task_wakeup
  ssl_sock_io_cb              2752977   7.076s    2.570us   4.303d    135.1ms <- sock_conn_iocb@src/sock.c:1098 tasklet_wakeup
  h1_io_cb                    2751430   57.92s    21.05us   3.931d    123.4ms <- ssl_sock_io_cb@src/ssl_sock.c:5812 tasklet_wakeup
  h1_io_cb                    2733164   11.69s    4.278us   1.875m    41.17us <- sock_conn_iocb@src/sock.c:1098 tasklet_wakeup
  xprt_handshake_io_cb        2109333   12.30s    5.833us   12.02m    341.9us <- xprt_handshake_start@src/xprt_handshake.c:140 tasklet_wakeup
  sc_conn_io_cb+0             2018169   7.513s    3.722us   11.88m    353.2us <- h1_wake_stream_for_recv@src/mux_h1.c:3623 tasklet_wakeup
  h1_timeout_task              789116   338.6ms   429.0ns   1.272h    5.805ms <- h1_release@src/mux_h1.c:1382 task_wakeup
  h1_io_cb                     770425   1.684s    2.186us   1.123m    87.43us <- sock_conn_iocb@src/sock.c:1077 tasklet_wakeup
  sc_conn_io_cb+0              770425   16.22s    21.05us   9.315m    725.4us <- h1_wake_stream_for_send@src/mux_h1.c:3633 tasklet_wakeup
  sc_conn_io_cb+0              702309   3.146s    4.479us   7.220m    616.8us <- xprt_handshake_io_cb@src/xprt_handshake.c:92 tasklet_wakeup
  ssl_sock_io_cb               437121   1.070h    8.811ms   17.29d    3.418s  <- ssl_sock_start@src/ssl_sock.c:5008 tasklet_wakeup
  process_stream+0             435748   2.102s    4.824us   8.702m    1.198ms <- run_tasks_from_lists@src/task.c:669 task_queue
  accept_queue_process+0       266364   6.662s    25.01us   21.71m    4.891ms <- listener_accept@src/listener.c:1504 tasklet_wakeup
  session_expire_embryonic+0   252787   8.420s    33.31us   4.854m    1.152ms <- session_accept_fd@src/session.c:344 task_queue
  accept_queue_process+0       231692   4.568s    19.72us   25.87m    6.699ms <- accept_queue_process@src/listener.c:182 tasklet_wakeup
  h1_io_cb                      27884   67.83ms   2.432us   1.996m    4.295ms <- h1_takeover@src/mux_h1.c:5593 tasklet_wakeup
  ssl_sock_io_cb                18820   16.46ms   874.0ns   7.891m    25.16ms <- conn_subscribe@src/connection.c:959 tasklet_wakeup
  xprt_handshake_io_cb          13738   113.4ms   8.251us   1.423m    6.213ms <- sock_conn_iocb@src/sock.c:1098 tasklet_wakeup
  process_stream+0               7246   16.36ms   2.258us   23.47s    3.239ms <- sc_app_chk_snd_conn@src/stconn.c:915 task_wakeup
  process_chk                    5203   443.9ms   85.31us   4.081s    784.3us <- wake_srv_chk@src/check.c:1091 task_wakeup
  other                          2242   1.310ms   584.0ns   2.237m    59.88ms <- session_accept_fd@src/session.c:344 task_queue
  process_chk                    2082   265.8ms   127.7us   1.926s    924.8us <- run_tasks_from_lists@src/task.c:669 task_queue
  h1_io_cb                       1229   1.144ms   930.0ns   7.987ms   6.499us <- h1_shutw_conn@src/mux_h1.c:4581 tasklet_wakeup
  h1_io_cb                        954   11.08ms   11.61us   310.7ms   325.7us <- conn_subscribe@src/connection.c:959 tasklet_wakeup
  srv_chk_io_cb                   516   1.421ms   2.753us   128.1ms   248.3us <- ssl_sock_io_cb@src/ssl_sock.c:5812 tasklet_wakeup
  h1_timeout_task                 387   33.74ms   87.18us   34.15ms   88.25us <- h1_init@src/mux_h1.c:1315 task_queue
  ssl_sock_io_cb                  282   7.771ms   27.56us   15.55ms   55.13us <- sock_conn_iocb@src/sock.c:1077 tasklet_wakeup
  task_process_applet             137   142.5ms   1.040ms   70.21ms   512.5us <- sc_applet_create@src/stconn.c:573 appctx_wakeup
  task_process_applet              98   221.5us   2.260us   80.54ms   821.8us <- sc_app_shut_applet@src/stconn.c:976 appctx_wakeup
  srv_cleanup_idle_conns           61   189.1us   3.099us   17.04ms   279.4us <- run_tasks_from_lists@src/task.c:669 task_queue
  srv_cleanup_toremove_conns       26   4.544ms   174.8us   63.95ms   2.459ms <- srv_cleanup_idle_conns@src/server.c:7351 task_wakeup
  srv_cleanup_idle_conns           14   9.675us   691.0ns   4.785ms   341.8us <- srv_add_to_idle_list@src/server.c:7262 task_schedule
  srv_cleanup_toremove_conns        2   52.55us   26.27us   20.36ms   10.18ms <- connect_server@src/backend.c:1890 task_wakeup
  task_process_applet               1   636.1us   636.1us   109.8us   109.8us <- sc_app_chk_snd_applet@src/stconn.c:1043 appctx_wakeup

Note: there were other vegeta runs at ~100, ~300 and ~1000 rps before the four ~5000-worker runs shown at the top.

alebedev87 avatar Sep 19 '25 13:09 alebedev87

Also, two series of show info;show activity 10 seconds apart during the test would be very helpful to spot problems.

On the same HAProxy 3.2.4 with OpenSSL1.

First run:

Name: HAProxy
Version: 3.2.4-98813a1
Release_date: 2025/08/13
Nbthread: 4
Nbproc: 1
Process_num: 1
Pid: 15
Uptime: 0d 0h44m00s
Uptime_sec: 2640
Memmax_MB: 0
PoolAlloc_MB: 30
PoolUsed_MB: 30
PoolFailed: 0
Ulimit-n: 100055
Maxsock: 100055
Maxconn: 50000
Hard_maxconn: 50000
CurrConns: 22962
CumConns: 1708051
CumReq: 3751932
MaxSslConns: 0
CurrSslConns: 21506
CumSslConns: 852340
Maxpipes: 0
PipesUsed: 0
PipesFree: 0
ConnRate: 1721
ConnRateLimit: 0
MaxConnRate: 5752
SessRate: 1717
SessRateLimit: 0
MaxSessRate: 5752
SslRate: 863
SslRateLimit: 0
MaxSslRate: 2876
SslFrontendKeyRate: 0
SslFrontendMaxKeyRate: 3034
SslFrontendSessionReuse_pct: 100
SslBackendKeyRate: 0
SslBackendMaxKeyRate: 10
SslCacheLookups: 0
SslCacheMisses: 0
CompressBpsIn: 0
CompressBpsOut: 0
CompressBpsRateLim: 0
ZlibMemUsage: 0
MaxZlibMemUsage: 0
Tasks: 23025
Run_queue: 20956
Idle_pct: 0
node: router-test-75ffc8746-gq4rn
Stopping: 0
Jobs: 22961
Unstoppable Jobs: 0
Listeners: 5
ActivePeers: 0
ConnectedPeers: 0
DroppedLogs: 1211949
BusyPolling: 0
FailedResolutions: 0
TotalBytesOut: 4141035696
TotalSplicedBytesOut: 0
BytesOutRate: 1450304
DebugCommandsIssued: 0
CumRecvLogs: 0
Build info: 3.2.4-98813a1
Memmax_bytes: 0
PoolAlloc_bytes: 32075104
PoolUsed_bytes: 32075104
Start_time_sec: 1758286022
Tainted: 0
TotalWarnings: 86
MaxconnReached: 0
BootTime_ms: 10
Niced_tasks: 0
CurrStreams: 1443
CumStreams: 3751544
BlockedTrafficWarnings: 0

thread_id: 1 (1..4)
date_now: 1758288662.938673
uptime_now: 2640.368699
ctxsw: 46214508 [ 11700473 11610510 11548189 11355336 ]
tasksw: 10901458 [ 2754468 2738272 2725568 2683150 ]
empty_rq: 1561 [ 556 338 296 371 ]
long_rq: 268098 [ 67948 67541 66643 65966 ]
curr_rq: 20956 [ 5231 5250 5238 5237 ]
loops: 2733438 [ 685989 682538 687609 677302 ]
wake_tasks: 268192 [ 67975 67563 66663 65991 ]
wake_signal: 0 [ 0 0 0 0 ]
poll_io: 2711853 [ 680902 676583 682377 671991 ]
poll_exp: 0 [ 0 0 0 0 ]
poll_drop_fd: 10541 [ 2968 2414 2408 2751 ]
poll_skip_fd: 0 [ 0 0 0 0 ]
conn_dead: 0 [ 0 0 0 0 ]
stream_calls: 9769366 [ 2474046 2454321 2446205 2394794 ]
pool_fail: 0 [ 0 0 0 0 ]
buf_wait: 0 [ 0 0 0 0 ]
cpust_ms_tot: 784627 [ 183893 197567 200842 202325 ]
cpust_ms_1s: 476 [ 112 159 101 104 ]
cpust_ms_15s: 8639 [ 1925 2441 1932 2341 ]
avg_cpu_pct: 100 [ 100 100 100 100 ]
avg_loop_us: 18774 [ 18381 19134 18342 19239 ]
accepted: 1699883 [ 426671 424509 426546 422157 ]
accq_pushed: 1699883 [ 424969 423786 427854 423274 ]
accq_full: 0 [ 0 0 0 0 ]
accq_ring: 8 [ 0 5 3 0 ]
fd_takeover: 33334 [ 8185 8558 8096 8495 ]
check_adopted: 40 [ 12 4 14 10 ]
check_started: 8176 [ 2166 1544 2351 2115 ]
check_active: 0 [ 0 0 0 0 ]
check_running: 0 [ 0 0 0 0 ]

~10 seconds later:

Name: HAProxy
Version: 3.2.4-98813a1
Release_date: 2025/08/13
Nbthread: 4
Nbproc: 1
Process_num: 1
Pid: 15
Uptime: 0d 0h44m10s
Uptime_sec: 2650
Memmax_MB: 0
PoolAlloc_MB: 35
PoolUsed_MB: 35
PoolFailed: 0
Ulimit-n: 100055
Maxsock: 100055
Maxconn: 50000
Hard_maxconn: 50000
CurrConns: 26812
CumConns: 1724469
CumReq: 3768869
MaxSslConns: 0
CurrSslConns: 25472
CumSslConns: 860550
Maxpipes: 0
PipesUsed: 0
PipesFree: 0
ConnRate: 1343
ConnRateLimit: 0
MaxConnRate: 5752
SessRate: 1308
SessRateLimit: 0
MaxSessRate: 5752
SslRate: 664
SslRateLimit: 0
MaxSslRate: 2876
SslFrontendKeyRate: 0
SslFrontendMaxKeyRate: 3034
SslFrontendSessionReuse_pct: 100
SslBackendKeyRate: 0
SslBackendMaxKeyRate: 10
SslCacheLookups: 0
SslCacheMisses: 0
CompressBpsIn: 0
CompressBpsOut: 0
CompressBpsRateLim: 0
ZlibMemUsage: 0
MaxZlibMemUsage: 0
Tasks: 26882
Run_queue: 25396
Idle_pct: 0
node: router-test-75ffc8746-gq4rn
Stopping: 0
Jobs: 26769
Unstoppable Jobs: 0
Listeners: 5
ActivePeers: 0
ConnectedPeers: 0
DroppedLogs: 1216974
BusyPolling: 0
FailedResolutions: 0
TotalBytesOut: 4154929041
TotalSplicedBytesOut: 0
BytesOutRate: 1126336
DebugCommandsIssued: 0
CumRecvLogs: 0
Build info: 3.2.4-98813a1
Memmax_bytes: 0
PoolAlloc_bytes: 36929904
PoolUsed_bytes: 36929904
Start_time_sec: 1758286022
Tainted: 0
TotalWarnings: 86
MaxconnReached: 0
BootTime_ms: 10
Niced_tasks: 0
CurrStreams: 1294
CumStreams: 3768481
BlockedTrafficWarnings: 0

thread_id: 1 (1..4)
date_now: 1758288673.192806
uptime_now: 2650.622832
ctxsw: 46448259 [ 11759828 11665761 11606544 11416126 ]
tasksw: 10968993 [ 2771530 2754083 2742590 2700790 ]
empty_rq: 1561 [ 556 338 296 371 ]
long_rq: 270154 [ 68481 68032 67154 66487 ]
curr_rq: 25395 [ 6232 6467 6440 6256 ]
loops: 2735494 [ 686522 683029 688120 677823 ]
wake_tasks: 270248 [ 68508 68054 67174 66512 ]
wake_signal: 0 [ 0 0 0 0 ]
poll_io: 2713909 [ 681435 677074 682888 672512 ]
poll_exp: 0 [ 0 0 0 0 ]
poll_drop_fd: 10543 [ 2969 2414 2408 2752 ]
poll_skip_fd: 0 [ 0 0 0 0 ]
conn_dead: 0 [ 0 0 0 0 ]
stream_calls: 9828206 [ 2488924 2468054 2461102 2410126 ]
pool_fail: 0 [ 0 0 0 0 ]
buf_wait: 0 [ 0 0 0 0 ]
cpust_ms_tot: 791536 [ 185463 199366 202732 203975 ]
cpust_ms_1s: 1013 [ 228 259 402 124 ]
cpust_ms_15s: 9639 [ 2103 2648 2444 2444 ]
avg_cpu_pct: 100 [ 100 100 100 100 ]
avg_loop_us: 19750 [ 18932 20648 19765 19654 ]
accepted: 1716316 [ 430931 428437 430631 426317 ]
accq_pushed: 1716316 [ 429115 427772 432007 427422 ]
accq_full: 0 [ 0 0 0 0 ]
accq_ring: 46 [ 0 46 0 0 ]
fd_takeover: 33392 [ 8195 8581 8109 8507 ]
check_adopted: 40 [ 12 4 14 10 ]
check_started: 8203 [ 2169 1547 2361 2126 ]
check_active: 0 [ 0 0 0 0 ]
check_running: 0 [ 0 0 0 0 ]

alebedev87 avatar Sep 19 '25 13:09 alebedev87

Very useful, thank you! So what we can see already:

  • for 1.2M connections you're seeing around 10M wakeups of the connection iocb just from the receive callback, that's 8 wakeups per connection. It seems high to me but nothing terrifying.
  • two wakeups of the SSL iocb on avg per connection seems reasonable
  • the average latency of most tasks is huge (hundreds of ms), and it definitely explains why reverting the patch makes things go smoother: in that case only one SSL context will be handled per polling loop, so the rest is much more fluid. This also proves that the SSL wakeups are quite expensive, though not all of them given that the cpu_avg remains in the microseconds range.
  • the average loop time is huge (almost 20 ms), it should stay in the microseconds to tens of microseconds range, at worst, 100+ us.
  • the runqueue size is abnormal: 25k tasks, that's roughly 5 tasks per connection, and everything runs to completion with such a size
  • the connection rate is quite low (664/s) and these are only resumed, not even rekeyed.

It behaves as if something wouldn't stop after a test, e.g. connections would fail and stay in the failed state forever, eating CPU. I suspect that after a run is complete, haproxy's CPU stays high. It really sounds like an incorrectly caught error. Unfortunately when these happen, they can be extremely sensitive to the timings and environment, so it may take a while before we manage to reproduce it. But we can also count on luck.

In any case, I can confirm that at such low rates, it shouldn't eat CPU like that at all!

BTW I remember in the past, we noticed that having large dhparams could seriously affect performance. I'm not seeing anything standing out here so it's hard to say, but if you have a 4096 dhparam in your certs file, it might be useful to try again with a lower one to see if it is in any way related to this or not.

Also the output of a perf top during a test could help as well.
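
For example, something like this on the node running haproxy (assuming perf and the libcrypto symbols are available there):

$ perf top -p "$(pgrep -d, haproxy)"   # comma-separated list of haproxy PIDs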

wtarreau avatar Sep 19 '25 15:09 wtarreau

Thank you for the swift reply!

I suspect that after a run is complete, haproxy's CPU stays high

We cannot confirm that. CPU usage drops back down once the test run completes. Correction: HAProxy does keep high CPU for about 30 seconds after the run completes. I apologize - I didn’t pay attention to the CPU immediately after the test run. Now that you pointed this out, I have double-checked and confirmed it.

but if you have a 4096 dhparam in your certs file, it might be useful to try again with a lower one to see if it is in any way related to this or not

The cert/key PEM files we used for the tests don’t contain a DH PARAMETERS block. We rely only on tune.ssl.default-dh-param, which is set to 2048 in the global section. This setting isn’t exposed to end users, so it should never exceed 2048.
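
For reference, this was verified simply by checking the PEM file used on the bind line (path taken from the configuration above):

$ grep -c "BEGIN DH PARAMETERS" /var/lib/haproxy/router/certs/default.pem
0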

Also the output of a perf top during a test could help as well.

We will try to gather this data.

alebedev87 avatar Sep 22 '25 07:09 alebedev87

Also the output of a perf top during a test could help as well.

HAProxy 3.2.4 with OpenSSL1 was tested using -max-workers 5000 -rate 0 --duration=1m:

    38.08%  haproxy  libcrypto.so.1.1.1k  [.] __bn_sqrx8x_reduction
    21.29%  haproxy  libcrypto.so.1.1.1k  [.] bn_sqrx8x_internal
    19.89%  haproxy  libcrypto.so.1.1.1k  [.] mulx4x_internal
     1.75%  haproxy  [kernel.kallsyms]    [k] masked_flow_lookup
     1.56%  haproxy  libcrypto.so.1.1.1k  [.] __bn_postx4x_internal
     1.01%  haproxy  [kernel.kallsyms]    [k] __do_softirq
     0.60%  haproxy  libcrypto.so.1.1.1k  [.] bn_mulx4x_mont
     0.42%  haproxy  libcrypto.so.1.1.1k  [.] fe_mul
     0.38%  haproxy  [kernel.kallsyms]    [k] __nf_conntrack_find_get
     0.33%  haproxy  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
     0.30%  haproxy  libcrypto.so.1.1.1k  [.] x25519_fe64_sqr
     0.25%  haproxy  libcrypto.so.1.1.1k  [.] cmov
     0.18%  haproxy  [kernel.kallsyms]    [k] _raw_spin_lock
     0.18%  haproxy  libcrypto.so.1.1.1k  [.] x25519_fe64_mul
     0.16%  haproxy  libc-2.28.so         [.] _int_malloc
     0.16%  haproxy  libc-2.28.so         [.] malloc
     0.15%  haproxy  libcrypto.so.1.1.1k  [.] OPENSSL_cleanse
     0.15%  haproxy  [kernel.kallsyms]    [k] nft_do_chain
     0.15%  haproxy  [kernel.kallsyms]    [k] ___slab_alloc
     0.14%  haproxy  libcrypto.so.1.1.1k  [.] fe_sq
     0.14%  haproxy  [kernel.kallsyms]    [k] finish_task_switch.isra.0
     0.14%  haproxy  libcrypto.so.1.1.1k  [.] bn_powerx5
     0.13%  haproxy  [kernel.kallsyms]    [k] kmem_cache_free
     0.13%  haproxy  [kernel.kallsyms]    [k] syscall_enter_from_user_mode
     0.12%  haproxy  libcrypto.so.1.1.1k  [.] bn_div_fixed_top
     0.12%  haproxy  [kernel.kallsyms]    [k] exit_to_user_mode_loop
     0.11%  haproxy  libc-2.28.so         [.] _int_free
     0.11%  haproxy  libcrypto.so.1.1.1k  [.] bn_mul_add_words

HAProxy 3.2.4 with OpenSSL1 was tested using -max-workers 5000 -rate 0 --duration=3m:

    37.59%  haproxy  libcrypto.so.1.1.1k  [.] __bn_sqrx8x_reduction
    21.01%  haproxy  libcrypto.so.1.1.1k  [.] bn_sqrx8x_internal
    19.55%  haproxy  libcrypto.so.1.1.1k  [.] mulx4x_internal
     1.67%  haproxy  [kernel.kallsyms]    [k] masked_flow_lookup
     1.56%  haproxy  libcrypto.so.1.1.1k  [.] __bn_postx4x_internal
     1.01%  haproxy  [kernel.kallsyms]    [k] __do_softirq
     0.60%  haproxy  libcrypto.so.1.1.1k  [.] bn_mulx4x_mont
     0.44%  haproxy  [kernel.kallsyms]    [k] __nf_conntrack_find_get
     0.42%  haproxy  libcrypto.so.1.1.1k  [.] fe_mul
     0.35%  haproxy  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
     0.30%  haproxy  libcrypto.so.1.1.1k  [.] x25519_fe64_sqr
     0.23%  haproxy  libcrypto.so.1.1.1k  [.] cmov
     0.23%  haproxy  libc-2.28.so         [.] _int_malloc
     0.21%  haproxy  [kernel.kallsyms]    [k] _raw_spin_lock
     0.18%  haproxy  libcrypto.so.1.1.1k  [.] x25519_fe64_mul
     0.17%  haproxy  [kernel.kallsyms]    [k] nft_do_chain
     0.17%  haproxy  libc-2.28.so         [.] malloc
     0.16%  haproxy  [kernel.kallsyms]    [k] ___slab_alloc
     0.16%  haproxy  libcrypto.so.1.1.1k  [.] OPENSSL_cleanse
     0.15%  haproxy  [kernel.kallsyms]    [k] kmem_cache_free
     0.14%  haproxy  [kernel.kallsyms]    [k] syscall_enter_from_user_mode
     0.14%  haproxy  [kernel.kallsyms]    [k] finish_task_switch.isra.0
     0.14%  haproxy  libcrypto.so.1.1.1k  [.] fe_sq
     0.14%  haproxy  libcrypto.so.1.1.1k  [.] bn_powerx5
     0.13%  haproxy  [kernel.kallsyms]    [k] __fget_light
     0.13%  haproxy  libc-2.28.so         [.] _int_free
     0.12%  haproxy  haproxy              [.] process_stream
     0.12%  haproxy  [kernel.kallsyms]    [k] exit_to_user_mode_loop

I hope I managed to resolve the libcrypto symbols correctly. It was not quite straightforward from inside OpenShift.

It looks like most of the CPU time is spent on TLS handshakes. The bn_* (big-number) and mulx4x functions are part of OpenSSL's modular arithmetic, used for the RSA private-key operations during the handshake.

alebedev87 avatar Sep 22 '25 17:09 alebedev87

How is OpenSSL configured and built? I recall an accidental installation from FreeBSD ports with "assembler optimization" disabled (not on purpose, just the wrong checkbox ticked), and because of that, big-number math took ~25% more CPU than it should.
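
For what it's worth, two quick checks (assuming the openssl binary matches the libcrypto that haproxy links against): the build options and compiler flags usually reveal a no-asm build, and openssl speed gives a baseline for raw RSA sign performance on the box:

$ openssl version -a | grep -Ei 'options|compiler'
$ openssl speed rsa4096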

chipitsine avatar Sep 22 '25 18:09 chipitsine

HAProxy 3.2.4 with OpenSSL1, but with the problematic commit (4c48edba4f45bb78f41af7d79d3c176710fe6a90) reverted, gives a 100% success rate with the same parameters (5000 workers / 1m and 5000 workers / 3m):

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=1m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         579218, 9651.16, 9645.89
Duration      [total, attack, wait]             1m0s, 1m0s, 32.307ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.777ms, 299.099ms, 193.156ms, 323.798ms, 377.102ms, 695.592ms, 11.714s
Bytes In      [total, mean]                     31278052, 54.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:579213  408:5  
Error Set:
408 Request Time-out

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         2011351, 11173.16, 11162.89
Duration      [total, attack, wait]             3m0s, 3m0s, 165.03ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.769ms, 235.119ms, 199.623ms, 328.015ms, 367.911ms, 461.493ms, 12.239s
Bytes In      [total, mean]                     108613290, 54.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:2011345  408:6  
Error Set:
408 Request Time-out

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         2000227, 11108.64, 11098.97
Duration      [total, attack, wait]             3m0s, 3m0s, 153.49ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.762ms, 238.002ms, 198.027ms, 328.518ms, 368.636ms, 461.825ms, 14.085s
Bytes In      [total, mean]                     108014274, 54.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:2000191  408:36  
Error Set:
408 Request Time-out

And shows the following perf top output:

     5.66%  haproxy  [kernel.kallsyms]    [k] masked_flow_lookup
     5.53%  haproxy  libcrypto.so.1.1.1k  [.] __bn_sqrx8x_reduction
     4.02%  haproxy  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
     3.10%  haproxy  libcrypto.so.1.1.1k  [.] bn_sqrx8x_internal
     2.88%  haproxy  libcrypto.so.1.1.1k  [.] mulx4x_internal
     2.78%  haproxy  [kernel.kallsyms]    [k] __do_softirq
     1.37%  haproxy  [kernel.kallsyms]    [k] finish_task_switch.isra.0
     1.36%  haproxy  [kernel.kallsyms]    [k] syscall_enter_from_user_mode
     1.33%  haproxy  [kernel.kallsyms]    [k] __nf_conntrack_find_get
     1.09%  haproxy  haproxy              [.] sess_build_logline_orig
     1.08%  haproxy  [kernel.kallsyms]    [k] nft_do_chain
     0.88%  haproxy  haproxy              [.] h1_headers_to_hdr_list
     0.86%  haproxy  [kernel.kallsyms]    [k] __fget_light
     0.84%  haproxy  haproxy              [.] sock_conn_iocb
     0.83%  haproxy  [kernel.kallsyms]    [k] _raw_spin_lock
     0.76%  haproxy  haproxy              [.] h1_snd_buf
     0.76%  haproxy  haproxy              [.] process_stream
     0.71%  haproxy  haproxy              [.] sc_conn_recv
     0.58%  haproxy  haproxy              [.] run_tasks_from_lists
     0.56%  haproxy  haproxy              [.] __pool_alloc
     0.56%  haproxy  haproxy              [.] __http_find_header
     0.54%  haproxy  [kernel.kallsyms]    [k] do_epoll_ctl
     0.54%  haproxy  libc-2.28.so         [.] vfprintf
     0.49%  haproxy  [kernel.kallsyms]    [k] copy_user_enhanced_fast_string
     0.48%  haproxy  [kernel.kallsyms]    [k] avc_lookup
     0.45%  haproxy  haproxy              [.] h1_process_demux
     0.43%  haproxy  haproxy              [.] _do_poll
     0.42%  haproxy  haproxy              [.] sc_notify

alebedev87 avatar Sep 22 '25 18:09 alebedev87

OK, thanks for the measurements. Now it's getting pretty clear to me what is happening: for a reason we still don't know, the SSL lib is not completely done with asymmetric crypto when we remove the flag, so it occupies all the CPU, causing extreme latencies that make requests time out. In this case the load generator has no other option but to disconnect and reconnect, further increasing the asymmetric crypto cost and making the situation even worse.

I'll see what I can propose to postpone the removal of that flag. I think we could do it once we receive the first byte of data, which should definitely confirm that the crypto is complete. I'll check how to do this.

wtarreau avatar Sep 22 '25 19:09 wtarreau

OK thanks for the measurements. Now it's getting pretty clear to me what is happening: for a reason we still don't know, the SSL lib is not completely done with asymmetric crypto when we remove the flag so it occupies all the CPU

That could be something like the computation of a new TLS ticket which is sent after the handshake.

Are the tests only in TLS 1.3 ? I'm wondering if we don't have differences with TLS1.2 on post-handshakes.

That's why I was asking about that: we may have run the tests with TLS 1.2 when the mentioned patch was written, if this is related.

wlallemand avatar Sep 22 '25 20:09 wlallemand

Are the tests only in TLS 1.3 ? I'm wondering if we don't have differences with TLS1.2 on post-handshakes.

Most likely, yes. vegeta does not provide a flag to set the maximum TLS version, so Go's crypto/tls defaults for Config.MinVersion and MaxVersion are used, which are TLS 1.2 and TLS 1.3 respectively. Our HAProxy configures TLS 1.3 ciphersuites, so the negotiation should end up on TLS 1.3. I didn't capture a pcap to verify this; if you would like it confirmed, I can do that.
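
If needed, a quick check without a pcap can be done from any client that has openssl available (assuming the route is reachable from it):

$ echo | openssl s_client -connect hello-world.com:443 -servername hello-world.com 2>/dev/null | grep -E 'Protocol|Cipher'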

alebedev87 avatar Sep 22 '25 21:09 alebedev87

I'll see what I can propose to postpone the removal of that flag. I think we could do it once we receive the first byte of data, which should definitely confirm that the crypto is complete. I'll check how to do this.

Let us know if any testing is needed. We would be happy to try any patch that attempts to fix this issue.

alebedev87 avatar Sep 22 '25 21:09 alebedev87

If you could try the following patch, that would be helpful. It only drops the HEAVY flag once bytes are received over the socket, indicating that we don't need expensive operations anymore.

postpone-heavy.diff.txt

wtarreau avatar Sep 23 '25 02:09 wtarreau

@alebedev87 I'm trying to reproduce. I'm puzzled because you have "accept-proxy" on your bind line, so vegeta cannot attack it this way. Did you disable it, or are you passing through another component? Also, the "redacted" ciphers do not help to reproduce the issue; could you please at least indicate which one suffices to reproduce the problem? You could for example log the ciphers in haproxy and place the equivalent ones in the conf.

wtarreau avatar Sep 23 '25 07:09 wtarreau

I'm facing difficulties with vegeta: it closes the source ports in a dirty way and leaves tons of TIME_WAIT sockets, which quickly end up eating all the source ports on the machine until we cannot connect anymore. I guess I'll have to try it on another machine :-(

wtarreau avatar Sep 23 '25 07:09 wtarreau

I'm trying to reproduce. I'm puzzled because you have "accept-proxy" on your bind line, so vegeta cannot attack it this way. Did you disable it or are you passing through another component ?

We haven’t disabled accept-proxy. The tests are running with the standard OpenShift HAProxy configuration. The traffic is routed through an AWS load balancer, which is the default setup to publish the OpenShift router.

You could for example log the ciphers in haproxy and place the equivalent ones in the conf.

ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256

alebedev87 avatar Sep 23 '25 07:09 alebedev87

If you could try the following patch, that would be helpful. It only drops the HEAVY flag once bytes are received over the socket, indicating that we don't need expensive operations anymore. postpone-heavy.diff.txt

We applied this patch on HAProxy 3.2.4/OpenSSL1:

  • CPU behavior looks similar to the version without the patch.
  • Same for the perf top output.
  • vegeta results got better; the success rate never went below 90% (update: but latencies are still high and request rates are low).
$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=1m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         39830, 663.77, 518.14
Duration      [total, attack, wait]             1m11s, 1m0s, 10.835s
Latencies     [min, mean, 50, 90, 95, 99, max]  4.572ms, 8.002s, 8.325s, 9.38s, 19.873s, 24.449s, 30.009s
Bytes In      [total, mean]                     2311410, 58.03
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           92.15%
Status Codes  [code:count]                      0:131  200:36705  408:2994  
Error Set:
408 Request Time-out
Get "https://hello-world.com/": dial tcp 0.0.0.0:0->10.0.75.141:443: connect: connection refused

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         99907, 555.04, 508.92
Duration      [total, attack, wait]             3m9s, 3m0s, 9.443s
Latencies     [min, mean, 50, 90, 95, 99, max]  1.676s, 9.236s, 9.803s, 10.374s, 10.5s, 22.223s, 30.003s
Bytes In      [total, mean]                     5574364, 55.80
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           96.50%
Status Codes  [code:count]                      0:149  200:96411  408:3347  
Error Set:
408 Request Time-out
Get "https://hello-world.com/": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         97160, 539.76, 498.42
Duration      [total, attack, wait]             3m7s, 3m0s, 6.587s
Latencies     [min, mean, 50, 90, 95, 99, max]  23.63ms, 9.431s, 8.454s, 11.49s, 25.328s, 30.725s, 31.365s
Bytes In      [total, mean]                     5041908, 51.89
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           95.72%
Status Codes  [code:count]                      0:3978  200:93002  408:180  
Error Set:
Get "https://hello-world.com/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Get "https://hello-world.com/": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         114057, 633.59, 588.50
Duration      [total, attack, wait]             3m8s, 3m0s, 7.637s
Latencies     [min, mean, 50, 90, 95, 99, max]  52.896ms, 8.068s, 8.092s, 9.816s, 12.267s, 30.605s, 31.62s
Bytes In      [total, mean]                     5968660, 52.33
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           96.82%
Status Codes  [code:count]                      0:3575  200:110435  408:47  
Error Set:
Get "https://hello-world.com/": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Get "https://hello-world.com/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         87248, 484.69, 410.80
Duration      [total, attack, wait]             3m22s, 3m0s, 22.244s
Latencies     [min, mean, 50, 90, 95, 99, max]  2.034s, 10.692s, 10.696s, 12.115s, 13.412s, 30s, 30.009s
Bytes In      [total, mean]                     4810210, 55.13
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           95.23%
Status Codes  [code:count]                      0:1221  200:83085  408:2942  
Error Set:
408 Request Time-out
Get "https://hello-world.com/": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

$ echo "GET https://hello-world.com/" | vegeta attack -max-workers 5000 -rate 0 -keepalive=true --duration=3m -insecure | tee results.bin | vegeta report | head
Requests      [total, rate, throughput]         159508, 886.14, 841.66
Duration      [total, attack, wait]             3m5s, 3m0s, 5.42s
Latencies     [min, mean, 50, 90, 95, 99, max]  1.539ms, 5.71s, 5.761s, 6.504s, 6.681s, 20.496s, 30.005s
Bytes In      [total, mean]                     8753662, 54.88
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           97.84%
Status Codes  [code:count]                      0:479  200:156063  408:2966
Error Set:
408 Request Time-out
Get "https://hello-world.com/": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

perf top:

  40.15%  haproxy  libcrypto.so.1.1.1k  [.] __bn_sqrx8x_reduction
  22.43%  haproxy  libcrypto.so.1.1.1k  [.] bn_sqrx8x_internal
  20.89%  haproxy  libcrypto.so.1.1.1k  [.] mulx4x_internal
   1.66%  haproxy  libcrypto.so.1.1.1k  [.] __bn_postx4x_internal
   0.82%  haproxy  [kernel.kallsyms]    [k] masked_flow_lookup
   0.78%  haproxy  [kernel.kallsyms]    [k] __do_softirq
   0.64%  haproxy  libcrypto.so.1.1.1k  [.] bn_mulx4x_mont
   0.44%  haproxy  libcrypto.so.1.1.1k  [.] fe_mul
   0.32%  haproxy  libcrypto.so.1.1.1k  [.] x25519_fe64_sqr
   0.28%  haproxy  libc-2.28.so         [.] _int_malloc
   0.25%  haproxy  libcrypto.so.1.1.1k  [.] cmov
   0.23%  haproxy  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
   0.19%  haproxy  libcrypto.so.1.1.1k  [.] x25519_fe64_mul
   0.18%  haproxy  [kernel.kallsyms]    [k] __nf_conntrack_find_get
   0.16%  haproxy  libcrypto.so.1.1.1k  [.] OPENSSL_cleanse
   0.15%  haproxy  libcrypto.so.1.1.1k  [.] fe_sq
   0.15%  haproxy  libc-2.28.so         [.] malloc
   0.14%  haproxy  libcrypto.so.1.1.1k  [.] bn_powerx5
   0.13%  haproxy  [kernel.kallsyms]    [k] _raw_spin_lock
   0.13%  haproxy  libcrypto.so.1.1.1k  [.] bn_div_fixed_top
   0.13%  haproxy  [kernel.kallsyms]    [k] finish_task_switch.isra.0
   0.12%  haproxy  libcrypto.so.1.1.1k  [.] bn_mul_add_words
   0.12%  haproxy  libc-2.28.so         [.] _int_free
   0.11%  haproxy  [kernel.kallsyms]    [k] nft_do_chain
   0.11%  haproxy  [kernel.kallsyms]    [k] exit_to_user_mode_loop
   0.10%  haproxy  [kernel.kallsyms]    [k] syscall_enter_from_user_mode

alebedev87 avatar Sep 23 '25 11:09 alebedev87

We applied this patch on HAProxy 3.4.2/OpenSSL1:

I'm pretty sure this is a typo and it's HAProxy 3.2.4, right?

git001 avatar Sep 23 '25 13:09 git001

Interesting that the latency statistics reported by vegeta are much lower when the original patch is reverted, yet the new test patch shows latency similar to that seen with the original patch. Moreover, the newly suggested patch provides better success rates. That is surprising to me.

rohara avatar Sep 23 '25 15:09 rohara

Thanks @alebedev87, that really helps. Makes sense for accept-proxy; I thought you were injecting directly into haproxy itself. We'll retry with your settings.

@rohara I'm not drawing any conclusion on this. The fact is that as long as too many handshakes are in the queue, latencies and timeouts will occur, and timeouts will cause reconnections, hence new handshakes, hence higher CPU usage, etc. It's just spiraling down towards the bad situation that the very first patch introducing TASK_HEAVY was meant to address.

@wlallemand suggested something interesting regarding server-initiated tickets that would be emitted after the request is received, meaning that haproxy here would still have to perform some crypto operations at that moment, and that could explain this observation. If you can easily run a quick test with "no-tls-tickets" to disable them, it could help confirm or reject this hypothesis. That doesn't mean it suddenly gives us an easy solution, but it helps narrow it down.

The main difficulty I'm seeing is knowing whether or not the TLS layer still needs to perform slow crypto. As you can see, we can't afford to release the flag too early, but we cannot afford to keep it forever either; that really ruins TLS performance.
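
For reference, a minimal way to try this (standard HAProxy syntax, to be adapted into the OpenShift-generated config) is to extend the existing global option:

  ssl-default-bind-options ssl-min-ver TLSv1.2 no-tls-tickets

or to add the no-tls-tickets keyword directly on the fe_sni bind line.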

Also I'm wondering if we're really witnessing crypto on send here. Indeed, unless we've saturated the client's TCP window with tons of data, there's no reason to pass via the scheduler to send data. Normally it should be synchronous from the stream layer. So here I'm suspecting we're having to do that upon a certain Rx operation that we need to identify.

wtarreau avatar Sep 23 '25 19:09 wtarreau