Envoy endpoints not loading since Chrome release 124
Title: Envoy not responding since release of Chrome 124
Description: As part of the Chrome 124 release, Google seems to have widely enabled the post-quantum-secure Kyber768 key encapsulation for TLS 1.3.
Repro steps: In Chrome flags (chrome://flags/), check that "TLS 1.3 hybridized Kyber support" is enabled. From Chrome, connect to any TLS endpoint served by Envoy; it gets stuck on a spinning wheel until it fails.
Note: I was running a very old version of Envoy (1.15), but I upgraded all the way to 1.30.1.
I would appreciate any suggestions on this issue; I am not well versed in Envoy.
cc @ggreenway who may know more about the different TLS protocols that are supported.
I would find it shocking if chrome wouldn't be willing to negotiate one of the older TLS 1.3 ciphers in this case; I'd guess that most TLS endpoints on the internet don't yet support the post-quantum cipher suites. Regardless, I think if there's an issue here, it's a bug in chrome.
The problem has made some news websites like MSN. I'm not sure what the policy is about posting links, so I'll post an excerpt:
Despite months of testing, the problem seems to have arisen from web servers failing to adequately implement TLS, rather than an issue with Chrome. The error results in the rejection of connections that use the Kyber768 quantum-resistant key agreement algorithm, including connections with Chrome’s hybrid key. Clearly, this is not a simple fix that can be implemented by Chrome, but it requires a larger and more orchestrated effort to transform the Internet into one that can handle sophisticated quantum-safe cryptography. For now, affected users are being advised to disable the TLS 1.3 hybridized Kyber support in Chrome. However, long-term post-quantum secure ciphers will be essential in TLS, and the ability to disable the feature will likely be removed in the future, highlighting the importance of addressing the issue’s root cause earlier on so that websites can be prepared for quantum-based attacks in the future.
I might have found a helpful comment on DDG:
These errors are not caused by a bug in Google Chrome but instead caused by web servers failing to properly implement Transport Layer Security (TLS) and not being able to handle larger ClientHello messages for post-quantum cryptography.
I think I saw a setting that affects this size in Envoy
I can't find such an option; any suggestions?
Can you capture a full tcpdump of the failed handshake and post it?
@ggreenway it doesn't seem to accept pcap files; any suggestions?
Huh, it looks like the tcp window is closed after only 1400 bytes. Can you post the full envoy configuration you used for this test?
1400 bytes seems like a typical MTU; maybe we're limiting the handshake to a single packet?
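For rough context (back-of-the-envelope numbers, not taken from the capture): Chrome's X25519Kyber768 hybrid key share alone is around 1216 bytes (a 1184-byte Kyber768 public key plus a 32-byte X25519 key), which pushes the whole ClientHello past a ~1500-byte MTU, so it can no longer arrive in one packet:

# Approximate ClientHello sizing; the "classic" hello size is an assumed ballpark.
KYBER768_PUBKEY = 1184          # Kyber768 encapsulation key, bytes
X25519_PUBKEY = 32              # classical X25519 share, bytes
HYBRID_KEY_SHARE = KYBER768_PUBKEY + X25519_PUBKEY   # Chrome's X25519Kyber768 share

CLASSIC_HELLO = 500             # rough size of a typical pre-Kyber ClientHello
MTU = 1500                      # typical Ethernet MTU

total = CLASSIC_HELLO + HYBRID_KEY_SHARE
print(f"hybrid key share: {HYBRID_KEY_SHARE} bytes")
print(f"approx ClientHello: {total} bytes -> fits in one {MTU}-byte packet? {total < MTU}")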
Here's a cut-down version of our config; I hope I didn't axe too much:
---
admin:
  # access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
  ### BEGIN http frontends ###
  - name: apis
    address:
      socket_address: { address: 0.0.0.0, port_value: 443 }
    listener_filters:
    - name: "envoy.filters.listener.tls_inspector"
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector
    filter_chains:
    - filter_chain_match:
        server_names: ["*.testdomain.dev"]
        transport_protocol: "tls"
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
          common_tls_context:
            tls_params:
              tls_minimum_protocol_version: TLSv1_2
              cipher_suites: "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305"
            tls_certificates:
            - certificate_chain:
                filename: /etc/envoy/STAR.testdomain.dev.crt
              private_key:
                filename: /etc/envoy/STAR.testdomain.dev.key
      filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          upgrade_configs:
          - upgrade_type: connect
          codec_type: AUTO
          use_remote_address: true
          xff_num_trusted_hops: 0
          access_log:
          - name: envoy.access_loggers.file
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
              path: "/dev/stdout"
          route_config:
            name: local_route
            virtual_hosts:
            - name: system_api
              domains: ["api.testdomain.dev"]
              routes:
              - match: { prefix: "/api/interact/" }
                route: { cluster: windfarm }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: windfarm
    connect_timeout: 0.25s
    type: STATIC
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    health_checks:
    - timeout: 1s
      interval: 1s
      unhealthy_threshold: 1
      healthy_threshold: 2
      http_health_check:
        path: /ping
    load_assignment:
      cluster_name: windfarm
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 29091
https://tldr.fail/ describes the issue. It is likely that Envoy is not reading the entire Client Hello as it spans packets.
Python test scripts can be found here:
github.com/dadrian/tldr.fail/blob/main/tldr_fail_test.py
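For anyone who can't run the linked script, here's a rough, independent sketch of the same idea. This is not the tldr.fail script, and it doesn't produce a Kyber-sized hello; it just lets Python's own TLS stack build a normal ClientHello and then delivers it in two separate TCP segments, which is the delivery pattern a tldr.fail-affected server mishandles:

import socket
import ssl
import sys
import time

host = sys.argv[1]
port = int(sys.argv[2]) if len(sys.argv) > 2 else 443

# Build a ClientHello into a memory BIO without touching the network yet.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
tls = ctx.wrap_bio(incoming, outgoing, server_hostname=host)
try:
    tls.do_handshake()
except ssl.SSLWantReadError:
    pass  # expected: the handshake can't finish until the server answers
client_hello = outgoing.read()

# Deliver the ClientHello in two separate TCP segments with a pause in between.
sock = socket.create_connection((host, port), timeout=5)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
mid = len(client_hello) // 2
sock.sendall(client_hello[:mid])
time.sleep(0.2)
sock.sendall(client_hello[mid:])

try:
    reply = sock.recv(1)
    # 0x16 is a TLS handshake record, i.e. the server came back with a ServerHello.
    print("OK: got a handshake record back" if reply == b"\x16"
          else f"unexpected/empty reply: {reply!r}")
except socket.timeout:
    print("no reply (timed out)")

A healthy server prints the OK line; a server that chokes on split hellos times out or resets.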
@vparla thanks for that. I used the linked python script to test against the latest Envoy and it worked correctly (Envoy sent back a ServerHello).
@cd-fernando can you try the script at https://github.com/dadrian/tldr.fail/blob/main/tldr_fail_test.py and see what it returns?
Can anyone else test this and report success or failure?
I got one more report of it working correctly with Chrome.
I think I understand what's going on: the TlsInspector filter doesn't read from the socket; it peeks. This means the entire ClientHello needs to fit into the configured socket read buffer size.
I saw in the tcpdump that the server had a fully filled up tcp window (of only about 1500 bytes), and I didn't realize until now that it was the TlsInspector, not the TLS transport socket, that was getting stuck.
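To make that failure mode concrete, here's a small localhost-only sketch (plain Python sockets on Linux, nothing Envoy-specific, with made-up sizes): the server peeks with MSG_PEEK the way the TlsInspector does, so the amount of data it can ever see is capped by the kernel receive buffer, no matter how long it waits:

import socket
import threading
import time

TINY_RCVBUF = 4096        # deliberately small kernel receive buffer (the bug condition)
PAYLOAD = b"x" * 20000    # stands in for an oversized ClientHello

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Must be set before listen() so accepted sockets inherit the small buffer.
srv.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, TINY_RCVBUF)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

def client():
    with socket.create_connection(("127.0.0.1", port)) as c:
        c.sendall(PAYLOAD)
        time.sleep(2)

threading.Thread(target=client, daemon=True).start()
conn, _ = srv.accept()

# Peek without consuming, like the TlsInspector: the visible byte count
# plateaus around the receive buffer size and never reaches len(PAYLOAD).
for _ in range(5):
    time.sleep(0.3)
    peeked = conn.recv(len(PAYLOAD), socket.MSG_PEEK)
    print(f"peeked {len(peeked)} of {len(PAYLOAD)} bytes")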
I've never seen a socket read buffer configured that small. What OS are you using?
If you don't need to select a filter chain based on SNI, you can remove the TlsInspector from your config and that should fix this.
Thanks everyone for looking into this. @ggreenway you were right: our TCP max window size was too small; it was set to 4096, and that wasn't enough.
For a bit more background, in case you're curious: it's a value we had set to optimise HAProxy, and only HAProxy. Unfortunately, because our QA boxes are self-contained, that setting affected Envoy too. It had never been a problem until these quantum-resistant protocols were enabled.
Thank you very much again!
FYI that script doesn't seem to work on versions of Python older than 3.11.
Thank you very much for your assistance, I'm closing this now.
Just so I can document this on tldr.fail: where is the socket read buffer configured in this context? Is that an Envoy, TLS Inspector, or kernel setting (or somewhere else)? I would have expected that to be internal to Envoy's implementation, but it didn't seem like this needed a code change?
@dadrian it's the kernel socket receive buffer. On Linux it's normally set with sysctl. This is a shortcoming in how Envoy implements this, but it's extremely uncommon to have such a small socket receive buffer, and changing Envoy to handle this condition is not simple, so until someone decides to put in the effort to fix it, I think this will remain a known issue.
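For anyone else hitting this, a quick way to sanity-check a box (Linux paths; this assumes the limit was applied through the usual sysctls rather than per-socket) is something like:

import socket

# Effective receive buffer for a fresh TCP socket (roughly what a listener would get).
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print("SO_RCVBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF), "bytes")
s.close()

# System-wide TCP buffer sysctls (tcp_rmem is min / default / max).
for path in ("/proc/sys/net/ipv4/tcp_rmem",
             "/proc/sys/net/core/rmem_default",
             "/proc/sys/net/core/rmem_max"):
    try:
        with open(path) as f:
            print(path, "=", f.read().strip())
    except OSError:
        pass  # not on Linux, or the sysctl isn't exposed here

Anything in the low single-digit-KB range (like the 4096 reported above) is too small to hold a post-quantum ClientHello of a few KB.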
@ggreenway if I'm reading https://github.com/envoyproxy/envoy/issues/33850#issuecomment-2092033930 correctly, it sounds like the kernel fix is not sufficient. If the second half of the ClientHello happens to arrive nontrivially later, e.g. due to packet loss, the kernel will release the first half of the ClientHello to the application and only return the second half later. That would mean that Envoy servers will be unreliable when connecting to post-quantum-capable clients, including 100% of desktop Chrome.
Is that correct? If so, is there a bug somewhere that tracks making Envoy post-quantum-ready?
Talking to Google Envoy folks, it sounds like I misunderstood the bug. Would be good to confirm that you all indeed retry correctly when the second packet comes in late, but it sounds like it's probably fine? Sorry for the (probably) false alarm!
The tls_inspector waits for new data on the socket; every time new data arrives, it reads it and feeds it into SSL_do_handshake(). If it either gets an error condition from this call or receives the callback set with SSL_CTX_set_tlsext_servername_callback, it marks itself as complete, passes the appropriate data to the filter chain matching in Envoy, and the connection proceeds.
It does not matter how many packets the ClientHello arrives in, as long as the ClientHello is less than the tls_inspector configured limit (configurable; defaults to 64KB) and the ClientHello fits in the kernel socket receive buffer (because this part of the code is using MSG_PEEK and doesn't remove the ClientHello from the socket read buffer).
There's a unit-test here that delivers the ClientHello 1 byte at a time.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.