Support HTTPs Active Health Checks For TLS Passthrough Upstreams
Description
Envoy gateway supports health checks for applications using TLS termination by creating an HTTPRoute and a BackendTrafficPolicy](https://gateway.envoyproxy.io/latest/api/extension_types/#backendtrafficpolicy).
However, this setup does not work for TLS passthrough. When we define a health check for a TLS passthrough Listerner that has an associated TLSRoute we get the following error.
2024-08-09T01:45:23.429Z ERROR gateway-api runner/runner.go:186 unable to validate xds ir, skipped sending it {"runner": "gateway-api", "error": "field HTTPHealthChecker.Host should be specified"}
TLS passthrough is common in many organizations where app teams dont want to share the cert in the cluster where envoy gateway is running or where latency is a concern and they do not wish to decrypt and rencrypt. It is common for an organization to have a common CA cert chain that can be used to validate any certificates created by the compnay's CA, in such cases, appt teams can configure it via a configmap or it can be supplied by cluster admins by default.
This feature request aims to add support for active health checks for tls passthrough applications by using the same mechanism as TLS termination listeners with HTTPRoutes.
The fix seems simple in theory but the devil is on the details. We need to setup the health check hostname for TLS routes just as we do for HTTPRoutes.
I would like to work on it. It might take a little while before I am allowed by my compnay's policy to contribute to it
seems like a valid use to support we probably need to add this line https://github.com/envoyproxy/gateway/blob/eeb62c88f8949f8da8a1278ec5515ffa1a004444/internal/gatewayapi/backendtrafficpolicy.go#L400 for TCP
@arkodg it does not seem that straight forward after looking into it because ir.TCPRoute struct does not have a Hostname filed. Adding one is an option but it would mean creating a separate ir.TCPRoute for each hostname in the TCP listener similar how it is done for HTTPRoutes where a separate ir.HTTPRoute is created per hostname in the listener.
A separate option but perhaps not ideal is set the hostname entry on the health check based on the first SNI hostname inside ir.TCPRoute.TLS.TLSInspectorConfig.SNIs . That however does not seem super stable because there is always a possibility that the first hostname in the SNI config is no longer present in one of the application cert's SANs
What do you think?
hey @muzcategui1106, I think it's fine to set the hostname to ir.TCPRoute.TLS.TLSInspectorConfig.SNIs[0] if ir.TCPRoute.TLS != nil && .......
if that assumption made is wrong, we can revisit the HealthCheck API in the future
wdyt @envoyproxy/gateway-maintainers
@muzcategui1106 does the feature work if we delete the ErrHCHTTPHostInvalid check ?
@arkodg no it does not. there is no hostname field in the health check and the endpoint just becomes UNHEALTHY
I tried doing the the SNI fix we proposed above. That sort of worked, the endpoints became HEALTHY and I saw traffic flowing to my upstream pod coming from envoy-proxy. However, the TLS passthrough listener broke. See the below output. Not sure what is going on there yet.
curl -v --resolve "passthrough.example.com:6443:10.244.0.28" https://passthrough.example.com:6443 --cacert ca.crt
* Added passthrough.example.com:6443:10.244.0.28 to DNS cache
* Hostname passthrough.example.com was found in DNS cache
* Trying 10.244.0.28:6443...
* Connected to passthrough.example.com (10.244.0.28) port 6443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* CAfile: ca.crt
* CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* (5454) (IN), , Unknown (72):
* error:0A00010B:SSL routines::wrong version number
* Closing connection 0
curl: (35) error:0A00010B:SSL routines::wrong version number
For context we can replicate this by
- Setting up this example . Add a SAN to the cert otherwise the healthcheck wont work
- create a configmap with the ca cert so you can run the health check TLS verification
- add the SNI fix so that healt check validation passes in envoy-gateway
- Create the following objects
apiVersion: gateway.networking.k8s.io/v1alpha3
kind: BackendTLSPolicy
metadata:
name: tlsroute
namespace: default
spec:
targetRefs:
- group: ''
kind: Service
name: passthrough-echoserver
validation:
caCertificateRefs:
- name: ca
group: ''
kind: ConfigMap
hostname: passthrough.example.com
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
name: tlsroute
namespace: default
spec:
healthCheck:
active:
healthyThreshold: 1
http:
path: /
method: GET
expectedStatuses: [200, 302]
interval: 2s
timeout: 10s
type: HTTP
unhealthyThreshold: 3
targetRef:
group: gateway.networking.k8s.io
kind: TLSRoute
name: tlsroute
@muzcategui1106 can you make the health check e/p http only ?
the addition of BackendTLSPolicy for passthrough-echoserver is adding another tls socket to the cluster which is breaking tls passthrough (which is pure tcp)
@arkodg not sure if I follow. I have the following TLSRoute
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: TLSRoute
metadata:
name: tlsroute
namespace: default
spec:
hostnames:
- passthrough.example.com
parentRefs:
- group: gateway.networking.k8s.io
kind: Gateway
name: eg
rules:
- backendRefs:
- group: ""
kind: Service
name: passthrough-echoserver
port: 443
weight: 1
the backend service ultimately directs to an echo server that expects a TLS connection. I can create another port within the service to go the the httpPort but that does not help me here because I cannot configure the TLSRoute to go to a port that expects TLS and a port that doesnt. When connecting to the listener I will ultimately intermittent failures because I would be hitting the http port half of the time
I can configure another TLS route but that would ultimately defeat the purpose of the health check because I am not health checking the targets of the TLS route that is attached to the TLS listener.
Let me know if I am completely wrong here
youre right @muzcategui1106, for TLS Passthrough, the healh check is limited to tcp checks with the current API
what would be your recommendation to enable this in incremental steps. I would not mind working on smaller features first to accomplish this goal. But yeah the way the API is designed does not seem to make it trivial to do :/
Seems like we can use transport_socket_match_criteria at the healthcheck level. I am going to try to use EnvoyPatchPolicy to see if I can patch the envoy proxy configuration with a new transport socket to match the healthcheck
https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/health_check.proto
Seems like we can use
transport_socket_match_criteriaat the healthcheck level. I am going to try to use EnvoyPatchPolicy to see if I can patch the envoy proxy configuration with a new transport socket to match the healthcheckhttps://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/health_check.proto
yeah that option should help here. I'm a -1 on exposing this at the API level, because this is not a common use imo, but hoping you can still achieve this using EnvoyPatchPolicy.
lets keep this issue open and see if any others in the community also have such a requirement
I think having a TLS passthrough application where we can still do health checks is super useful. Otherwise how can we know that the upstreams are available to receive traffic? you will end up routing to a bad upstream and getting intermittent connections. In the K8S world where we have readiness checks at the pod level that take the endpoint out of the service is not an issue at all; however, when interacting with services outside of the cluster it is definitely an issue
That being said, you are right this is something that would need to be exposed at the API level, and given that I have been the only person requesting for it perhaps it is not a good idea to do so at this point in time :/ . Hopefully this issue gets traction over time
For posterity.
IMO we should be able to define separate TLS options for the health checks on the BackendTrafficPolicy something along the lines of
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
name: tlsroute
namespace: default
spec:
healthCheck:
active:
healthyThreshold: 1
http:
path: /
method: GET
expectedStatuses: [200, 302]
interval: 2s
timeout: 10s
type: HTTP
unhealthyThreshold: 3
tlsOptions:
validation:
caCertificateRefs:
- name: appca
group: ''
kind: ConfigMap
hostname: passthrough.example.com
targetRef:
group: gateway.networking.k8s.io
kind: TLSRoute
name: tlsroute
Or something similar that lets the user specify things like a different SNI to use, or different ca bundle.
When enabled envoy should
- create a transport_socket specifically for the health check
- have the healthcheck match the transport_socket created
- the transport socket only gets used for the health check, regular traffic through the cluster MUST not be modified
I ultimately made it work by applying these patches
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyPatchPolicy
metadata:
name: tlsroute-healthcheck
namespace: default
spec:
targetRef:
group: gateway.networking.k8s.io
kind: Gateway
name: eg
type: JSONPatch
jsonPatches:
- type: "type.googleapis.com/envoy.config.cluster.v3.Cluster"
name: tlsroute/default/tlsroute/rule/-1
operation:
op: add
path: "/health_checks"
value:
- healthyThreshold : 1
httpHealthCheck:
expectedStatuses:
- start: "200"
end: "201"
host: passthrough.example.com
method: GET
path: /
interval: 2s
timeout: 10s
unhealthyThreshold: 3
transportSocketMatchCriteria:
healthcheck: true
- type: type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret
name: tlsroute-backend-policy-test/default-ca2
operation:
op: add
value:
name: tlsroute-backend-policy-test/default-ca2
validationContext:
trustedCa:
inlineBytes: "xxxxxxxxxx"
- type: "type.googleapis.com/envoy.config.cluster.v3.Cluster"
name: tlsroute/default/tlsroute/rule/-1
operation:
op: replace
path: "/transport_socket_matches"
value:
- match:
healthcheck: true
name: "tlsroute/default/tlsroute/rule/-1/tls/0"
transportSocket:
name: "envoy.transport_sockets.tls"
typedConfig:
"@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
commonTlsContext:
combinedValidationContext:
defaultValidationContext:
matchTypedSubjectAltNames:
- matcher:
exact: "passthrough.example.com"
sanType: "DNS"
validationContextSdsSecretConfig:
name: "tlsroute-backend-policy-test/default-ca2"
sdsConfig:
ads: {}
resourceApiVersion: "V3"
sni: "passthrough.example.com"
This issue has been automatically marked as stale because it has not had activity in the last 30 days.
This issue has been automatically marked as stale because it has not had activity in the last 30 days.