gateway icon indicating copy to clipboard operation
gateway copied to clipboard

Support HTTPs Active Health Checks For TLS Passthrough Upstreams

Open muzcategui1106 opened this issue 1 year ago • 14 comments

Description

Envoy gateway supports health checks for applications using TLS termination by creating an HTTPRoute and a BackendTrafficPolicy](https://gateway.envoyproxy.io/latest/api/extension_types/#backendtrafficpolicy).

However, this setup does not work for TLS passthrough. When we define a health check for a TLS passthrough Listerner that has an associated TLSRoute we get the following error.

2024-08-09T01:45:23.429Z ERROR gateway-api runner/runner.go:186 unable to validate xds ir, skipped sending it {"runner": "gateway-api", "error": "field HTTPHealthChecker.Host should be specified"}

TLS passthrough is common in many organizations where app teams dont want to share the cert in the cluster where envoy gateway is running or where latency is a concern and they do not wish to decrypt and rencrypt. It is common for an organization to have a common CA cert chain that can be used to validate any certificates created by the compnay's CA, in such cases, appt teams can configure it via a configmap or it can be supplied by cluster admins by default.

This feature request aims to add support for active health checks for tls passthrough applications by using the same mechanism as TLS termination listeners with HTTPRoutes.

The fix seems simple in theory but the devil is on the details. We need to setup the health check hostname for TLS routes just as we do for HTTPRoutes.

Validation Function

muzcategui1106 avatar Aug 23 '24 13:08 muzcategui1106

I would like to work on it. It might take a little while before I am allowed by my compnay's policy to contribute to it

muzcategui1106 avatar Aug 23 '24 13:08 muzcategui1106

seems like a valid use to support we probably need to add this line https://github.com/envoyproxy/gateway/blob/eeb62c88f8949f8da8a1278ec5515ffa1a004444/internal/gatewayapi/backendtrafficpolicy.go#L400 for TCP

arkodg avatar Aug 23 '24 16:08 arkodg

@arkodg it does not seem that straight forward after looking into it because ir.TCPRoute struct does not have a Hostname filed. Adding one is an option but it would mean creating a separate ir.TCPRoute for each hostname in the TCP listener similar how it is done for HTTPRoutes where a separate ir.HTTPRoute is created per hostname in the listener.

A separate option but perhaps not ideal is set the hostname entry on the health check based on the first SNI hostname inside ir.TCPRoute.TLS.TLSInspectorConfig.SNIs . That however does not seem super stable because there is always a possibility that the first hostname in the SNI config is no longer present in one of the application cert's SANs

What do you think?

muzcategui1106 avatar Aug 25 '24 18:08 muzcategui1106

hey @muzcategui1106, I think it's fine to set the hostname to ir.TCPRoute.TLS.TLSInspectorConfig.SNIs[0] if ir.TCPRoute.TLS != nil && ....... if that assumption made is wrong, we can revisit the HealthCheck API in the future wdyt @envoyproxy/gateway-maintainers

arkodg avatar Aug 26 '24 22:08 arkodg

@muzcategui1106 does the feature work if we delete the ErrHCHTTPHostInvalid check ?

arkodg avatar Aug 26 '24 23:08 arkodg

@arkodg no it does not. there is no hostname field in the health check and the endpoint just becomes UNHEALTHY

I tried doing the the SNI fix we proposed above. That sort of worked, the endpoints became HEALTHY and I saw traffic flowing to my upstream pod coming from envoy-proxy. However, the TLS passthrough listener broke. See the below output. Not sure what is going on there yet.

curl -v --resolve "passthrough.example.com:6443:10.244.0.28" https://passthrough.example.com:6443 --cacert ca.crt
* Added passthrough.example.com:6443:10.244.0.28 to DNS cache
* Hostname passthrough.example.com was found in DNS cache
*   Trying 10.244.0.28:6443...
* Connected to passthrough.example.com (10.244.0.28) port 6443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: ca.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* (5454) (IN), , Unknown (72):
* error:0A00010B:SSL routines::wrong version number
* Closing connection 0
curl: (35) error:0A00010B:SSL routines::wrong version number

For context we can replicate this by

  • Setting up this example . Add a SAN to the cert otherwise the healthcheck wont work
  • create a configmap with the ca cert so you can run the health check TLS verification
  • add the SNI fix so that healt check validation passes in envoy-gateway
  • Create the following objects
apiVersion: gateway.networking.k8s.io/v1alpha3
kind: BackendTLSPolicy
metadata:
  name: tlsroute
  namespace: default
spec:
  targetRefs:
  - group: ''
    kind: Service
    name: passthrough-echoserver
  validation:
    caCertificateRefs:
    - name: ca
      group: ''
      kind: ConfigMap
    hostname: passthrough.example.com
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: tlsroute
  namespace: default
spec:
  healthCheck:
    active:
      healthyThreshold: 1
      http:
        path: /
        method: GET
        expectedStatuses: [200, 302] 
      interval: 2s
      timeout: 10s
      type: HTTP
      unhealthyThreshold: 3
  targetRef:
    group: gateway.networking.k8s.io
    kind: TLSRoute
    name: tlsroute

muzcategui1106 avatar Aug 27 '24 22:08 muzcategui1106

@muzcategui1106 can you make the health check e/p http only ?

the addition of BackendTLSPolicy for passthrough-echoserver is adding another tls socket to the cluster which is breaking tls passthrough (which is pure tcp)

arkodg avatar Aug 27 '24 22:08 arkodg

@arkodg not sure if I follow. I have the following TLSRoute

apiVersion: gateway.networking.k8s.io/v1alpha2
kind: TLSRoute
metadata:
  name: tlsroute
  namespace: default
spec:
  hostnames:
  - passthrough.example.com
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: eg
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: passthrough-echoserver
      port: 443
      weight: 1

the backend service ultimately directs to an echo server that expects a TLS connection. I can create another port within the service to go the the httpPort but that does not help me here because I cannot configure the TLSRoute to go to a port that expects TLS and a port that doesnt. When connecting to the listener I will ultimately intermittent failures because I would be hitting the http port half of the time

I can configure another TLS route but that would ultimately defeat the purpose of the health check because I am not health checking the targets of the TLS route that is attached to the TLS listener.

Let me know if I am completely wrong here

muzcategui1106 avatar Aug 28 '24 01:08 muzcategui1106

youre right @muzcategui1106, for TLS Passthrough, the healh check is limited to tcp checks with the current API

arkodg avatar Aug 28 '24 18:08 arkodg

what would be your recommendation to enable this in incremental steps. I would not mind working on smaller features first to accomplish this goal. But yeah the way the API is designed does not seem to make it trivial to do :/

muzcategui1106 avatar Aug 28 '24 19:08 muzcategui1106

Seems like we can use transport_socket_match_criteria at the healthcheck level. I am going to try to use EnvoyPatchPolicy to see if I can patch the envoy proxy configuration with a new transport socket to match the healthcheck

https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/health_check.proto

muzcategui1106 avatar Aug 28 '24 21:08 muzcategui1106

Seems like we can use transport_socket_match_criteria at the healthcheck level. I am going to try to use EnvoyPatchPolicy to see if I can patch the envoy proxy configuration with a new transport socket to match the healthcheck

https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/health_check.proto

yeah that option should help here. I'm a -1 on exposing this at the API level, because this is not a common use imo, but hoping you can still achieve this using EnvoyPatchPolicy.

lets keep this issue open and see if any others in the community also have such a requirement

arkodg avatar Aug 28 '24 21:08 arkodg

I think having a TLS passthrough application where we can still do health checks is super useful. Otherwise how can we know that the upstreams are available to receive traffic? you will end up routing to a bad upstream and getting intermittent connections. In the K8S world where we have readiness checks at the pod level that take the endpoint out of the service is not an issue at all; however, when interacting with services outside of the cluster it is definitely an issue

That being said, you are right this is something that would need to be exposed at the API level, and given that I have been the only person requesting for it perhaps it is not a good idea to do so at this point in time :/ . Hopefully this issue gets traction over time

For posterity.

IMO we should be able to define separate TLS options for the health checks on the BackendTrafficPolicy something along the lines of

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: tlsroute
  namespace: default
spec:
  healthCheck:
    active:
      healthyThreshold: 1
      http:
        path: /
        method: GET
        expectedStatuses: [200, 302] 
      interval: 2s
      timeout: 10s
      type: HTTP
      unhealthyThreshold: 3
      tlsOptions:
        validation:
          caCertificateRefs:
          - name: appca
            group: ''
            kind: ConfigMap
          hostname: passthrough.example.com
  targetRef:
    group: gateway.networking.k8s.io
    kind: TLSRoute
    name: tlsroute

Or something similar that lets the user specify things like a different SNI to use, or different ca bundle.

When enabled envoy should

  • create a transport_socket specifically for the health check
  • have the healthcheck match the transport_socket created
  • the transport socket only gets used for the health check, regular traffic through the cluster MUST not be modified

I ultimately made it work by applying these patches

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyPatchPolicy
metadata:
  name: tlsroute-healthcheck
  namespace: default
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: eg
  type: JSONPatch
  jsonPatches:
    - type: "type.googleapis.com/envoy.config.cluster.v3.Cluster"
      name: tlsroute/default/tlsroute/rule/-1
      operation:
        op: add
        path: "/health_checks"
        value:
          - healthyThreshold : 1
            httpHealthCheck:
              expectedStatuses:
                - start: "200"
                  end: "201"
              host: passthrough.example.com
              method: GET
              path: /
            interval: 2s
            timeout: 10s
            unhealthyThreshold: 3
            transportSocketMatchCriteria:
              healthcheck: true
    - type: type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret
      name: tlsroute-backend-policy-test/default-ca2
      operation:
        op: add
        value:
          name: tlsroute-backend-policy-test/default-ca2
          validationContext:
            trustedCa:
              inlineBytes: "xxxxxxxxxx"
    - type: "type.googleapis.com/envoy.config.cluster.v3.Cluster"
      name: tlsroute/default/tlsroute/rule/-1
      operation:
        op: replace
        path: "/transport_socket_matches"
        value:
          - match:
              healthcheck: true
            name: "tlsroute/default/tlsroute/rule/-1/tls/0"
            transportSocket:
              name: "envoy.transport_sockets.tls"
              typedConfig:
                "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
                commonTlsContext:
                  combinedValidationContext:
                    defaultValidationContext:
                      matchTypedSubjectAltNames:
                        - matcher: 
                            exact: "passthrough.example.com"
                          sanType: "DNS"
                    validationContextSdsSecretConfig:
                      name: "tlsroute-backend-policy-test/default-ca2"
                      sdsConfig:
                        ads: {}
                        resourceApiVersion: "V3"
                sni: "passthrough.example.com"

muzcategui1106 avatar Aug 29 '24 17:08 muzcategui1106

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Sep 28 '24 20:09 github-actions[bot]

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Nov 09 '24 08:11 github-actions[bot]