
[ci,gateway,internal-gateway] Move gateways from nginx to envoy

Open · daniel-goldstein opened this issue · 4 comments

This is a multi-stage overhaul of our Kubernetes load balancers and service discovery. It involves moving off of NGINX onto Envoy, and, more importantly, gaining better control over which namespaces and services are active in our cluster at any given point in time.

TL;DR: Switching from NGINX to Envoy, with CI acting as the "control plane" for our internal networking, lets us dynamically configure our Kubernetes networking more easily and achieve proper connection pooling/load-balancing over TLS, which translates to lower resource consumption and lower request latencies.

Motivation

This is primarily a performance-motivated change, and one largely driven by our (ab)use of NGINX to work with our dynamically generated Kubernetes test namespaces. Currently, we configure NGINX with server blocks that dynamically resolve and dispatch requests by matching regular expressions against the host and path headers. This is largely because, at gateway deploy time, we do not statically know all of the namespaces and namespace-service combinations that will exist in the cluster in the future. We do know them for the default namespace, but not for test namespaces, and NGINX will refuse to start with statically configured upstreams that it cannot reach. Having the server blocks make routing decisions dynamically circumvents this limitation.

However, this prevents us from using NGINX upstream blocks, which provide connection pooling (at least in the community edition), so the gateways create and terminate a TCP connection per HTTP request. This likely causes only minor delays on the path from the front-end through the gateway, but it hampers job-scheduling performance greatly. The batch driver is forced to establish a new TCP connection and perform a TLS handshake with the internal-gateway multiple times per job, which is expensive and slow. We currently have to dedicate a 2-core NGINX sidecar to the batch-driver just to terminate TLS with internal-gateway and free up cycles in the batch-driver Python process. With proper persistent connections, we can reduce the TLS overhead to a single-digit percentage of a core.

This leads to the first goal of this transition: configure our load balancers to know the full cluster configuration at any point in time so they can properly maintain connection pools with upstream services.

However, this is not the only problem. Each "upstream" Service in Kubernetes may consist of multiple underlying pods, but Kubernetes Services as we use them don't provide proper load-balancing when mixed with persistent connections. When we declare a Service for, say, batch in default, Kubernetes adds a DNS record for batch.default that resolves to a single virtual IP handled by kube-proxy. When a new TCP connection is established to that IP, kube-proxy rolls the dice (using iptables) and assigns the connection to a particular pod, to which it forwards all subsequent packets. From the load-balancer's perspective there is only one IP address, and only one place to open connections, so it doesn't have the information to actually load-balance once we have a functioning connection pool. This can lead to badly unbalanced scenarios when preemptible pods come and go.

This leads to our second goal: instead of routing all requests through kube-proxy, use Kubernetes Headless Services to expose all pod IPs underlying a Service so that our proxies can properly load-balance across persistent connections.
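
For concreteness, a headless Service differs from a normal one only in setting `clusterIP: None`. The manifest below is an illustrative sketch rather than the exact one in this PR; the selector labels and port numbers are assumptions.

```yaml
# Illustrative headless Service for batch in the default namespace.
# Labels and ports are assumptions, not copied from this PR.
apiVersion: v1
kind: Service
metadata:
  name: batch
  namespace: default
spec:
  clusterIP: None          # headless: DNS returns one A record per ready pod
  selector:
    app: batch
  ports:
    - name: https
      port: 443
      targetPort: 443
```

With `clusterIP: None`, looking up `batch.default.svc.cluster.local` returns the individual pod IPs, which is exactly the endpoint list a proxy needs in order to balance persistent connections across replicas.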

Solution

This PR addresses the two goals outlined above by using Envoy, a load-balancer/proxy that is well suited to this sort of highly dynamic cluster configuration. Envoy does not have the constraint that all upstream services must be available at start time, and it has a very convenient API for updating the cluster configuration without restarting the process or dropping traffic. This makes regularly updating the cluster configuration whenever new test namespaces are created relatively straightforward and non-disruptive to traffic in other namespaces. The high-level approach is as follows:

  1. Envoy-based gateways and internal-gateways load their routing configuration from a Kubernetes ConfigMap, which they watch and reconcile against whenever it changes. The ConfigMap can be populated with a manual deploy and contains the production routes from the start (e.g. batch.hail.is gets routed to batch.default). A sketch of what such a configuration might look like follows this list.
  2. When CI is running, it regularly updates the ConfigMap with additional routes based on which internal namespaces (dev and PR) are currently active. This requires a relatively small change to CI to track active namespaces. Note that this does not introduce a dependency on CI for production traffic, only for development traffic.
  3. Deployments that run more than 1 replica (though really this could be all of them) are run behind Headless Services, which expose the underlying pod IPs so that Envoy handles load-balancing instead of kube-proxy. This allows Envoy to make smart load-balancing decisions and correctly enforce rate-limiting when using connection pools.
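
To make the ConfigMap idea concrete, here is a rough sketch of the kind of Envoy configuration it might carry for a single service: a virtual host that matches the public hostname and an upstream cluster that resolves the headless Service. The field names follow Envoy's v3 API, but the overall layout, service names, and ports are illustrative assumptions; the real xDS config generated by the gateway Makefiles and updated by CI may differ.

```yaml
# Sketch only: layout, names, and ports are assumptions, not this PR's schema.
route_config:
  virtual_hosts:
    - name: batch-default
      domains: ["batch.hail.is"]
      routes:
        - match: { prefix: "/" }
          route: { cluster: batch-default }
clusters:
  - name: batch-default
    type: STRICT_DNS            # re-resolves the headless Service, yielding one endpoint per pod
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: batch-default
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: batch.default.svc.cluster.local
                    port_value: 443
```

When CI adds or removes a test namespace, it only needs to add or remove virtual-host/cluster pairs like these in the ConfigMap, and the running Envoys pick up the change without a restart.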

The namespace tracking in CI for Point 2 is possible before we make any changes to our networking, so that comes first in #12093. Point 3 is taken care of in #12094, and Point 1 plus the rest of Point 2 (everything to do with Envoy) is in this PR.

Additional QoL improvements

  • Envoy exposes Prometheus metrics by default, which we can use to easily monitor things like rate-limiting, request failures, and request durations (see the sketch after this list)
  • Since all Envoy configuration lives in the ConfigMap, we don't need to build any images. I suppose we could have done this with NGINX too, so it isn't something to fault NGINX for; just another small win buried in these changes.
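
On the metrics point above: enabling Envoy's admin interface is enough to get a Prometheus-compatible endpoint at /stats/prometheus. The port below is an assumed value, not necessarily what this PR uses.

```yaml
# Sketch: Envoy admin interface; Prometheus can then scrape
# http://<envoy-pod>:8001/stats/prometheus. The port is an assumption.
admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 8001
```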

cc @danking

daniel-goldstein commented on Aug 17 '22

Do you need to replace all of the port numbers in the actual services (i.e. batch/front_end.py) as well?

jigold commented on Sep 22 '22

> Do you need to replace all of the port numbers in the actual services (i.e. batch/front_end.py) as well?

Do you mean this because of the ENVOY_UID change? Restricting Envoy to an unprivileged user is something they've done in their docker image, and it's not something that we do for our k8s services (though maybe we should?). As such, our python services can bind to 443 all they like, and I have changed them to do so already in this PR, though we can put them on any port we want as long as we're consistent.

daniel-goldstein commented on Sep 22 '22

@daniel-goldstein After this lands in main, can we do a Let's Encrypt rotation just to smoke-test that flow while this is all front of mind? Thanks!

danking commented on Sep 22 '22

> Do you need to replace all of the port numbers in the actual services (i.e. batch/front_end.py) as well?

> Do you mean this because of the ENVOY_UID change? Restricting Envoy to an unprivileged user is something they've done in their docker image, and it's not something that we do for our k8s services (though maybe we should?). As such, our python services can bind to 443 all they like, and I have changed them to do so already in this PR, though we can put them on any port we want as long as we're consistent.

I meant that you changed ports to 8443 and 8080. Do you need to change the services' ports?

jigold commented on Sep 22 '22

@jigold I stood up batch/ci in my own project from hail-is/hail:main and then deployed this branch, taking note of any changes I needed to make, and all seemed to work out OK. I think that's about as much as I can properly test this without trying things out in haildev/hail-vdc. The steps were as follows:

  1. Generate the configmaps used by gateway/internal-gateway. These will have the routing configuration for production services (I've edited the bootstrap instructions to match): `make -C gateway envoy-xds-config && make -C internal-gateway envoy-xds-config`
  2. … wait a few seconds for CI to quietly update these configmaps with information about testing namespaces … (can manually verify changes with `download-configmap gateway-xds-config`)
  3. Deploy the new versions of gateway/internal-gateway: `make -C gateway deploy NAMESPACE=default && make -C internal-gateway deploy NAMESPACE=default`

This worked for me in my project with no downtime, but either way I would probably do the same thing as with the previous PR, where I test it in Azure before making changes to hail-vdc.

daniel-goldstein commented on Oct 27 '22

Mitigation if anything goes wrong is pretty trivial: just redeploy gateway/internal-gateway from an earlier commit on main.

daniel-goldstein commented on Oct 27 '22

Sounds good. Will review now and hopefully have it done early this afternoon.

jigold commented on Oct 27 '22

I'm good with this.

@danking Did you want to take one last look, especially at the feature-parity list with NGINX that Daniel put together?

jigold commented on Oct 27 '22