[k8s] Move to headless services and make pods listen on 443
Moves multi-pod deployments over to using Headless Services, which enables client-side load-balancing to the underlying pods. See #12095 for more context.
The reason I put this in its own PR is that Kubernetes won't let me apply the `clusterIP: None` change to existing `Service` resources; I must delete the `Service` resources first. I can manually delete and apply new headless services in a way that is compatible with what is currently on `main` and with just a few seconds of downtime, but I should do this manually just before this PR merges.
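For reference, the change in question is roughly the following (a sketch with illustrative names and ports, not the exact manifests in this repo):

```yaml
# Hypothetical manifest illustrating the change: a headless Service.
# Setting clusterIP: None tells Kubernetes not to allocate a virtual
# cluster IP; DNS queries for batch.default then return the individual
# pod IPs directly instead of a single virtual IP.
apiVersion: v1
kind: Service
metadata:
  name: batch
  namespace: default
spec:
  clusterIP: None   # immutable on an existing Service, hence the delete-and-recreate
  selector:
    app: batch
  ports:
    - port: 443
      targetPort: 443
```

The `clusterIP` field being immutable is what forces the manual delete step described above.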
Can you add the links you found where they suggested using the headless services approach for upstream connections?
Here's the link to the headless service documentation and here's an example blog post where someone encountered the same issues we were facing with normal services. I think the documentation is motivation enough though: Envoy's `STRICT_DNS` setting would be considered a form of "service discovery" done through DNS. In order for Envoy to correctly make load balancing decisions, that DNS request should return all the viable IPs for an upstream instead of a single IP that points to kube-proxy. Headless services do just that.
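To make that concrete, a `STRICT_DNS` cluster in Envoy looks roughly like this (a hedged sketch of the Envoy v3 config shape, not our actual config; the cluster and hostname are assumed):

```yaml
# Hypothetical Envoy cluster definition. With type STRICT_DNS, Envoy
# periodically re-resolves the hostname and treats every returned
# address as a distinct upstream endpoint to load-balance across.
# Against a headless Service this yields one endpoint per pod;
# against a normal Service it would yield only the single cluster IP,
# defeating Envoy's load balancing.
clusters:
  - name: batch
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: batch
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: batch.default.svc.cluster.local
                    port_value: 443
```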
The `*` means that route will be triggered for any request matching the specified URL with any method, be it GET, POST, etc. The reason I needed to make that change is that when Envoy makes an authentication request to that endpoint, it uses the HTTP method of the original request. E.g. if I make a POST to https://internal.hail.is/dgoldste/batch/batches/create, Envoy will authenticate me with a POST request to `auth:443/api/v1alpha/verify_dev_credentials`. So I can't restrict that endpoint to any one method.
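For context, in aiohttp (which I believe these services are built on) the change amounts to registering the handler with the `*` method wildcard. A minimal sketch, with the handler body simplified and the real credential check omitted:

```python
from aiohttp import web

routes = web.RouteTableDef()

# '*' registers this handler for every HTTP method (GET, POST, ...),
# so Envoy's auth subrequest succeeds regardless of which method the
# original client request used.
@routes.route('*', '/api/v1alpha/verify_dev_credentials')
async def verify_dev_credentials(request: web.Request) -> web.Response:
    # The real handler verifies dev credentials; simplified here.
    return web.Response(status=200)

app = web.Application()
app.add_routes(routes)
```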
I think port 443 is so we don't need root privileges in Envoy?
This is related to the way headless services expose the pod itself, but as I'm writing this I feel like I want more clarity on exactly why, so I will do a bit of digging and come back with a better response.
Ah, I remember why this is. Here's a diagram of the current and proposed scenarios that I hope helps:
Normal services (current `main`)

- gateway receives a request destined for `batch.hail.is`
- gateway intends to forward this request to `batch.default:443`
- gateway makes a DNS request to resolve `batch.default`. gateway receives IP `A.A.A.A`, which is the cluster IP of the batch Kubernetes `Service`
- gateway forwards the request to `A.A.A.A:443`
- The Kubernetes `Service` (really kube-proxy) receives the request, selects a pod with IP `X.X.X.X`, and forwards the request to `X.X.X.X:5000`
Proposed headless service approach

- gateway receives a request destined for `batch.hail.is`
- gateway intends to forward this request to `batch.default:443`
- gateway makes a DNS request to resolve `batch.default`. gateway receives multiple DNS records back saying that `batch.default` corresponds to the IP addresses `X.X.X.X`, `Y.Y.Y.Y`, and `Z.Z.Z.Z` (assuming there are 3 pods in the deployment)
- gateway gets its pick of the pods (this is really important and is why Envoy needs all the IPs to properly load balance!) and decides to forward the request directly to pod `X.X.X.X:443`
So in the second scenario, it is necessary that the pod itself be listening on 443 because that is where gateway is going to send the request. It is not exactly a permissions issue, but upon writing this I am now realizing that by doing so we require that service pods like `auth` and `batch` be running as root in order to bind to port 443. I think the port specified in the `Service` yaml is actually useless now. So two actionable options are:

- Remove the useless `port` field on the `Service` yaml for auth, batch, etc.
- Keep all of our services on unprivileged ports (5000) and have gateway forward traffic to `batch.default:5000` instead of `batch.default:443`. Keeping our services on port 5000 could allow us to run those services as non-root users. I guess k8s has them running as root by default…
So after thinking about it a bit, I think option 2 (listening on unprivileged ports instead of 443 for our in-cluster communication) would be a good thing to do in general, but I do think that it complicates this PR a bit. There are a few more places where we assume services are listening on 443, e.g. `deploy_config.py`, grafana/prometheus, etc. I think it would be best to make this change and then consider separately the task of moving from 443 -> 5000 for in-cluster communication.