[Operator] Don't use `cluster.local` in requests to NATS
Describe the Task
The operator currently makes a request to NATS at the URL <ip>.pl.pod.cluster.local. This assumes that the user's cluster domain is cluster.local. It can also cause problems when users' nodes have unorthodox DNS setups (see #1544). We should remove cluster.local from the URL and rely on Kubernetes DNS search paths; in other words, we should request <ip>.pl.pod instead. However, our certificate SANs currently include *.pl.pod.cluster.local and not *.pl.pod, so we need to update the certs before making this change.
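For illustration, here is a minimal Go sketch of the namespace-relative pod DNS name idea; the helper name is hypothetical and this is not the operator's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// podDNSName is a hypothetical helper: it turns a pod IP such as
// "10.145.240.158" into "10-145-240-158.pl.pod" (Kubernetes pod A records
// use the dashed-IP form). Because the name is not fully qualified, the
// resolver completes it via the cluster-domain entry in the pod's
// resolv.conf search list, so "cluster.local" (or any custom cluster
// domain) never has to be hard-coded.
func podDNSName(podIP, namespace string) string {
	return strings.ReplaceAll(podIP, ".", "-") + "." + namespace + ".pod"
}

func main() {
	// e.g. nats://10-145-240-158.pl.pod:4222 (4222 is the default NATS client port)
	fmt.Printf("nats://%s:4222\n", podDNSName("10.145.240.158", "pl"))
}
```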
Subtasks
- [ ] Update vizier certificate SANs to include *.pl.pod (see the sketch after this list)
- [ ] Release vizier to ensure clusters have the new certs
- [ ] Update operator to request from NATS at *.pl.pod
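As a rough illustration of the first subtask, the following Go sketch adds the wildcard SAN alongside the existing one when building a certificate template. It is not the actual vizier cert-provisioning code; names and values are illustrative, and it self-signs purely for brevity (a real deployment would sign with its CA):

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

func main() {
	// Keep the existing SAN and add the namespace-relative one, so certs
	// stay valid for both request styles during the rollout.
	dnsNames := []string{
		"*.pl.pod.cluster.local", // existing SAN
		"*.pl.pod",               // new SAN, resolvable via DNS search paths
	}

	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "pl-nats"}, // illustrative subject
		DNSNames:     dnsNames,
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(365 * 24 * time.Hour),
	}

	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	fmt.Printf("generated %d-byte cert with SANs %v\n", len(der), tmpl.DNSNames)
}
```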
Side note: when vizier is deployed with dev_cloud_namespace it will connect to vzconn using a cluster.local address, so we can fix that as part of this issue too.
Hi @JamesMBartlett, any updates on this issue? Or are there any workarounds? I am having a similar issue when using a custom domain with Cloud DNS and GKE. Thanks!
I set up my own GKE cluster using Cloud DNS with a GKE cluster scope to perform some testing on this. I believe the solution that @JamesMBartlett outlined above will work for environments that use kube-dns, but unfortunately it won't for Cloud DNS or anything using Kubernetes' external-dns. The external-dns FAQ states that only Services of the following types are supported (meaning Pod DNS records won't exist):
Services exposed via type=LoadBalancer, type=ExternalName, type=NodePort, and for the hostnames defined in Ingress objects as well as headless hostPort services.
Therefore, the Cloud DNS controller doesn't create any records within the pod.<cluster domain> subdomain. It does appear to create IP records, but they are scoped within <resource namespace>.svc.<cluster domain>.
$ kubectl -n pl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pl-nats-0 0/1 Running 0 8s 10.145.240.158 gke-ddelnano-nr-168275-default-pool-3a1b8344-7171 <none> <none>
So the pl-nats-mgmt service will create corresponding svc records, but the pl-nats service won't. In the place where we generate the vizier SANs, it seems that pl-nats-mgmt isn't referenced (source).
@JamesMBartlett what's the purpose of having both pl-nats and pl-nats-mgmt? It seems the former is what the vizier components use and the latter is used in the init containers. If the operator can use the pl-nats-mgmt Service, then I think we could accommodate the external-dns use cases by trying <ip>.pl-nats-mgmt if the existing request (<ip>.pl.pod.<cluster subdomain>) is unsuccessful.
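A hedged sketch of that fallback idea, assuming hypothetical helper and record names (the real operator code and service record forms may differ):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// connectWithFallback is a hypothetical helper illustrating the idea:
// try the kube-dns pod record first, and if that address cannot be
// reached (e.g. under Cloud DNS / external-dns, where pod records do
// not exist), fall back to the Service-scoped address.
func connectWithFallback(podAddr, svcAddr string) (net.Conn, error) {
	conn, err := net.DialTimeout("tcp", podAddr, 5*time.Second)
	if err == nil {
		return conn, nil
	}
	fmt.Printf("pod address %s failed (%v), falling back to %s\n", podAddr, err, svcAddr)
	return net.DialTimeout("tcp", svcAddr, 5*time.Second)
}

func main() {
	// Addresses are illustrative; 4222 is the default NATS client port.
	podAddr := "10-145-240-158.pl.pod:4222"
	svcAddr := "10-145-240-158.pl-nats-mgmt.pl.svc:4222"
	if conn, err := connectWithFallback(podAddr, svcAddr); err == nil {
		conn.Close()
	}
}
```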
In order to test this out, I put together a very crude experiment that updates the vizier SANs to use the cluster domain rather than the hardcoded cluster.local, and updates the operator to use the pl-nats-mgmt and cloud connector Services rather than the Pod DNS names only available from kube-dns (diff, branch).
I deployed the operator with skaffold, created a Vizier CRD by hand, and verified that the vizier is created successfully. So this issue isn't limited to NATS. The cloud connector has a similar problem, except it doesn't have a corresponding headless service that will return the A records of the Pod.
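One way such a change could determine the cluster domain instead of hard-coding it is to read the pod's resolv.conf search list. This is only a sketch of that idea, not the code in the linked diff, and the helper name is made up:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// clusterDomainFromResolvConf is a hypothetical helper: it looks for a
// "*.svc.<domain>" entry in the resolv.conf search list and returns the
// trailing <domain> (e.g. "cluster.local", or a custom cluster domain).
func clusterDomainFromResolvConf(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 0 || fields[0] != "search" {
			continue
		}
		for _, entry := range fields[1:] {
			if idx := strings.Index(entry, ".svc."); idx >= 0 {
				return entry[idx+len(".svc."):], nil
			}
		}
	}
	return "", fmt.Errorf("no svc.<domain> search entry found")
}

func main() {
	domain, err := clusterDomainFromResolvConf("/etc/resolv.conf")
	if err != nil {
		fmt.Println("falling back to cluster.local:", err)
		domain = "cluster.local"
	}
	fmt.Println("cluster domain:", domain)
}
```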
That diff looks reasonable to me.
I don't immediately see any issue with using the SvcPodAddr for pl-nats-mgmt instead of the nats pod directly.
As far as the purpose of both pl-nats and pl-nats-mgmt, I can't quite remember why we did that. @vihangm or @aimichelle might have more context on that decision.
I caught up with @vihangm and @aimichelle and believe that the approach outlined is a good direction to pursue. I need to follow up on productionizing my crude change, but I don't have a timeline for implementing that yet.
Hi all, any progress on this topic?
Hey @rkuklins, I don't have access to a GKE cluster anymore, so I'm not able to easily test and clean up my change from above. If you're interested in attempting to address this, I'm happy to help with the implementation.