postgres-operator
Load Balancer for PG Master missing Selectors
Problem
When deploying a new cluster using the operator, I see that the cluster comes up and spilo roles are assigned to the master and replica node(s).
I also see the load balancer for the replica nodes come up.
However, the load balancer for the master node gets created but never gets an external IP. Looking at the resource created, I see that the LB is missing a "selector" in its spec. Editing the resource online and adding the selector for the master spilo-role "fixes" the load balancer.
I've looked at the code in k8sres.go and indeed the selectors are only added for the replica load balancer there.
Am I missing something?
Versions Used
I'm testing the operator on GKE running 4 nodes and Kubernetes 1.10. I'm using the current master of the postgres-operator.
The master service must not have a selector attached.
Patroni takes care of modifying the endpoint and adds the IP of the master directly.
This avoids situations where more than one pod may in fact carry the master label.
Thus we need more info or output as to why the LB is apparently not working.
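To illustrate the point: the operator creates the master Service without a selector, and Patroni maintains the matching Endpoints object itself. A minimal sketch, using this thread's cluster name and an endpoint IP borrowed from a later comment (exact fields may differ between versions):

apiVersion: v1
kind: Service
metadata:
  name: acid-thingworx-cluster
spec:
  type: LoadBalancer
  ports:
  - name: postgresql
    port: 5432
    targetPort: 5432
  # no selector: the Kubernetes endpoints controller then leaves the
  # Endpoints object alone, so Patroni can manage it exclusively
---
apiVersion: v1
kind: Endpoints
metadata:
  name: acid-thingworx-cluster
subsets:
- addresses:
  - ip: 10.42.3.67  # IP of the current master pod, written by Patroni
  ports:
  - name: postgresql
    port: 5432
    protocol: TCP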
OK, I see. However, the bug is nevertheless there. I currently test on GKE. I have an operator setup with 3 nodes. They come up, and everything seems OK. The LB is working.
Now to test failover, I deleted the master pod. It gets recreated, and assigned the master role correctly. The LB service for the master does not work -- in the GKE console I see that no pod is assigned to that LB.
What output do you need?
Here's the broken LB state:
apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: thingworx-cluster.acid.staging.db.example.com
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "3600"
  creationTimestamp: 2018-07-11T12:04:34Z
  labels:
    application: spilo
    spilo-role: master
    version: acid-thingworx-cluster
  name: acid-thingworx-cluster
  namespace: default
  resourceVersion: "150357"
  selfLink: /api/v1/namespaces/default/services/acid-thingworx-cluster
  uid: 9369db92-8502-11e8-b3c7-42010a9c0ff8
spec:
  clusterIP: 10.35.241.32
  externalTrafficPolicy: Cluster
  ports:
  - name: postgresql
    nodePort: 30444
    port: 5432
    protocol: TCP
    targetPort: 5432
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 35.234.87.166
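For reference, the manual "fix" described at the top of the thread amounts to patching a selector like the following into that spec (a sketch based on the labels shown above; as explained earlier, a selector on the master service conflicts with Patroni's endpoint management):

spec:
  selector:
    application: spilo
    spilo-role: master
    version: acid-thingworx-cluster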
Since you are on GKE you are unfortunately doing something we have not yet dived into/tested.
So you want the PostgreSQL host on K8S exposed to some application not within the K8S cluster itself?
While we do support this on AWS via ELBs, and making this work on GKE too is probably important, the expectation is mostly around having Postgres clusters and applications side by side, e.g. to benefit from the roles and secrets setup.
I am not sure how the Google LB works here; I believe it also adds all nodes, and only some are then healthy or have the open port to connect to.
@seletz could you, please, post the output of the following:
kubectl get pods -l version=acid-thingworx-cluster -L spilo-role -o wide
kubectl get ep -l version=acid-thingworx-cluster
kubectl get ep acid-thingworx-cluster -o yaml
@Jan-M No, actually I'm right with you there -- I want no access to PG from outside the cluster. That LB I configured is for my testing only (I hit #330) -- we're not in production yet.
@alexeyklyukin Will do -- I'm currently hitting #330 and am in the process of building my own spilo image. Meanwhile I've shut down my cluster on GKE. I'll try to get past #330 on minikube -- the application I need to deploy on K8S can't handle SSL Postgres connections, it seems.
This also doesn't allow reliably connecting from the outside with port-forward:
❯ k port-forward service/foo-main 5432
error: cannot attach to *v1.Service: invalid service 'foo-main': Service is defined without a selector
I understood the rationale behind the decision, so just pointing the problem out.
I am running into this problem currently. How do I port-forward given this error? I don't want to set up an Ingress.
> kubectl port-forward service/production-geodb 5432:5432
error: cannot attach to *v1.Service: invalid service 'production-geodb': Service is defined without a selector
@CarlQLange I'm using this workaround:
kubectl port-forward $(kubectl get pod -l cluster-name=foo-db,spilo-role=master -o jsonpath='{.items[0].metadata.name}') 15432:5432
Strangely, the replica does work, so port-forward svc/production-geodb-repl works but port-forward svc/production-geodb does not. Perhaps this is due to the number of replicas or to useMasterLoadBalancer/useReplicaLoadBalancer...
Thank you for the workaround! I had to change cluster-name to version for some reason, but it does work!
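Putting the two observations together, a generalized sketch of the workaround (hypothetical cluster name foo-db; as noted above, older operator versions label pods with version instead of cluster-name):

# the replica service has a selector, so forwarding to it works:
kubectl port-forward svc/foo-db-repl 5432:5432

# the master service has no selector, so target the master pod directly:
MASTER_POD=$(kubectl get pod -l cluster-name=foo-db,spilo-role=master \
  -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward "$MASTER_POD" 5432:5432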
Our use case is actually to be able to access the database from outside the cluster using MetalLB. We can only make it work if we add the selectors to the master service. Will something break if those selectors are there on the master?
There are not supposed to be selectors on the master service. Assigning and removing endpoints is done by Patroni. This is intentional, to avoid ending up with two pods in there, which can happen.
Hi there. We're trying to set up the postgres-operator inside an Istio service mesh, and are as such trying to fathom how the different endpoint IPs are assigned by Patroni.
We've come to the understanding that Patroni runs together with Postgres on each pod in the cluster, and that it's responsible for communicating with etcd to select the leader. From the docs and from your previous replies we gather that the Endpoint (in our case coterie-web-db) is being updated by some method: either via the Kubernetes API, or by some magic that lets a ClusterIP service route to whichever pod IP has a certain annotation.
We've done a DNS lookup in a consuming service for the "main" A record name of the postgres cluster, and the ClusterIP is returned.
What we don't understand is how the pod IP of the current master is routed to from the ClusterIP. Does Patroni talk straight to the Kubernetes API, and in that case, why can't we see the pod IPs when we do a kubectl get endpoints listing?
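For what it's worth, kube-proxy forwards ClusterIP traffic to whatever addresses the Endpoints object holds, regardless of whether they were written by the endpoints controller or directly by Patroni through the Kubernetes API. Inspecting the object itself, rather than the list view, should reveal the master pod IP (a command sketch using the cluster name from the question above):

kubectl get endpoints coterie-web-db -o yaml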
MetalLB searches for a NodeName in v1.Endpoints.Subsets.Addresses when choosing the node to announce (https://github.com/metallb/metallb/blob/main/speaker/layer2_controller.go#L43).
When the endpoints are set by the postgres-operator, there is no such data and MetalLB can't choose a node:
... Subsets:[{[{10.42.3.67 <nil> nil}] [] [{postgresql 5432 TCP}]}] ...
When I add the selector and the endpoints are set by the service controller, there is more data in the subsets and the service is successfully announced:
... Subsets: [ {[{10.42.3.67 rnd-pg-cluster-0 0xc000439d00 ObjectReference{Kind:Pod,Namespace:rnd,Name:rnd-pg-cluster-0,UID:127045bc-c715-4005-b325-314e33621f33,APIVersion:,ResourceVersion:27532992,FieldPath:,}}] [] [{postgresql 5432 TCP}]} ] ...
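Rendered as YAML, the difference between the two dumps above is roughly this (a sketch; the node name appears only as a pointer address in the Go dump, so a placeholder is used):

# Endpoints as written by Patroni/the operator: bare IP only
subsets:
- addresses:
  - ip: 10.42.3.67
  ports:
  - name: postgresql
    port: 5432
    protocol: TCP

# Endpoints as written by the endpoints controller once a selector exists:
subsets:
- addresses:
  - ip: 10.42.3.67
    hostname: rnd-pg-cluster-0
    nodeName: <node-name>  # present in the dump, shown only as a pointer
    targetRef:
      kind: Pod
      namespace: rnd
      name: rnd-pg-cluster-0
      uid: 127045bc-c715-4005-b325-314e33621f33
  ports:
  - name: postgresql
    port: 5432
    protocol: TCP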
It is irritating for new users that "kubectl port-forward service/... 5432" does not work. Please at least document that this is expected in the "check created service resources" block of the Quickstart documentation. Maybe add Ilya's workaround (https://github.com/zalando/postgres-operator/issues/340#issuecomment-520703326).
I am seeing this issue on a k8s 1.18.2 cluster, deployed via kubeadm to bare-metal Ubuntu 20. The Endpoints never populate; the cluster always reports as failed to create.
@andy-v you should look at the pod logs; there should be a clue there as to what is wrong.
@CyberDem0n the operator logs?
Spilo pods.
Why is this ticket still open if this is intended behavior?
I am still confused about this issue. How should we connect to the master pod via the master service's ClusterIP?