postgres-operator
Adjustable Timeout Needed for Leader Election in PostgreSQL Operator
- Which image of the operator are you using? ghcr.io/zalando/postgres-operator:v1.13.0
- Where do you run it - cloud or metal? Kubernetes or OpenShift? Bare Metal K8s on rke2
- Are you running Postgres Operator in production? yes
- Type of issue? Bug report/feature request
We have observed that when etcd is under heavy load, the PostgreSQL operator fails to complete the setup of a database cluster. From my reading of the code, the operator attempts to communicate with etcd five times in quick succession to designate a leader pod and initiate the database startup. In our case, slower Azure disks combined with etcd being under load caused the new PostgreSQL database pod to get stuck in the leader election process, and it never recovered.
Is there a way to increase the timeout in the operator to handle such cases? If not, can it be added? More generally, we do not understand why leader election is limited to a fixed number of retries. We only encounter slow disks in our test environments.
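
To illustrate the kind of behaviour we are asking for, here is a minimal sketch of a retry loop with a configurable retry limit and exponential backoff. It is not taken from the operator or Patroni code base. The function `retry_with_backoff` and its parameters (`max_retries`, `initial_delay`, `max_delay`) are hypothetical names used only for illustration, and the default of five retries reflects my reading of the current behaviour rather than a confirmed value.

```python
import time
from typing import Callable, Optional


def retry_with_backoff(
    attempt: Callable[[], bool],
    max_retries: Optional[int] = 5,   # hypothetical setting; None means retry forever
    initial_delay: float = 0.5,       # hypothetical: seconds to wait after the first failure
    max_delay: float = 30.0,          # hypothetical: upper bound for the wait between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call attempt() until it returns True or the retry limit is reached.

    The wait between attempts doubles after each failure, capped at max_delay,
    so a slow backend gets progressively more time to recover.
    """
    delay = initial_delay
    tries = 0
    while max_retries is None or tries < max_retries:
        if attempt():
            return True
        tries += 1
        # Do not sleep after the final failed attempt.
        if max_retries is not None and tries >= max_retries:
            break
        sleep(delay)
        delay = min(delay * 2, max_delay)
    return False
```

With settings like these, a test environment with slow disks could raise `max_retries` (or set it to `None`) without changing the behaviour for everyone else.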