postgres-operator Adjustable Timeout Needed for Leader Election in PostgreSQL Operator

Open uk1988 opened this issue 8 months ago • 1 comment

Which image of the operator are you using? ghcr.io/zalando/postgres-operator:v1.13.0
Where do you run it - cloud or metal? Kubernetes or OpenShift? Bare Metal K8s on rke2
Are you running Postgres Operator in production? yes
Type of issue? Bug report/feature request

We’ve observed that when etcd is under heavy load, the PostgreSQL operator fails to complete the setup of a database cluster. Based on my understanding of the code, the operator attempts to communicate with etcd five times in quick succession to designate a leader pod and initiate the database startup. However, in a scenario where we were using slower Azure disks while etcd was also under load, the new PostgreSQL database pod got stuck in the leader election process and never recovered.

Is there a way to increase the timeout in the operator to handle such cases? If not, can it be added? More generally, we do not understand why leader election is limited to a fixed number of retries. We only encounter slow disks in our test environments.
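
To illustrate the behaviour we are asking for, here is a minimal sketch in Python. It is not taken from the operator or Patroni code: `elect_leader`, `try_acquire`, and every parameter name are hypothetical, and a `TimeoutError` merely stands in for a slow etcd response. The idea is that the number of attempts and the delay between them are configurable, with the delay growing between attempts instead of all retries firing in quick succession.

```python
import time
from typing import Callable, Optional


def elect_leader(
    try_acquire: Callable[[], bool],         # hypothetical: one attempt to take the leader lock
    max_attempts: Optional[int] = 5,         # None means "retry until it succeeds"
    initial_delay: float = 0.5,              # seconds to wait after the first failed attempt
    max_delay: float = 30.0,                 # upper bound for the backoff delay
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry try_acquire with exponential backoff; return True once it succeeds."""
    delay = initial_delay
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            if try_acquire():
                return True
        except TimeoutError:
            # Assumption: a slow key-value store surfaces as a timeout and is retryable.
            pass
        if max_attempts is not None and attempt >= max_attempts:
            break  # attempt budget exhausted, do not sleep again
        sleep(delay)
        delay = min(delay * 2, max_delay)  # double the wait, capped at max_delay
    return False
```

With `max_attempts=None` a pod on a slow test environment would keep retrying until etcd responds, while production setups could keep a small, fixed attempt budget.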

Mar 26 '25 13:03 uk1988
