postgres-operator
Consistent Crash Loops Attempting to Bootstrap Leader and Failing Replicas
- Which image of the operator are you using? ghcr.io/zalando/postgres-operator:v1.12.2
- Where do you run it - cloud or metal? Kubernetes or OpenShift? Bare Metal K8s
- Are you running Postgres Operator in production? Yes
- Type of issue? Bug report / request for help
This happens on v1.12.2. Because the error is FATAL, it appears to be what brings my cluster down every time I deploy (bare metal). I tried deleting the whole cluster and redeploying, and the problem reproduces consistently.
$ kubectl exec -it -n postgres-operator postgres-cluster-0 -- patronictl show-config
failsafe_mode: false
loop_wait: 10
maximum_lag_on_failover: 33554432
postgresql:
  parameters:
    archive_mode: 'on'
    archive_timeout: 1800s
    autovacuum_analyze_scale_factor: 0.02
    autovacuum_max_workers: 5
    autovacuum_vacuum_scale_factor: 0.05
    checkpoint_completion_target: 0.9
    hot_standby: 'on'
    log_autovacuum_min_duration: 0
    log_checkpoints: 'on'
    log_connections: 'on'
    log_disconnections: 'on'
    log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
    log_lock_waits: 'on'
    log_min_duration_statement: 500
    log_statement: ddl
    log_temp_files: 0
    max_connections: '640'
    max_replication_slots: 10
    max_wal_senders: 10
    tcp_keepalives_idle: 900
    tcp_keepalives_interval: 100
    track_functions: all
    wal_compression: 'on'
    wal_level: hot_standby
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30
And the repeating sequence of error + fatal logs:
# kubectl logs -n postgres-operator postgres-cluster-1 | head -n 150
...
2024-07-31 11:06:52,579 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 1078, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 1073, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 157, in get_connection_cursor
    conn = psycopg.connect(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 103, in connect
    ret = _connect(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: role "standby" does not exist
2024-07-31 11:06:52,631 INFO: promoted self to leader by acquiring session lock
2024-07-31 11:06:52,678 INFO: Lock owner: postgres-cluster-1; I am postgres-cluster-1
2024-07-31 11:06:52,740 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 1078, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 1073, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 157, in get_connection_cursor
    conn = psycopg.connect(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 103, in connect
    ret = _connect(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: role "standby" does not exist...
...
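To make it easier to confirm which role the replication connection is being refused for, here is a minimal sketch of a log filter. The function name missing_roles is my own helper, not part of Patroni or the operator, and it only assumes the log lines have the same FATAL: role "..." does not exist shape as the excerpt above.

```shell
# Hypothetical helper: read Patroni log text on stdin and print each
# distinct role name that PostgreSQL reported as missing.
missing_roles() {
  # Keep only the 'FATAL: role "<name>" does not exist' fragment,
  # strip everything except <name>, then de-duplicate.
  grep -o 'FATAL: *role "[^"]*" does not exist' \
    | sed 's/.*role "\([^"]*\)".*/\1/' \
    | sort -u
}
```

Usage against the pod from this report would look like: kubectl logs -n postgres-operator postgres-cluster-1 | missing_roles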
The cluster's resources are sufficient (memory, CPU and storage).
Failed attempted fixes:
- kubectl exec -it -n postgres-operator postgres-cluster-0 -- patronictl restart postgres-cluster, with output:
$ kubectl exec -it -n postgres-operator postgres-cluster-0 -- patronictl restart postgres-cluster
+ Cluster: postgres-cluster (7397754784576811067) ------+----+-----------+
| Member             | Host         | Role    | State   | TL | Lag in MB |
+--------------------+--------------+---------+---------+----+-----------+
| postgres-cluster-0 | 10.99.247.35 | Replica | stopped |    |   unknown |
| postgres-cluster-1 | 10.99.247.38 | Replica | stopped |    |   unknown |
| postgres-cluster-2 | 10.99.247.34 | Replica | stopped |    |   unknown |
+--------------------+--------------+---------+---------+----+-----------+
When should the restart take place (e.g. 2024-08-01T14:11) [now]:
Are you sure you want to restart members postgres-cluster-0, postgres-cluster-1, postgres-cluster-2? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Failed: restart for member postgres-cluster-0, status code=503, (postgres is still starting)
Failed: restart for member postgres-cluster-1, status code=503, (postgres is still starting)
Failed: restart for member postgres-cluster-2, status code=503, (postgres is still starting)
- Restarting operator deployment, restarting cluster statefulset.
- Purging and deleting the cluster, starting from scratch, the issue persists.
Please let me know what other information I could attach to help with the understanding of this issue.