
Consistent Crash Loops Attempting to Bootstrap Leader and Failing Replicas

Danieloni1 opened this issue 1 year ago • 0 comments

  • Which image of the operator are you using? ghcr.io/zalando/postgres-operator:v1.12.2
  • Where do you run it - cloud or metal? Kubernetes or OpenShift? Bare Metal K8s
  • Are you running Postgres Operator in production? Yes
  • Type of issue? Bug report / request for help

This is happening to me on v1.12.2. Since it is a FATAL error, it appears to break my cluster every time I deploy (bare metal). I have tried deleting the whole cluster, and the issue reproduces consistently.

$ kubectl exec -it -n postgres-operator postgres-cluster-0 -- patronictl show-config
failsafe_mode: false
loop_wait: 10
maximum_lag_on_failover: 33554432
postgresql:
  parameters:
    archive_mode: 'on'
    archive_timeout: 1800s
    autovacuum_analyze_scale_factor: 0.02
    autovacuum_max_workers: 5
    autovacuum_vacuum_scale_factor: 0.05
    checkpoint_completion_target: 0.9
    hot_standby: 'on'
    log_autovacuum_min_duration: 0
    log_checkpoints: 'on'
    log_connections: 'on'
    log_disconnections: 'on'
    log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
    log_lock_waits: 'on'
    log_min_duration_statement: 500
    log_statement: ddl
    log_temp_files: 0
    max_connections: '640'
    max_replication_slots: 10
    max_wal_senders: 10
    tcp_keepalives_idle: 900
    tcp_keepalives_interval: 100
    track_functions: all
    wal_compression: 'on'
    wal_level: hot_standby
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30

And here is the repeating sequence of ERROR and FATAL log entries:

# kubectl logs -n postgres-operator postgres-cluster-1 | head -n 150

...

2024-07-31 11:06:52,579 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 1078, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 1073, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 157, in get_connection_cursor
    conn = psycopg.connect(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 103, in connect
    ret = _connect(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  role "standby" does not exist

2024-07-31 11:06:52,631 INFO: promoted self to leader by acquiring session lock
2024-07-31 11:06:52,678 INFO: Lock owner: postgres-cluster-1; I am postgres-cluster-1
2024-07-31 11:06:52,740 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 1078, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 1073, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 157, in get_connection_cursor
    conn = psycopg.connect(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 103, in connect
    ret = _connect(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  role "standby" does not exist...

...

The cluster's resources are sufficient (memory, CPU and storage).
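
As far as I understand, the FATAL line means Patroni is opening its local replication connection as a role named standby that does not exist in the database. A quick check on a pod (assuming the postgres superuser is reachable over the local socket) would be something like:

$ kubectl exec -it -n postgres-operator postgres-cluster-1 -- psql -U postgres -c "SELECT rolname, rolreplication FROM pg_roles;"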

Fixes I have attempted, without success:

  • Running kubectl exec -it -n postgres-operator postgres-cluster-0 -- patronictl restart postgres-cluster, with the following output:
$ kubectl exec -it -n postgres-operator postgres-cluster-0 -- patronictl restart postgres-cluster
+ Cluster: postgres-cluster (7397754784576811067) ------+----+-----------+
| Member             | Host         | Role    | State   | TL | Lag in MB |
+--------------------+--------------+---------+---------+----+-----------+
| postgres-cluster-0 | 10.99.247.35 | Replica | stopped |    |   unknown |
| postgres-cluster-1 | 10.99.247.38 | Replica | stopped |    |   unknown |
| postgres-cluster-2 | 10.99.247.34 | Replica | stopped |    |   unknown |
+--------------------+--------------+---------+---------+----+-----------+
When should the restart take place (e.g. 2024-08-01T14:11)  [now]:
Are you sure you want to restart members postgres-cluster-0, postgres-cluster-1, postgres-cluster-2? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2)  []:
Failed: restart for member postgres-cluster-0, status code=503, (postgres is still starting)
Failed: restart for member postgres-cluster-1, status code=503, (postgres is still starting)
Failed: restart for member postgres-cluster-2, status code=503, (postgres is still starting)
  • Restarting the operator deployment and the cluster StatefulSet (rough commands shown after this list).
  • Purging and deleting the cluster and starting from scratch; the issue still persists.
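
For reference, the restarts mentioned above were ordinary rollout restarts, roughly along these lines (the operator deployment name here is assumed; namespace and StatefulSet name as above):

$ kubectl rollout restart deployment postgres-operator -n postgres-operator
$ kubectl rollout restart statefulset postgres-cluster -n postgres-operator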

Please let me know what other information I can attach to help diagnose this issue.
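
For example, I can attach the operator-managed credential secrets for this cluster (with values redacted), listed with something like:

$ kubectl get secrets -n postgres-operator | grep credentials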

Danieloni1 · Aug 01 '24 11:08