Cannot start a standby cluster (Patroni incorrectly promotes itself as master and stops replicating)
Please, answer some short questions which should help us to understand your problem / question better?
- Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.5.0
- Where do you run it - cloud or metal? Kubernetes or OpenShift? [Bare Metal K8s (RKE on a bunch of OCI VMs)]
- Are you running Postgres Operator in production? [no, because this is our very first experiment, so we'll move it to production as soon as this test succeeds!]
- Type of issue? [Bug report]
Cluster log:
2020-11-16 16:15:53,019 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?) 2020-11-16 16:15:53,031 - bootstrapping - INFO - Looks like your running openstack 2020-11-16 16:15:53,082 - bootstrapping - INFO - Configuring pam-oauth2 2020-11-16 16:15:53,082 - bootstrapping - INFO - No PAM_OAUTH2 configuration was specified, skipping 2020-11-16 16:15:53,083 - bootstrapping - INFO - Configuring wal-e 2020-11-16 16:15:53,083 - bootstrapping - INFO - Configuring pgbouncer 2020-11-16 16:15:53,083 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping 2020-11-16 16:15:53,083 - bootstrapping - INFO - Configuring pgqd 2020-11-16 16:15:53,083 - bootstrapping - INFO - Configuring crontab 2020-11-16 16:15:53,083 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability 2020-11-16 16:15:53,083 - bootstrapping - INFO - Configuring patroni 2020-11-16 16:15:53,092 - bootstrapping - INFO - Writing to file /home/postgres/postgres.yml 2020-11-16 16:15:53,092 - bootstrapping - INFO - Configuring bootstrap 2020-11-16 16:15:53,093 - bootstrapping - INFO - Writing to file /home/postgres/.pgpass_timescaledb 2020-11-16 16:15:53,093 - bootstrapping - INFO - Configuring certificate 2020-11-16 16:15:53,093 - bootstrapping - INFO - Generating ssl certificate 2020-11-16 16:15:53,115 - bootstrapping - INFO - Configuring log 2020-11-16 16:15:53,116 - bootstrapping - INFO - Configuring standby-cluster 2020-11-16 16:15:54,394 INFO: No PostgreSQL configuration items changed, nothing to reload. 2020-11-16 16:15:54,396 INFO: Lock owner: None; I am exacto-timescaledb-0 2020-11-16 16:15:54,425 INFO: trying to bootstrap a new cluster 2020-11-16 16:15:54,426 INFO: Running custom bootstrap script: python3 /scripts/clone_with_basebackup.py --pgpass=/home/postgres/.pgpass_timescaledb --host=timescaledb --port=5432 --user="standby" 2020-11-16 16:15:54,473 INFO: cloning cluster exacto-timescaledb from "host=timescaledb port=5432 user=standby dbname=postgres" 2020-11-16 16:16:04,897 INFO: Lock owner: None; I am exacto-timescaledb-0 2020-11-16 16:16:04,897 INFO: not healthy enough for leader race 2020-11-16 16:16:04,906 INFO: bootstrap in progress
The last 3 lines repeat identically every 10 seconds, then a different line appears:
```
2020-11-16 16:15:54,394 INFO: No PostgreSQL configuration items changed, nothing to reload.
```
Then the same 3 lines repeat every 10 seconds for approximately 9 minutes, and finally:
```
2020-11-16 16:29:03 pg_basebackup: error: could not read COPY data: server closed the connection unexpectedly
2020-11-16 16:29:03 This probably means the server terminated abnormally
2020-11-16 16:29:03 before or while processing the request.
2020-11-16 16:29:03 pg_basebackup: removing contents of data directory "/home/postgres/pgdata/pgroot/data"
```
At the same time, the log of the source Postgres cluster ("timescaledb") says:
```
2020-11-16 16:29:03.474 UTC [3223] LOG: could not send data to client: Connection reset by peer
2020-11-16 16:29:03.474 UTC [3223] ERROR: base backup could not send data, aborting backup
2020-11-16 16:29:03.474 UTC [3223] LOG: could not send data to client: Broken pipe
2020-11-16 16:29:03.474 UTC [3223] FATAL: connection to client lost
2020-11-16 16:29:06.420 UTC [3235] LOG: could not receive data from client: Connection reset by peer
2020-11-16 16:29:06.420 UTC [3235] LOG: unexpected EOF on standby connection
```
Back to the failing standby cluster:
```
2020-11-16 16:29:06,420 ERROR: Clone failed
2020-11-16 16:29:06 Traceback (most recent call last):
2020-11-16 16:29:06   File "/scripts/clone_with_basebackup.py", line 61, in main
2020-11-16 16:29:06     run_basebackup(options)
2020-11-16 16:29:06   File "/scripts/clone_with_basebackup.py", line 54, in run_basebackup
2020-11-16 16:29:06     raise Exception("pg_basebackup exited with code={0}".format(ret))
2020-11-16 16:29:06 Exception: pg_basebackup exited with code=1
2020-11-16 16:29:06,431 INFO: removing initialize key after failed attempt to bootstrap the cluster
2020-11-16 16:29:06,479 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data_2020-11-16-16-29-06
2020-11-16 16:29:06,494 ERROR: Unexpected exception
2020-11-16 16:29:06 Traceback (most recent call last):
2020-11-16 16:29:06   File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1385, in run_cycle
2020-11-16 16:29:06     info = self._run_cycle()
2020-11-16 16:29:06   File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1288, in _run_cycle
2020-11-16 16:29:06     return self.post_bootstrap()
2020-11-16 16:29:06   File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1184, in post_bootstrap
2020-11-16 16:29:06     self.cancel_initialization()
2020-11-16 16:29:06   File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1179, in cancel_initialization
2020-11-16 16:29:06     raise PatroniException('Failed to bootstrap cluster')
2020-11-16 16:29:06 patroni.exceptions.PatroniException: 'Failed to bootstrap cluster'
2020-11-16 16:29:06,495 INFO: Unexpected exception raised, please report it as a BUG
```
The last 13 lines repeat endlessly every 10 seconds. I was only able to extract these logs thanks to my Grafana/Loki stack; otherwise I would have been left without any logs at all.
Source and target clusters seem to blame each other for the connection closure... Please advise, as we need to complete this first test before we can move postgres-operator to production as soon as possible! Thank you for this admirable piece of software!
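In case it helps to reproduce the problem, the clone step can also be run by hand from inside the standby pod. This is only a sketch: host, port and pgpass path are taken from the Patroni log above, and the target directory is a scratch path of my choosing so it doesn't clash with Patroni's own data dir.
```
kubectl exec -it exacto-timescaledb-0 -n biot -- bash
# inside the pod, roughly what the bootstrap script does:
PGPASSFILE=/home/postgres/.pgpass_timescaledb pg_basebackup \
  -h timescaledb -p 5432 -U standby \
  -D /home/postgres/pgdata/pgroot/data_manual_test \
  -X stream --checkpoint=fast -v -P
```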
This morning the pg_basebackup step went fine, after deleting the cluster and recreating it with the fairly minimal standby-manifest.yaml:
```
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: exacto-timescaledb
  namespace: biot
spec:
  teamId: "exacto"
  volume:
    size: 1Ti
  numberOfInstances: 1
  postgresql:
    version: "12"
  dockerImage: docker-registry.exacto-energy.com/exacto/exacto-spilo:test
  # Make this a standby cluster and provide the s3 bucket path of source cluster for continuous streaming.
  standby:
    s3_wal_path: "s3://timescaledb-wal/"
  clone:
    cluster: timescaledb
```
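Since the operator renders this manifest into the Spilo/Patroni configuration inside the pod (the log above shows it writing /home/postgres/postgres.yml), the generated configuration can be double-checked; this is just a sketch, with 8008 being the usual Patroni REST port in Spilo.
```
kubectl exec exacto-timescaledb-0 -n biot -- cat /home/postgres/postgres.yml
# or ask the running Patroni for its effective dynamic configuration:
kubectl exec exacto-timescaledb-0 -n biot -- curl -s http://localhost:8008/config
```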
A custom image is needed because of a strange combination of locales on the source cluster, which I cannot fix right now (800 GB of streaming data...). Here's the Dockerfile:
```
FROM registry.opensource.zalan.do/acid/spilo-12:1.6-p5
RUN DEBIAN_FRONTEND=noninteractive apt-get update \
 && DEBIAN_FRONTEND=noninteractive apt-get install -y locales \
 && locale-gen it_IT \
 && locale-gen it_IT.UTF-8 \
 && locale-gen en_US \
 && locale-gen en_US.UTF-8 \
 && update-locale
```
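For completeness, the image referenced by dockerImage in the manifest above is built and pushed roughly like this (assuming the Dockerfile sits in the current directory):
```
docker build -t docker-registry.exacto-energy.com/exacto/exacto-spilo:test .
docker push docker-registry.exacto-energy.com/exacto/exacto-spilo:test
```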
Here's the original error, right after pg_basebackup completion:
```
2020-11-17 11:26:24,387 INFO: Lock owner: None; I am exacto-timescaledb-0
2020-11-17 11:26:24,388 INFO: not healthy enough for leader race
2020-11-17 11:26:24,388 INFO: bootstrap in progress
2020-11-17 11:26:28,005 maybe_pg_upgrade INFO: No PostgreSQL configuration items changed, nothing to reload.
2020-11-17 11:26:28.790 UTC [727] LOG: invalid value for parameter "lc_messages": "it_IT.UTF-8"
2020-11-17 11:26:28.790 UTC [727] LOG: invalid value for parameter "lc_monetary": "it_IT.UTF-8"
2020-11-17 11:26:28.790 UTC [727] LOG: invalid value for parameter "lc_numeric": "it_IT.UTF-8"
2020-11-17 11:26:28.790 UTC [727] LOG: invalid value for parameter "lc_time": "it_IT.UTF-8"
2020-11-17 11:26:28 UTC [727]: [5-1] 5fb3b364.2d7 0 FATAL: configuration file "/home/postgres/pgdata/pgroot/data/postgresql.base.conf" contains errors
2020-11-17 11:26:28,796 INFO: postmaster pid=727
/var/run/postgresql:5432 - no response
2020-11-17 11:26:28,893 INFO: removing initialize key after failed attempt to bootstrap the cluster
2020-11-17 11:26:28,915 INFO: renaming data directory to /home/postgres/pgdata/pgroot/data_2020-11-17-11-26-28
2020-11-17 11:26:28,936 ERROR: Unexpected exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1385, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1288, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1184, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.6/dist-packages/patroni/ha.py", line 1179, in cancel_initialization
    raise PatroniException('Failed to bootstrap cluster')
patroni.exceptions.PatroniException: 'Failed to bootstrap cluster'
2020-11-17 11:26:28,957 INFO: Unexpected exception raised, please report it as a BUG
2020-11-17 11:26:28,959 ERROR: Exception when saving file: /home/postgres/pgdata/pgroot/data/patroni.dynamic.json
```
So I've created the custom image exacto-spilo with the added locales, then applied it:
```
2020-11-17 11:53:35,334 INFO: No PostgreSQL configuration items changed, nothing to reload.
2020-11-17 11:53:35,344 WARNING: Postgresql is not running.
2020-11-17 11:53:35,344 INFO: Lock owner: None; I am exacto-timescaledb-0
2020-11-17 11:53:35,347 INFO: Lock owner: None; I am exacto-timescaledb-0
```
This pg_controldata output was dumped to the log twice; I'm showing it only once...
```
2020-11-17 11:53:35,347 INFO: pg_controldata:
  pg_control version number: 1201
  Catalog version number: 201909212
  Database system identifier: 6873431027336362252
  Database cluster state: shut down in recovery
  pg_control last modified: Tue Nov 17 11:53:10 2020
  Latest checkpoint location: C9C/95000028
  Latest checkpoint's REDO location: C9C/95000028
  Latest checkpoint's REDO WAL file: 0000000100000C9C00000095
  Latest checkpoint's TimeLineID: 1
  Latest checkpoint's PrevTimeLineID: 1
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 1:3413210839
  Latest checkpoint's NextOID: 22820743
  Latest checkpoint's NextMultiXactId: 786600768
  Latest checkpoint's NextMultiOffset: 2953219959
  Latest checkpoint's oldestXID: 3213300797
  Latest checkpoint's oldestXID's DB: 16594
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 677297576
  Latest checkpoint's oldestMulti's DB: 16594
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Tue Nov 17 11:44:30 2020
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: C9C/950000A0
  Min recovery ending loc's timeline: 1
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float4 argument passing: by value
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: 6bb4bf2b5744ccb76a9720635949ac5143bba6427e0a92e4dc70847af73cec4b
```
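If it helps, the same fields can be re-checked on demand from outside the container; the binary path below is an assumption based on the usual Spilo layout for PostgreSQL 12.
```
kubectl exec exacto-timescaledb-0 -n biot -- \
  /usr/lib/postgresql/12/bin/pg_controldata -D /home/postgres/pgdata/pgroot/data \
  | grep -E "cluster state|TimeLineID"
```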
Patroni correctly starts as a secondary:
```
2020-11-17 11:53:35,347 INFO: Lock owner: None; I am exacto-timescaledb-0
2020-11-17 11:53:35,360 INFO: Lock owner: None; I am exacto-timescaledb-0
2020-11-17 11:53:35,376 INFO: starting as a secondary
2020-11-17 11:53:35,554 INFO: postmaster pid=2323
2020-11-17 11:53:35 UTC [2323]: [1-1] 5fb3b9bf.913 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2020-11-17 11:53:35 UTC [2323]: [2-1] 5fb3b9bf.913 0 LOG: pg_stat_kcache.linux_hz is set to 1000000
2020-11-17 11:53:35 UTC [2323]: [3-1] 5fb3b9bf.913 0 LOG: starting PostgreSQL 12.4 (Ubuntu 12.4-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
2020-11-17 11:53:35 UTC [2323]: [4-1] 5fb3b9bf.913 0 LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-11-17 11:53:35 UTC [2323]: [5-1] 5fb3b9bf.913 0 LOG: listening on IPv6 address "::", port 5432
/var/run/postgresql:5432 - no response
2020-11-17 11:53:35 UTC [2323]: [6-1] 5fb3b9bf.913 0 LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-11-17 11:53:35 UTC [2323]: [7-1] 5fb3b9bf.913 0 LOG: redirecting log output to logging collector process
2020-11-17 11:53:35 UTC [2323]: [8-1] 5fb3b9bf.913 0 HINT: Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - accepting connections
2020-11-17 11:53:36,595 INFO: establishing a new patroni connection to the postgres cluster
```
(Every line was duplicated in the Loki output and interleaved with its label metadata; I've kept a single copy of each here.)
All seemed to be working fine up to this point, but then:
```
2020-11-17 11:53:36,673 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
```
OK, maybe the watchdog is not important, but... what about the replication connection?
```
2020-11-17 11:53:36,681 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/__init__.py", line 735, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/__init__.py", line 730, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor
    with psycopg2.connect(**kwargs) as conn:
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL: password authentication failed for user "standby"
```
And this makes no sense, as the "standby" user was created by the replica itself and obviously has the same password it had on the old cluster, which is saved in its own k8s secret...
And another insane connection failure, as pg_hba.conf is managed by Patroni itself:
```
FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "standby", SSL off
2020-11-17 11:53:36,681 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/__init__.py", line 735, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/__init__.py", line 730, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor
    with psycopg2.connect(**kwargs) as conn:
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL: password authentication failed for user "standby"
FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "standby", SSL off
```
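To rule out a plain credentials mismatch, the password the standby cluster uses can be compared with the one on the source cluster, and the same local replication connection Patroni attempts can be tried by hand. This is a sketch: the secret names assume the operator's usual <username>.<clustername>.credentials pattern, so verify them with `kubectl get secrets -n biot` first.
```
# what password does each cluster's "standby" secret hold?
kubectl get secret standby.exacto-timescaledb.credentials -n biot -o jsonpath='{.data.password}' | base64 -d; echo
kubectl get secret standby.timescaledb.credentials -n biot -o jsonpath='{.data.password}' | base64 -d; echo

# try the same local physical replication connection Patroni is attempting (it will prompt for the password):
kubectl exec -it exacto-timescaledb-0 -n biot -- \
  psql "host=127.0.0.1 port=5432 user=standby dbname=postgres replication=yes" -c "IDENTIFY_SYSTEM;"
```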
Then Patroni decides it's time to promote itself:
```
2020-11-17 11:53:36,733 INFO: promoted self to leader by acquiring session lock
2020-11-17 11:53:36,734 INFO: Lock owner: exacto-timescaledb-0; I am exacto-timescaledb-0
server promoting
2020-11-17 11:53:36,738 INFO: cleared rewind state after becoming the leader
/scripts/on_role_change.sh: line 6: on_role_change: command not found
2020-11-17 11:53:36,746 INFO: Lock owner: exacto-timescaledb-0; I am exacto-timescaledb-0
2020-11-17 11:53:36,765 ERROR: Can not fetch local timeline and lsn from replication connection
```
Same error again and again:
```
FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "standby", SSL off
```
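The entries Patroni actually wrote can be inspected directly, assuming the default layout where pg_hba.conf lives in the data directory (the path comes from the logs above):
```
kubectl exec exacto-timescaledb-0 -n biot -- \
  cat /home/postgres/pgdata/pgroot/data/pg_hba.conf | grep -i replication
```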
Finally, a quick check of the WAL status on the old and new clusters clearly shows that they have diverged:
```
kubectl exec timescaledb-0 -n biot -- psql -U postgres -c "SELECT pg_current_wal_lsn();"
 pg_current_wal_lsn
--------------------
 C9D/A22C2F38

kubectl exec exacto-timescaledb-0 -n biot -- psql -U postgres -c "SELECT pg_current_wal_lsn();"
 pg_current_wal_lsn
--------------------
 C9C/98000000
```
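For what it's worth, pg_current_wal_lsn() only succeeds on the new cluster because it has already promoted itself; a genuine standby would still be in recovery, which can be double-checked with standard PostgreSQL functions (a sketch):
```
kubectl exec exacto-timescaledb-0 -n biot -- psql -U postgres -c \
  "SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"
# compare the timelines recorded in pg_control on both sides:
kubectl exec timescaledb-0 -n biot -- psql -U postgres -c \
  "SELECT timeline_id FROM pg_control_checkpoint();"
kubectl exec exacto-timescaledb-0 -n biot -- psql -U postgres -c \
  "SELECT timeline_id FROM pg_control_checkpoint();"
```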
What am I missing here? Thank you...
Ok, so I'll summarize the open points:
- The standby cluster configuration written to Patroni is incomplete:
```
bootstrap:
  clone_with_basebackup:
    command: python3 /scripts/clone_with_basebackup.py --pgpass=/home/postgres/.pgpass_timescaledb --host=timescaledb --port=5432 --user="standby"
```
but it is missing both the --datadir and --scope parameters! So it only works when launched manually, after amending the call as sketched after this list.
- After the (lengthy!) basebackup finishes, Patroni tries to bootstrap the standby cluster, but gets a very strange
```
psycopg2.OperationalError: FATAL: password authentication failed for user "standby"
```
where the "standby" user has just been successfully used to do the basebackup! How can it be?
- And after this error, a pretty similar one appears:
```
psycopg2.OperationalError: FATAL: password authentication failed for user "standby"
FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "standby", SSL off
```
once again, it's the same "standby" user... how can it be?
- Right after that second access error, Patroni decides to promote itself:
```
2020-11-17 11:53:36,733 INFO: promoted self to leader by acquiring session lock
```
so it starts writing tables and functions to the DB, ruining the data_dir, which is no longer usable as a basebackup since the WAL timelines now diverge! So I need to trash it and start over again, but with fewer hopes...
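For reference, this is roughly what the manually amended clone call from the first point looks like; the --scope and --datadir values are taken from the Patroni log above, so treat it as a sketch rather than the canonical invocation:
```
# run inside the standby pod; scope/datadir values are assumptions based on the log output above
python3 /scripts/clone_with_basebackup.py \
  --scope=exacto-timescaledb \
  --datadir=/home/postgres/pgdata/pgroot/data \
  --pgpass=/home/postgres/.pgpass_timescaledb \
  --host=timescaledb --port=5432 --user="standby"
```
And this is the minimal sketch of how I plan to throw away the diverged data dir and retry from scratch; the PVC name assumes the operator's usual pgdata-<cluster>-<ordinal> pattern, so check it with `kubectl get pvc -n biot` first:
```
kubectl delete postgresql exacto-timescaledb -n biot
kubectl delete pvc pgdata-exacto-timescaledb-0 -n biot   # name pattern is an assumption, verify first
kubectl apply -f standby-manifest.yaml
```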
@adacaccia I'm facing the same issue, were you able to solve it?