[bitnami/postgresql-ha] Severe bug with postgresql-ha on Kubernetes: when a downed node restarts, repmgr ignores the newly elected primary and promotes itself as primary, causing split brain. The witness doesn't help at all.
Name and Version
bitnami/postgresql-ha: latest
What architecture are you using?
amd64
What steps will reproduce the bug?
This is on an Azure AKS cluster with 600m CPU and 1000Mi memory per node.
I've delayed the start with a command, so there is no conceivable reason why the downed node, when it restarts, should mark itself as the primary node when there is a clearly elected new primary.
In AKS I am deleting the primary node for a failover test. I have watched the election of the standby and added a startup delay plus adjusted liveness probe settings, so that everything is delayed accordingly: the deleted node comes back well after the standby has been promoted. There should be ample time for the cluster to work out the correct action, which is for the returning node to rejoin as a standby rather than claim the primary role. If it just worked this way it would be great.
The main thing that strikes me is the "pretending primary" behaviour, which seems bizarre. It's as if the node sees what is correct but ignores it and promotes itself anyway.
The other issue is that the restarting node isn't even connecting to the other 2 nodes, and I have the minimum set at 2; whatever that setting is doing, it is not blocking the takeover even though there are 0 standbys and the witness never reconnects/attaches to it, which is probably why the witness isn't helping here. Nor does the new primary seem bothered that a node is missing; it just seems to ping it over and over without reaching it. Again, I feel this could be an issue where the restarting node 0 isn't connecting to or checking with the witness to see what is what. The witness should be able to help restarting nodes more robustly; I don't know if that is a thing, but it would make the witness much more valuable.
The old node coming back online:
postgresql-repmgr 05:48:30.34
postgresql-repmgr 05:48:30.34 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 05:48:30.34 Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 05:48:30.35 Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 05:48:30.35
postgresql-repmgr 05:48:30.36 DEBUG ==> Configuring libnss_wrapper...
postgresql-repmgr 05:48:30.37 INFO ==> ** Starting PostgreSQL with Replication Manager setup **
postgresql-repmgr 05:48:30.43 INFO ==> Validating settings in REPMGR_* env vars...
postgresql-repmgr 05:48:30.43 INFO ==> Validating settings in POSTGRESQL_* env vars..
postgresql-repmgr 05:48:30.43 INFO ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 05:48:30.44 DEBUG ==> Checking node 'xxxxx-postgresql-ha-postgresql-0.xxxxx-postgresql-ha-postgresql-headless.xxxxx-postgresql-ha.svc.cluster.local:5432'...
psql: error: connection to server at "xxxxx-postgresql-ha-postgresql-0.xxxxx-postgresql-ha-postgresql-headless.xxxxx-postgresql-ha.svc.cluster.local" (10.244.11.5), port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
postgresql-repmgr 05:48:30.46 DEBUG ==> Skipping: failed to get primary from the node 'xxxxx-postgresql-ha-postgresql-0.xxxxx-postgresql-ha-postgresql-headless.xxxxx-postgresql-ha.svc.cluster.local:5432'!
postgresql-repmgr 05:48:30.46 DEBUG ==> Checking node 'xxxxx-postgresql-ha-postgresql-1.xxxx-postgresql-ha-postgresql-headless.xxxx-postgresql-ha.svc.cluster.local:5432'...
postgresql-repmgr 05:48:30.56 DEBUG ==> Pretending primary role node - 'xxxxx-postgresql-ha-postgresql-1.xxxxx-postgresql-ha-postgresql-headless.xxxxx-postgresql-ha.svc.cluster.local:5432'
postgresql-repmgr 05:48:30.56 DEBUG ==> Pretending primary set to 'xxxxx-postgresql-ha-postgresql-1.xxxxx-postgresql-ha-postgresql-headless.xxxxx-postgresql-ha.svc.cluster.local:5432'!
postgresql-repmgr 05:48:30.57 DEBUG ==> Checking node 'xxxxx-postgresql-ha-postgresql-2.xxxxx-postgresql-ha-postgresql-headless.xxxx-postgresql-ha.svc.cluster.local:5432'...
postgresql-repmgr 05:48:30.62 DEBUG ==> Pretending primary role node - 'xxxxx-postgresql-ha-postgresql-0.xxxxx-postgresql-ha-postgresql-headless.xxxxx-postgresql-ha.svc.cluster.local:5432'
**postgresql-repmgr 05:48:30.62 WARN ==> Conflict of pretending primary role nodes (previously: 'xxxxx-postgresql-ha-postgresql-1.xxxxx-postgresql-ha-postgresql-headless.xxxxx-postgresql-ha.svc.cluster.local:5432', now: 'xxxx-postgresql-ha-postgresql-0.xxxxx-postgresql-ha-postgresql-headless.xxxxx-postgresql-ha.svc.cluster.local:5432')**
postgresql-repmgr 05:48:30.63 INFO ==> This node was acting as a primary before restart!
postgresql-repmgr 05:48:30.63 INFO ==> Can not find new primary. Starting PostgreSQL normally...
postgresql-repmgr 05:48:30.63 INFO ==> There are no nodes with primary role. Assuming the primary role...
postgresql-repmgr 05:48:30.64 INFO ==> Preparing PostgreSQL configuration...
postgresql-repmgr 05:48:30.64 DEBUG ==> Injecting a new postgresql.conf file...
postgresql-repmgr 05:48:30.65 INFO ==> postgresql.conf file not detected. Generating it...
postgresql-repmgr 05:48:30.84 DEBUG ==> Injecting a new pg_hba.conf file...
postgresql-repmgr 05:48:30.85 INFO ==> Preparing repmgr configuration...
postgresql-repmgr 05:48:30.86 DEBUG ==> Node ID: '1000', Rol: 'primary', Primary Node: ':5432'
postgresql-repmgr 05:48:30.86 INFO ==> Initializing Repmgr...
postgresql-repmgr 05:48:30.87 INFO ==> Initializing PostgreSQL database...
postgresql-repmgr 05:48:30.91 DEBUG ==> Copying files from /bitnami/postgresql/conf to /opt/bitnami/postgresql/conf
postgresql-repmgr 05:48:30.92 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql-repmgr 05:48:30.92 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql-repmgr 05:48:30.92 DEBUG ==> Ensuring expected directories/files exist...
postgresql-repmgr 05:48:30.95 INFO ==> Deploying PostgreSQL with persisted data...
postgresql-repmgr 05:48:31.01 INFO ==> Configuring replication parameters
postgresql-repmgr 05:48:31.04 INFO ==> Configuring fsync
postgresql-repmgr 05:48:31.06 DEBUG ==> Skipping repmgr configuration...
postgresql-repmgr 05:48:31.06 INFO ==> ** PostgreSQL with Replication Manager setup finished! **
postgresql-repmgr 05:48:31.13 INFO ==> Starting PostgreSQL in background...
waiting for server to start....2023-11-16 05:48:31.163 GMT [230] LOG: pgaudit extension initialized
2023-11-16 05:48:31.176 GMT [230] LOG: redirecting log output to logging collector process
2023-11-16 05:48:31.176 GMT [230] HINT: Future log output will appear in directory "/opt/bitnami/postgresql/logs".
2023-11-16 05:48:31.176 GMT [230] LOG: starting PostgreSQL 16.0 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-11-16 05:48:31.211 GMT [230] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-11-16 05:48:31.211 GMT [230] LOG: listening on IPv6 address "::", port 5432
2023-11-16 05:48:31.224 GMT [230] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-11-16 05:48:31.241 GMT [234] LOG: database system was shut down at 2023-11-16 05:47:07 GMT
2023-11-16 05:48:31.256 GMT [230] LOG: database system is ready to accept connections
done
server started
postgresql-repmgr 05:48:31.36 INFO ==> ** Starting repmgrd **
[2023-11-16 05:48:31] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
[2023-11-16 05:48:31] [INFO] connecting to database "user=repmgr password=crabhiv!_AE732lias-d=wec host=xxxx-postgresql-ha-postgresql-0.xxxx-postgresql-ha-postgresql-headless.xxxx-postgresql-ha.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5"
[2023-11-16 05:48:31] [DEBUG] connecting to: "user=repmgr password=crabhiv!_AE732lias-d=wec connect_timeout=5 dbname=repmgr host=xxxx-postgresql-ha-postgresql-0.xxxxx-postgresql-ha-postgresql-headless.xxxxx-postgresql-ha.svc.cluster.local port=5432 fallback_application_name=repmgr options=-csearch_path="
[2023-11-16 05:48:31] [DEBUG] node id is 1000, upstream node id is -1
INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-11-16 05:48:31] [NOTICE] starting monitoring of node "xxxxx-postgresql-ha-postgresql-0" (ID: 1000)
[2023-11-16 05:48:31] [INFO] "connection_check_type" set to "ping"
[2023-11-16 05:48:31] [INFO] executing notification command for event "repmgrd_start"
[2023-11-16 05:48:31] [DETAIL] command is:
/opt/bitnami/repmgr/events/router.sh 1000 repmgrd_start 1 "2023-11-16 05:48:31.419816+00" "monitoring cluster primary \"xxxxx-postgresql-ha-postgresql-0\" (ID: 1000)"
[2023-11-16 05:48:31] [NOTICE] monitoring cluster primary "xxxxx-postgresql-ha-postgresql-0" (ID: 1000)
[2023-11-16 05:48:31] [DEBUG] connecting to: "user=repmgr password=crabhiv!_AE732lias-d=wec connect_timeout=5 dbname=repmgr host=xxxxx-postgresql-ha-postgresql-witness-0.xxxxx-postgresql-ha-postgresql-witness.xxxxx-postgresql-ha.svc.cluster.local port=5432 fallback_application_name=repmgr options=-csearch_path="
[2023-11-16 05:48:31] [DEBUG] witness node 2000's upstream node ID reported as 1001
[2023-11-16 05:48:31] [INFO] child node "xxxx-postgresql-ha-postgresql-witness-0" (ID: 2000) is not yet attached
[2023-11-16 05:48:31] [INFO] child node "xxxxx-postgresql-ha-postgresql-1" (ID: 1001) is not yet attached
[2023-11-16 05:48:31] [INFO] child node "xxxxx-postgresql-ha-postgresql-2" (ID: 1002) is not yet attached
[2023-11-16 05:48:37] [DEBUG] connecting to: "user=repmgr password=crabhiv!_AE732lias-d=wec connect_timeout=5 dbname=repmgr host=xxxxx-postgresql-ha-postgresql-witness-0.xxxxx-postgresql-ha-postgresql-witness.xxxxx-postgresql-ha.svc.cluster.local port=5432 fallback_application_name=repmgr options=-csearch_path="
[2023-11-16 05:48:37] [DEBUG] witness node 2000's upstream node ID reported as 1001
[2023-11-16 05:48:37] [DEBUG] child node: 2000; attached: no
[2023-11-16 05:48:37] [DEBUG] child node: 1001; attached: no
[2023-11-16 05:48:37] [DEBUG] child node: 1002; attached: no
The witness node reattaching:
INFO: node 1002 received notification to follow node 1001
[2023-11-16 05:47:13] [NOTICE] notifying node "xxxxx-postgresql-ha-postgresql-witness-0" (ID: 2000) to follow node 1001
INFO: node 2000 received notification to follow node 1001
[2023-11-16 05:47:13] [INFO] switching to primary monitoring mode
[2023-11-16 05:47:13] [INFO] executing notification command for event "repmgrd_reload"
[2023-11-16 05:47:13] [DETAIL] command is:
/opt/bitnami/repmgr/events/router.sh 1001 repmgrd_reload 1 "2023-11-16 05:47:13.996392+00" "monitoring cluster primary \"xxxxx-postgresql-ha-postgresql-1\" (ID: 1001)"
[2023-11-16 05:47:14] [NOTICE] monitoring cluster primary "xxxxx-postgresql-ha-postgresql-1" (ID: 1001)
[2023-11-16 05:47:20] [DEBUG] child node: 1002; attached: no
[2023-11-16 05:47:20] [DEBUG] connecting to: "user=repmgr password=crabhiv!_AE732lias-d=wec connect_timeout=5 dbname=repmgr host=xxxxx-postgresql-ha-postgresql-witness-0.xxxxx-postgresql-ha-postgresql-witness.xxxxx-postgresql-ha.svc.cluster.local port=5432 fallback_application_name=repmgr options=-csearch_path="
[2023-11-16 05:47:20] [DEBUG] witness node 2000's upstream node ID reported as 1001
[2023-11-16 05:47:20] [DEBUG] child node: 2000; attached: yes
The other node (node 2) reattaching to the new primary:
[2023-11-16 05:49:09] [DETAIL] command is:
/opt/bitnami/repmgr/events/router.sh 1001 child_node_new_connect 1 "2023-11-16 05:49:09.430905+00" "new standby \"xxxxx-postgresql-ha-postgresql-2\" (ID: 1002) has connected"
2023-11-16 05:49:11.360 GMT [1493] LOG: could not receive data from client: Connection reset by peer
[2023-11-16 05:49:15] [DEBUG] child node: 1002; attached: yes
The new primary continues in a hanging state, trying to connect to node 0 (the old primary):
[2023-11-16 06:14:38] [DEBUG] connecting to: "user=repmgr password=crabhiv!_AE732lias-d=wec connect_timeout=5 dbname=repmgr host=xxxxx-postgresql-ha-postgresql-witness-0.xxxxx-postgresql-ha-postgresql-witness.xxxxx-postgresql-ha.svc.cluster.local port=5432 fallback_application_name=repmgr options=-csearch_path="
The original startup of node 1 (of nodes 0, 1, 2). As you can see, node 1001 comes online and immediately knows that node 1000 is the primary node. This doesn't seem to happen correctly on the restarting node (first log above).
postgresql-repmgr 05:41:57.65 DEBUG ==> Injecting a new pg_hba.conf file...
postgresql-repmgr 05:41:57.66 INFO ==> Preparing repmgr configuration...
postgresql-repmgr 05:41:57.67 DEBUG ==> Node ID: '1001', Rol: 'standby', Primary Node: 'xxxxxx-postgresql-ha-postgresql-0.xxxxx-postgresql-ha-postgresql-headless.xxxxxx-postgresql-ha.svc.cluster.local:5432'
postgresql-repmgr 05:41:57.67 INFO ==> Initializing Repmgr...
postgresql-repmgr 05:41:57.72 INFO ==> Waiting for primary node...
The main error seems to be here. The conflict check could be resolved by consulting the witness, but it isn't; and the first candidate it found (node 1) is actually the correct primary. The fact that it says "previously" doesn't make any sense either: where would it even get that from? Before node 1000 got deleted, node 1001 was a standby. The logic issue has to be there.
**postgresql-repmgr 05:48:30.62 WARN ==> Conflict of pretending primary role nodes (previously: 'xxxxx-postgresql-ha-postgresql-1.xxxxx-postgresql-ha-postgresql-headless.xxxxx-postgresql-ha.svc.cluster.local:5432', now: 'xxxxxx-postgresql-ha-postgresql-0.xxxxx-postgresql-ha-postgresql-headless.xxxxx-postgresql-ha.svc.cluster.local:5432')**
Are you using any custom parameters or values?
Per another post, this setting is there to pause the startup. Perhaps there is another startup script I need to pause here as well.
postgresql:
  command:
    - /bin/bash
  args:
    - -ec
    - |
      sleep 60
      /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh /opt/bitnami/scripts/postgresql-repmgr/run.sh
The recovery setting below is so that the standby can eventually reattach (this works correctly); I'll write something up about it later. It is very valuable, as there needs to be some delay to let a node establish itself as a standby, since a restarting pod takes time to reconnect to the network and everything.
extendedConf: |-
  wal_keep_size = '1GB'
  recovery_min_apply_delay = 180s
What is the expected behavior?
The expected behavior is that when a primary node fails / is deleted / goes down, it comes back properly and rejoins the cluster accordingly instead of ending up as a dangling primary node.
What do you see instead?
The primary that went down comes back as a standalone primary and never properly rejoins the node set.
Additional information
Keep in mind, this is a severe issue that makes the chart unusable. After a node goes down, the new primary keeps pinging the missing node over and over because it can never find it; this shows up only in the debug logs, not the error logs, but it is happening constantly in the background. At the very least there should be a fail-safe mechanism in failover to get the pods back up and running correctly; there is no graceful failure here. These issues have been reported in the comments for a while (years). If you can point me to the right code that affects this exact issue, I'll fix it and open a pull request.
Hi @xtianus79
I have tried to reproduce your issue locally (using a Minikube cluster) just in case, but I was unable to hit it. I performed the same set of actions a couple of times and still got the same results. Could you please share the exact steps you are following so that I can try to replicate them and make sure I got it right? These are the steps I performed on my side:
- Write down a values.yaml file with the customised params:
$ cat values.yaml
postgresql:
  image:
    debug: true
  command:
    - /bin/bash
  args:
    - -ec
    - |
      sleep 60
      /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh /opt/bitnami/scripts/postgresql-repmgr/run.sh
  extendedConf: |-
    wal_keep_size = '1GB'
    recovery_min_apply_delay = 180s
witness:
  create: true
pgpool:
  image:
    debug: true
volumePermissions:
  enabled: true
- Install the chart using the aforementioned params and wait for it to be ready:
$ helm install psql-ha bitnami/postgresql-ha -f values.yaml
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
psql-ha-postgresql-ha-pgpool-78cfb6dc55-lvvgb 1/1 Running 1 (4m17s ago) 5m27s
psql-ha-postgresql-ha-postgresql-0 1/1 Running 0 5m27s
psql-ha-postgresql-ha-postgresql-1 1/1 Running 1 (3m50s ago) 5m27s
psql-ha-postgresql-ha-postgresql-2 1/1 Running 1 (3m51s ago) 5m27s
psql-ha-postgresql-ha-postgresql-witness-0 1/1 Running 1 (4m10s ago) 5m27s
# NODE 0 HAS BEEN ELECTED AS PRIMARY
$ kubectl logs psql-ha-postgresql-ha-postgresql-0
...
postgresql-repmgr 12:12:18.48 INFO ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 12:12:18.49 DEBUG ==> Checking node 'psql-ha-postgresql-ha-postgresql-0.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'...
psql: error: connection to server at "psql-ha-postgresql-ha-postgresql-0.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local" (10.244.0.235), port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
postgresql-repmgr 12:12:18.50 DEBUG ==> Skipping: failed to get primary from the node 'psql-ha-postgresql-ha-postgresql-0.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'!
postgresql-repmgr 12:12:18.51 DEBUG ==> Checking node 'psql-ha-postgresql-ha-postgresql-1.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'...
psql: error: could not translate host name "psql-ha-postgresql-ha-postgresql-1.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local" to address: Name or service not known
postgresql-repmgr 12:12:18.52 DEBUG ==> Skipping: failed to get primary from the node 'psql-ha-postgresql-ha-postgresql-1.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'!
postgresql-repmgr 12:12:18.52 DEBUG ==> Checking node 'psql-ha-postgresql-ha-postgresql-2.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'...
psql: error: could not translate host name "psql-ha-postgresql-ha-postgresql-2.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local" to address: Name or service not known
postgresql-repmgr 12:12:18.53 DEBUG ==> Skipping: failed to get primary from the node 'psql-ha-postgresql-ha-postgresql-2.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'!
postgresql-repmgr 12:12:18.54 INFO ==> There are no nodes with primary role. Assuming the primary role...
$ kubectl logs psql-ha-postgresql-ha-postgresql-witness-0
...
postgresql-repmgr 12:12:52.18 INFO ==> ** Starting repmgrd **
[2023-11-20 12:12:52] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-11-20 12:12:52] [NOTICE] starting monitoring of node "psql-ha-postgresql-ha-postgresql-witness-0" (ID: 2000)
2023-11-20 12:17:52.174 GMT [237] LOG: checkpoint starting: time
- Delete NODE 0 to simulate a failover situation and see that another NODE has been elected as PRIMARY:
$ kubectl delete pod/psql-ha-postgresql-ha-postgresql-0
$ kubectl logs psql-ha-postgresql-ha-postgresql-witness-0
...
[2023-11-20 12:17:56] [WARNING] unable to ping "user=repmgr password=FAigN0WLo4 host=psql-ha-postgresql-ha-postgresql-0.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5"
[2023-11-20 12:17:56] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-11-20 12:17:56] [WARNING] unable to ping "user=repmgr password=FAigN0WLo4 connect_timeout=5 dbname=repmgr host=psql-ha-postgresql-ha-postgresql-0.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local port=5432 fallback_application_name=repmgr"
[2023-11-20 12:17:56] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-11-20 12:17:59] [WARNING] unable to ping "user=repmgr password=FAigN0WLo4 connect_timeout=5 dbname=repmgr host=psql-ha-postgresql-ha-postgresql-0.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local port=5432 fallback_application_name=repmgr"
[2023-11-20 12:17:59] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-11-20 12:17:59] [WARNING] unable to reconnect to node 1000 after 2 attempts
[2023-11-20 12:18:01] [NOTICE] witness node "psql-ha-postgresql-ha-postgresql-witness-0" (ID: 2000) now following new primary node "psql-ha-postgresql-ha-postgresql-1" (ID: 1001)
- Wait for NODE 0 to come up and get elected as a STANDBY node:
$ kubectl logs psql-ha-postgresql-ha-postgresql-0
postgresql-repmgr 12:19:07.57 INFO ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 12:19:07.58 DEBUG ==> Checking node 'psql-ha-postgresql-ha-postgresql-0.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'...
psql: error: connection to server at "psql-ha-postgresql-ha-postgresql-0.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local" (10.244.0.238), port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
postgresql-repmgr 12:19:07.59 DEBUG ==> Skipping: failed to get primary from the node 'psql-ha-postgresql-ha-postgresql-0.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'!
postgresql-repmgr 12:19:07.59 DEBUG ==> Checking node 'psql-ha-postgresql-ha-postgresql-1.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'...
postgresql-repmgr 12:19:07.88 DEBUG ==> Pretending primary role node - 'psql-ha-postgresql-ha-postgresql-1.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'
postgresql-repmgr 12:19:07.88 DEBUG ==> Pretending primary set to 'psql-ha-postgresql-ha-postgresql-1.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'!
postgresql-repmgr 12:19:07.89 DEBUG ==> Checking node 'psql-ha-postgresql-ha-postgresql-2.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'...
psql: error: could not translate host name "psql-ha-postgresql-ha-postgresql-2.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local" to address: Name or service not known
postgresql-repmgr 12:19:07.90 DEBUG ==> Skipping: failed to get primary from the node 'psql-ha-postgresql-ha-postgresql-2.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'!
postgresql-repmgr 12:19:07.91 INFO ==> Auto-detected primary node: 'psql-ha-postgresql-ha-postgresql-1.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'
postgresql-repmgr 12:19:07.91 INFO ==> This node was acting as a primary before restart!
postgresql-repmgr 12:19:07.91 INFO ==> Current master is 'psql-ha-postgresql-ha-postgresql-1.psql-ha-postgresql-ha-postgresql-headless.default.svc.cluster.local:5432'. Cloning/rewinding it and acting as a standby node...
Hi @joancafom, I am doing the exact same steps you're doing, but the node's comeback is behaving oddly.
What's interesting is that if I delete it again it will resolve correctly; meaning, if I delete the primary and then delete it again when it comes back, on that subsequent time it resolves back to normal.
Perhaps I don't have enough compute at 600m per node; I've tried 1000m and it makes no difference. Also, can you tell me your key values configuration for the pg server, repmgr and witness, such as the readiness and liveness probes? I don't think startup probes are relevant here. Perhaps you have them as defaults?
Also, the code seems to be getting caught in the repmgr_get_upstream_node() loop. For some reason this happens when the node comes back.
The way the code reads, the node checks whether it itself is a primary node. If it finds that it is a primary node and there is another candidate primary node, it captures that node and raises the "Conflict of pretending primary role nodes" warning. The issue as I see it is that the witness is not consulted in this area of code to identify the true primary (see the sketch after the excerpt below).
elif [[ "$(echo "$primary_conninfo" | wc -l)" -eq 1 ]]; then
    suggested_primary_host="$(echo "$primary_conninfo" | awk -F 'host=' '{print $2}' | awk '{print $1}')"
    suggested_primary_port="$(echo "$primary_conninfo" | awk -F 'port=' '{print $2}' | awk '{print $1}')"
    debug "Pretending primary role node - '${suggested_primary_host}:${suggested_primary_port}'"
    if [[ -n "$pretending_primary_host" ]]; then
        if [[ "${pretending_primary_host}:${pretending_primary_port}" != "${suggested_primary_host}:${suggested_primary_port}" ]]; then
            warn "Conflict of pretending primary role nodes (previously: '${pretending_primary_host}:${pretending_primary_port}', now: '${suggested_primary_host}:${suggested_primary_port}')"
            pretending_primary_host="" && pretending_primary_port="" && break
        fi
    fi
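To illustrate what I mean by involving the witness, here is a rough, untested sketch of a tie-breaker that could run when the scan above ends in a conflict: ask the witness which node it currently records as the active primary in its repmgr.nodes table and prefer that answer. The helper name and WITNESS_HOST variable are hypothetical (the data nodes would need the witness address injected somehow); REPMGR_USERNAME, REPMGR_PASSWORD and REPMGR_DATABASE are existing container env vars.

# Hypothetical helper, not part of the chart's scripts
resolve_primary_via_witness() {
    local witness_host="${WITNESS_HOST:?witness host not set}"   # assumption: injected into the pod env
    local query="SELECT conninfo FROM repmgr.nodes WHERE type = 'primary' AND active = true"
    local conninfo
    conninfo="$(PGPASSWORD="$REPMGR_PASSWORD" psql -h "$witness_host" -p 5432 \
        -U "$REPMGR_USERNAME" -d "$REPMGR_DATABASE" -tA -c "$query" 2>/dev/null)" || return 1
    [[ -z "$conninfo" ]] && return 1
    # Reduce the stored conninfo string to host:port, same parsing style as the excerpt above
    echo "$(echo "$conninfo" | awk -F 'host=' '{print $2}' | awk '{print $1}'):$(echo "$conninfo" | awk -F 'port=' '{print $2}' | awk '{print $1}')"
}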
I think I solved it by getting the timings right. The main issue is that PG_REWIND is broken, and I think I figured out why over on this post. That was leading to the confusion of changing the settings in different ways with it still not working; once I removed the rewind and got the settings right, it works as expected.
The rewind issue is very related to this, and to the whole "can't find the repmgr_slot_xxxx" missing-slot issue writ large.
https://github.com/bitnami/containers/issues/52213
As mentioned in the associated PR, we have created an internal task to review this specific case. The thread will be updated when we have more news 😁
I just had a scenario where all my pods were starting up after applying a setup change, and they all became primary.
Script to install / upgrade in namespace postgres
#!/usr/bin/env bash
PASSWORD=$(kubectl get secret --namespace "postgres" postgres-postgresql-ha-postgresql -o jsonpath="{.data.password}" | base64 -d)
REPMGR_PASSWORD=$(kubectl get secret --namespace "postgres" postgres-postgresql-ha-postgresql -o jsonpath="{.data.repmgr-password}" | base64 -d)
ADMIN_PASSWORD=$(kubectl get secret --namespace "postgres" postgres-postgresql-ha-pgpool -o jsonpath="{.data.admin-password}" | base64 -d)
helm upgrade --install postgres oci://registry-1.docker.io/bitnamicharts/postgresql-ha --namespace postgres --version 12.3.2 --values values.yaml --set postgresql.password=$PASSWORD --set postgresql.repmgrPassword=$REPMGR_PASSWORD --set pgpool.adminPassword=$ADMIN_PASSWORD
Values
pgpool:
  customUsers:
    usernames: "**********************"
    passwords: "**************************"
  replicaCount: 3
  resources:
    limits:
      cpu: "3000m"
      memory: "1Gi"
    requests:
      cpu: "300m"
      memory: "500Mi"
postgresql:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - postgresql
              - key: app.kubernetes.io/instance
                operator: In
                values:
                  - postgres
          topologyKey: "kubernetes.io/hostname"
  livenessProbe:
    initialDelaySeconds: 3600
    # enabled: false
  usePgRewind: true
  extendedConf: |-
    wal_log_hints = on
  resources:
    limits:
      cpu: "3000m"
      memory: "4Gi"
    requests:
      cpu: "1000m"
      memory: "1Gi"
volumePermissions:
  enabled: true
1st pod logs
postgresql-repmgr 23:21:02.45 INFO ==> Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 23:21:02.45 INFO ==> Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 23:21:02.45 INFO ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 23:21:02.46 INFO ==>
postgresql-repmgr 23:21:02.48 INFO ==> ** Starting PostgreSQL with Replication Manager setup **
postgresql-repmgr 23:21:02.51 INFO ==> Validating settings in REPMGR_* env vars...
postgresql-repmgr 23:21:02.51 INFO ==> Validating settings in POSTGRESQL_* env vars..
postgresql-repmgr 23:21:02.52 INFO ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 23:21:03.20 WARN ==> Conflict of pretending primary role nodes (previously: 'postgres-postgresql-ha-postgresql-1.postgres-postgresql-ha-postgresql-headless.postgres.svc.cluster.local:5432', now: 'postgres-postgresql-ha-postgresql-2.postgres-postgresql-ha-postgresql-headless.postgres.svc.cluster.local:5432')
postgresql-repmgr 23:21:03.20 INFO ==> This node was acting as a primary before restart!
postgresql-repmgr 23:21:03.20 INFO ==> Can not find new primary. Starting PostgreSQL normally...
postgresql-repmgr 23:21:03.21 INFO ==> There are no nodes with primary role. Assuming the primary role...
postgresql-repmgr 23:21:03.22 INFO ==> Preparing PostgreSQL configuration...
postgresql-repmgr 23:21:03.22 INFO ==> postgresql.conf file not detected. Generating it...
postgresql-repmgr 23:21:03.34 INFO ==> Preparing repmgr configuration...
postgresql-repmgr 23:21:03.35 INFO ==> Initializing Repmgr...
postgresql-repmgr 23:21:03.36 INFO ==> Initializing PostgreSQL database...
postgresql-repmgr 23:21:03.37 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql-repmgr 23:21:03.37 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql-repmgr 23:21:03.40 INFO ==> Deploying PostgreSQL with persisted data...
postgresql-repmgr 23:21:03.42 INFO ==> Configuring replication parameters
postgresql-repmgr 23:21:03.45 INFO ==> Configuring fsync
postgresql-repmgr 23:21:03.46 INFO ==> ** PostgreSQL with Replication Manager setup finished! **
postgresql-repmgr 23:21:03.49 INFO ==> Starting PostgreSQL in background...
waiting for server to start....2023-12-18 23:21:03.739 GMT [179] LOG: pgaudit extension initialized
2023-12-18 23:21:03.806 GMT [179] LOG: redirecting log output to logging collector process
2023-12-18 23:21:03.806 GMT [179] HINT: Future log output will appear in directory "/opt/bitnami/postgresql/logs".
2023-12-18 23:21:03.806 GMT [179] LOG: starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-12-18 23:21:03.807 GMT [179] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-12-18 23:21:03.807 GMT [179] LOG: listening on IPv6 address "::", port 5432
2023-12-18 23:21:03.876 GMT [179] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-12-18 23:21:03.939 GMT [183] LOG: database system was shut down at 2023-12-18 21:38:19 GMT
2023-12-18 23:21:04.200 GMT [179] LOG: database system is ready to accept connections
done
server started
postgresql-repmgr 23:21:04.27 INFO ==> ** Starting repmgrd **
[2023-12-18 23:21:04] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-12-18 23:21:04] [NOTICE] starting monitoring of node "postgres-postgresql-ha-postgresql-0" (ID: 1000)
[2023-12-18 23:21:04] [NOTICE] monitoring cluster primary "postgres-postgresql-ha-postgresql-0" (ID: 1000)
2nd pod logs
postgresql-repmgr 21:28:06.18 INFO ==> Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 21:28:06.18 INFO ==> Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 21:28:06.18 INFO ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 21:28:06.18 INFO ==>
postgresql-repmgr 21:28:06.21 INFO ==> ** Starting PostgreSQL with Replication Manager setup **
postgresql-repmgr 21:28:06.23 INFO ==> Validating settings in REPMGR_* env vars...
postgresql-repmgr 21:28:06.24 INFO ==> Validating settings in POSTGRESQL_* env vars..
postgresql-repmgr 21:28:06.24 INFO ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 21:28:06.37 WARN ==> Conflict of pretending primary role nodes (previously: 'postgres-postgresql-ha-postgresql-0.postgres-postgresql-ha-postgresql-headless.postgres.svc.cluster.local:5432', now: 'postgres-postgresql-ha-postgresql-2.postgres-postgresql-ha-postgresql-headless.postgres.svc.cluster.local:5432')
postgresql-repmgr 21:28:06.37 INFO ==> This node was acting as a primary before restart!
postgresql-repmgr 21:28:06.37 INFO ==> Can not find new primary. Starting PostgreSQL normally...
postgresql-repmgr 21:28:06.38 INFO ==> There are no nodes with primary role. Assuming the primary role...
postgresql-repmgr 21:28:06.38 INFO ==> Preparing PostgreSQL configuration...
postgresql-repmgr 21:28:06.39 INFO ==> postgresql.conf file not detected. Generating it...
postgresql-repmgr 21:28:06.51 INFO ==> Preparing repmgr configuration...
postgresql-repmgr 21:28:06.52 INFO ==> Initializing Repmgr...
postgresql-repmgr 21:28:06.53 INFO ==> Initializing PostgreSQL database...
postgresql-repmgr 21:28:06.54 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql-repmgr 21:28:06.54 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql-repmgr 21:28:06.57 INFO ==> Deploying PostgreSQL with persisted data...
postgresql-repmgr 21:28:06.59 INFO ==> Configuring replication parameters
postgresql-repmgr 21:28:06.62 INFO ==> Configuring fsync
postgresql-repmgr 21:28:06.63 INFO ==> ** PostgreSQL with Replication Manager setup finished! **
postgresql-repmgr 21:28:06.67 INFO ==> Starting PostgreSQL in background...
waiting for server to start....2023-12-18 21:28:06.861 GMT [179] LOG: pgaudit extension initialized
2023-12-18 21:28:07.101 GMT [179] LOG: redirecting log output to logging collector process
2023-12-18 21:28:07.101 GMT [179] HINT: Future log output will appear in directory "/opt/bitnami/postgresql/logs".
2023-12-18 21:28:07.101 GMT [179] LOG: starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-12-18 21:28:07.102 GMT [179] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-12-18 21:28:07.102 GMT [179] LOG: listening on IPv6 address "::", port 5432
2023-12-18 21:28:07.146 GMT [179] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-12-18 21:28:07.233 GMT [183] LOG: database system was shut down at 2023-12-18 21:12:36 GMT
2023-12-18 21:28:07.369 GMT [179] LOG: database system is ready to accept connections
done
server started
postgresql-repmgr 21:28:07.42 INFO ==> ** Starting repmgrd **
[2023-12-18 21:28:07] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-12-18 21:28:07] [NOTICE] starting monitoring of node "postgres-postgresql-ha-postgresql-1" (ID: 1001)
[2023-12-18 21:28:07] [NOTICE] monitoring cluster primary "postgres-postgresql-ha-postgresql-1" (ID: 1001)
3rd pod logs:
postgresql-repmgr 03:27:30.38 INFO ==> Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 03:27:30.39 INFO ==> Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 03:27:30.39 INFO ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 03:27:30.39 INFO ==>
postgresql-repmgr 03:27:30.42 INFO ==> ** Starting PostgreSQL with Replication Manager setup **
postgresql-repmgr 03:27:30.46 INFO ==> Validating settings in REPMGR_* env vars...
postgresql-repmgr 03:27:30.47 INFO ==> Validating settings in POSTGRESQL_* env vars..
postgresql-repmgr 03:27:30.48 INFO ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 03:27:30.60 WARN ==> Conflict of pretending primary role nodes (previously: 'postgres-postgresql-ha-postgresql-0.postgres-postgresql-ha-postgresql-headless.postgres.svc.cluster.local:5432', now: 'postgres-postgresql-ha-postgresql-1.postgres-postgresql-ha-postgresql-headless.postgres.svc.cluster.local:5432')
postgresql-repmgr 03:27:30.60 INFO ==> This node was acting as a primary before restart!
postgresql-repmgr 03:27:30.60 INFO ==> Can not find new primary. Starting PostgreSQL normally...
postgresql-repmgr 03:27:30.61 INFO ==> There are no nodes with primary role. Assuming the primary role...
postgresql-repmgr 03:27:30.62 INFO ==> Preparing PostgreSQL configuration...
postgresql-repmgr 03:27:30.62 INFO ==> postgresql.conf file not detected. Generating it...
postgresql-repmgr 03:27:30.79 INFO ==> Preparing repmgr configuration...
postgresql-repmgr 03:27:30.82 INFO ==> Initializing Repmgr...
postgresql-repmgr 03:27:30.83 INFO ==> Initializing PostgreSQL database...
postgresql-repmgr 03:27:30.84 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql-repmgr 03:27:30.85 INFO ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql-repmgr 03:27:30.89 INFO ==> Deploying PostgreSQL with persisted data...
postgresql-repmgr 03:27:30.92 INFO ==> Configuring replication parameters
postgresql-repmgr 03:27:30.96 INFO ==> Configuring fsync
postgresql-repmgr 03:27:30.98 INFO ==> ** PostgreSQL with Replication Manager setup finished! **
postgresql-repmgr 03:27:31.02 INFO ==> Starting PostgreSQL in background...
waiting for server to start....2023-12-19 03:27:31.154 GMT [173] LOG: pgaudit extension initialized
2023-12-19 03:27:31.203 GMT [173] LOG: redirecting log output to logging collector process
2023-12-19 03:27:31.203 GMT [173] HINT: Future log output will appear in directory "/opt/bitnami/postgresql/logs".
2023-12-19 03:27:31.203 GMT [173] LOG: starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-12-19 03:27:31.205 GMT [173] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-12-19 03:27:31.205 GMT [173] LOG: listening on IPv6 address "::", port 5432
2023-12-19 03:27:31.240 GMT [173] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-12-19 03:27:31.286 GMT [177] LOG: database system was shut down at 2023-12-19 03:27:15 GMT
2023-12-19 03:27:31.378 GMT [173] LOG: database system is ready to accept connections
done
server started
postgresql-repmgr 03:27:31.46 INFO ==> ** Starting repmgrd **
[2023-12-19 03:27:31] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-12-19 03:27:31] [NOTICE] starting monitoring of node "postgres-postgresql-ha-postgresql-2" (ID: 1002)
[2023-12-19 03:27:31] [NOTICE] monitoring cluster primary "postgres-postgresql-ha-postgresql-2" (ID: 1002)
Have you tried solving this with a custom livenessProbe? Something along the lines of
repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf node check --downstream | grep -E -v "CRITICAL.*((1 of 1)|(2 of 2))"
Of course, this will have to be adjusted according to the number of pods in your StatefulSet.
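For anyone trying this: a hedged sketch of how it could be wired into the chart, assuming the chart exposes a postgresql.customLivenessProbe value (most Bitnami charts do; verify against the chart's values.yaml before use). The probe body just mirrors the command above; the thresholds are arbitrary.

postgresql:
  customLivenessProbe:
    exec:
      command:
        - /bin/bash
        - -ec
        - |
          # Fail the probe when repmgr reports a CRITICAL downstream count
          # (the pattern must match your replica count, as noted above)
          repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf node check --downstream \
            | grep -E -v 'CRITICAL.*((1 of 1)|(2 of 2))'
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 5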
Changing liveness probes was also the suggestion in the previous issue that was opened. But changing the liveness probe doesn't change the behaviour of the postgres cluster electing two leaders and ending up in split brain...
The original issue (#10821) spawned #21614 which got closed in favor of this issue.
I'm not sure why, but in #10821 they suggested opening a new issue instead of continuing to follow up on it. I recommend reading the discussion on #10821 and coming back after having that context. The last ~10 comments on it have some very insightful information. I'd say those comments lay some good groundwork as a place to start investigation.
At our company we have completely stopped using Postgres due to this issue.
I have noticed an interesting phenomenon. When a StatefulSet pod restarts after it has already been created, instead of a log saying that it will look for a primary and retry 10 times with a 6 second interval, it says something along the lines of 'This node was acting as a primary before restart!'
We also had problems with split brain, even with a witness node.
After a while we saw some useful log messages, like:
- FATAL: remaining connection slots are reserved for non-replication superuser connections
- FATAL: sorry, too many clients already
- LOG: pool_read_kind: error message from 2 th backend:remaining connection slots are reserved for non-replication superuser connections
- ERROR: unable to read message kind
It seems we had a low max_connections limit in Postgres and SQL connections from the apps used them all, so the health check / pgpool was also unable to communicate with the postgres pods. If you log in to Postgres and check the actual connections, you may see which apps are consuming them all (SELECT * FROM pg_stat_activity;); a sketch follows below.
Otherwise it seems we also had some limits set too low on the host machines (e.g. open files, fs.inotify.max_user_instances, fs.inotify.max_user_watches...), but I'm not sure whether they are directly related to the split-brain problems.
Edit: later it also failed with increased connection limits.
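In case it helps others hitting the connection-exhaustion variant, the checks above amount to standard catalog queries; a sketch (the pod name is a placeholder, and POSTGRESQL_PASSWORD being present in the pod env is an assumption about the chart's default setup):

# Who is consuming the connection slots?
kubectl exec -it postgres-postgresql-ha-postgresql-0 -- bash -c \
  'PGPASSWORD="$POSTGRESQL_PASSWORD" psql -U postgres -c \
   "SELECT datname, usename, count(*) FROM pg_stat_activity GROUP BY datname, usename ORDER BY count(*) DESC;"'

# Compare against the configured limit
kubectl exec -it postgres-postgresql-ha-postgresql-0 -- bash -c \
  'PGPASSWORD="$POSTGRESQL_PASSWORD" psql -U postgres -c "SHOW max_connections;"'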
Hitting the same issue. When the primary goes for a restart, a current standby assumes the primary role; when the old primary comes up, it assumes the primary role as well.
Before
ID | Name | Role | Status | Upstream | Location | Prio. | TLI
------+------------------------------------+---------+-----------+------------------------------------+----------+-------+-----
1000 | metrics-postgresql-ha-postgresql-0 | primary | * running | | default | 100 | 3
1001 | metrics-postgresql-ha-postgresql-1 | standby | running | metrics-postgresql-ha-postgresql-0 | default | 100 | 3
1002 | metrics-postgresql-ha-postgresql-2 | standby | running | metrics-postgresql-ha-postgresql-0 | default | 100 | 3
After the primary node (pod) restarted
ID | Name | Role | Status | Upstream | Location | Prio. | TLI
------+------------------------------------+---------+----------------------+--------------------------------------+----------+-------+-----
1000 | metrics-postgresql-ha-postgresql-0 | primary | * running | | default | 100 | 3
1001 | metrics-postgresql-ha-postgresql-1 | standby | ! running as primary | | default | 100 | 3
1002 | metrics-postgresql-ha-postgresql-2 | standby | running | ! metrics-postgresql-ha-postgresql-1 | default | 100 | 3
WARNING: following issues were detected
- node "metrics-postgresql-ha-postgresql-1" (ID: 1001) is registered as standby but running as primary
- node "metrics-postgresql-ha-postgresql-2" (ID: 1002) reports a different upstream (reported: "metrics-postgresql-ha-postgresql-1", expected "metrics-postgresql-ha-postgresql-0")
Some of the relevant log snippets from metrics-postgresql-ha-postgresql-1 (the promoted standby):
2024-03-26 13:09:34.172 GMT [234] LOG: replication terminated by primary server
2024-03-26 13:09:34.172 GMT [234] DETAIL: End of WAL reached on timeline 3 at 164/250000A0.
2024-03-26 13:09:34.172 GMT [234] FATAL: could not send end-of-streaming message to primary: SSL connection has been closed unexpectedly
no COPY in progress
2024-03-26 13:09:34.173 GMT [230] LOG: invalid record length at 164/250000A0: wanted 24, got 0
2024-03-26 13:09:34.350 GMT [714] FATAL: could not connect to the primary server: connection to server at "metrics-postgresql-ha-postgresql-0.metrics-postgresql-ha-postgresql-headless.nsxi-platform.svc.cluster.local" (192.168.3.30), port 5432 failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[2024-03-26 13:09:58] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-03-26 13:09:58] [WARNING] unable to reconnect to node "metrics-postgresql-ha-postgresql-0" (ID: 1000) after 5 attempts
[2024-03-26 13:09:58] [NOTICE] promotion candidate is "metrics-postgresql-ha-postgresql-1" (ID: 1001)
[2024-03-26 13:09:58] [NOTICE] this node is the winner, will now promote itself and inform other nodes
2024-03-26 13:09:58.498 GMT [230] LOG: received promote request
2024-03-26 13:09:58.498 GMT [729] FATAL: terminating walreceiver process due to administrator command
2024-03-26 13:09:58.499 GMT [230] LOG: redo done at 164/25000028
2024-03-26 13:09:58.499 GMT [230] LOG: last completed transaction was at log time 2024-03-26 13:09:31.439402+00
2024-03-26 13:09:59.119 GMT [230] LOG: selected new timeline ID: 4
2024-03-26 13:09:59.167 GMT [230] LOG: archive recovery complete
2024-03-26 13:09:59.201 GMT [228] LOG: database system is ready to accept connections
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: standby promoted to primary after 1 second(s)
DEBUG: setting node 1001 as primary and marking existing primary as failed
DEBUG: begin_transaction()
DEBUG: commit_transaction()
NOTICE: STANDBY PROMOTE successful
DETAIL: server "metrics-postgresql-ha-postgresql-1" (ID: 1001) was successfully promoted to primary
Relevant logs from metrics-postgresql-ha-postgresql-0 (the old primary) after the restart
postgresql-repmgr 13:09:41.34 DEBUG ==> Checking node 'metrics-postgresql-ha-postgresql-1.metrics-postgresql-ha-postgresql-headless.nsxi-platform.svc.cluster.local:5432'...
postgresql-repmgr 13:09:44.48 DEBUG ==> Pretending primary role node - 'metrics-postgresql-ha-postgresql-0.metrics-postgresql-ha-postgresql-headless.nsxi-platform.svc.cluster.local:5432'
postgresql-repmgr 13:09:44.48 DEBUG ==> Pretending primary set to 'metrics-postgresql-ha-postgresql-0.metrics-postgresql-ha-postgresql-headless.nsxi-platform.svc.cluster.local:5432'!
postgresql-repmgr 13:09:44.49 INFO ==> Auto-detected primary node: 'metrics-postgresql-ha-postgresql-0.metrics-postgresql-ha-postgresql-headless.nsxi-platform.svc.cluster.local:5432'
postgresql-repmgr 13:09:44.49 INFO ==> This node was acting as a primary before restart!
postgresql-repmgr 13:09:44.49 INFO ==> Can not find new primary. Starting PostgreSQL normally...
postgresql-repmgr 13:09:44.50 INFO ==> There are no nodes with primary role. Assuming the primary role...
postgresql-repmgr 13:09:44.83 INFO ==> ** PostgreSQL with Replication Manager setup finished! **
postgresql-repmgr 13:09:44.85 INFO ==> Starting PostgreSQL in background...
waiting for server to start....2024-03-26 13:09:45.029 GMT [166] LOG: redirecting log output to logging collector process
2024-03-26 13:09:45.029 GMT [166] HINT: Future log output will appear in directory "/opt/build/postgresql/logs".
2024-03-26 13:09:45.029 GMT [166] LOG: starting PostgreSQL 13.11 (Debian 13.11-0+deb11u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2024-03-26 13:09:45.029 GMT [166] LOG: listening on IPv4 address "0.0.0.0", port 5432
2024-03-26 13:09:45.029 GMT [166] LOG: listening on IPv6 address "::", port 5432
2024-03-26 13:09:45.040 GMT [166] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-03-26 13:09:45.061 GMT [168] LOG: database system was shut down at 2024-03-26 13:09:34 GMT
2024-03-26 13:09:45.091 GMT [166] LOG: database system is ready to accept connections
done
server started
postgresql-repmgr 13:09:45.19 INFO ==> ** Starting repmgrd **
[2024-03-26 13:09:45] [NOTICE] repmgrd (repmgrd 5.2.0) starting up
INFO: set_repmgrd_pid(): provided pidfile is /opt/build/repmgr/tmp/repmgr.pid
[2024-03-26 13:09:45] [NOTICE] starting monitoring of node "metrics-postgresql-ha-postgresql-0" (ID: 1000)
[2024-03-26 13:09:45] [NOTICE] monitoring cluster primary "metrics-postgresql-ha-postgresql-0" (ID: 1000)
The primary went for a restart at around the 13:09:34 mark. The standby detected this and made 5 attempts to connect to the primary; around 13:09:58, after the attempts were exhausted, a new promotion candidate was identified.
Based on the primary logs, it was up and running at around the 13:09:45 mark. Ideally the standby should have detected that the primary was up and running again, and the promotion at 13:09:58 should not have happened.
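The window the standby waits before promoting is governed by repmgr's reconnect_attempts and reconnect_interval settings; widening it gives a briefly-restarting primary more time to come back before a candidate is promoted. A hedged values sketch, assuming the bitnami/postgresql-repmgr image maps the REPMGR_RECONNECT_* env vars onto those repmgr.conf settings (verify against the container docs before relying on it):

postgresql:
  extraEnvVars:
    # Assumption: these map to reconnect_attempts / reconnect_interval in repmgr.conf
    - name: REPMGR_RECONNECT_ATTEMPTS
      value: "10"
    - name: REPMGR_RECONNECT_INTERVAL
      value: "10"

Of course this only widens the window: a primary that takes longer than attempts x interval to restart will still trigger the promotion.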
How can this chart even be named "HA" if it doesn't fulfil its basic role? My cluster is broken now; it has THREE primaries out of THREE replicas. Why are you leading people down the wrong path? Please don't name it HA then! Better to use a single node than such "HA"!
@digitalstudium I'd recommend ditching this as a way to do HA Postgres and looking into operators. There are a couple of operators for Postgres and they are miles ahead of this chart in quality and capability.
@seang96 Thanks a lot for the advice! Will try.
I figured out a way to consistently reproduce the issue. Basically, it happens when a node experiences a critical failure (such as an OOM kill, a failure of the pre-stop script, a node shutdown, or losing network connectivity) and the pre-stop script is not executed/finished.
To synthetically reproduce it I use kubectl delete pod --grace-period 0 --force. Here are the full instructions:
- Install the chart with the default configuration:
helm install pg oci://registry-1.docker.io/bitnamicharts/postgresql-ha
- Wait until all nodes are ready and connected. Use
kubectl exec -it pg-postgresql-ha-postgresql-1 -- /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
to check the cluster status
- Force delete the primary node:
kubectl delete pod --grace-period 0 --force pg-postgresql-ha-postgresql-0
- Wait until the pod is recreated
Check replication status on different replicas:
pg-postgresql-ha-postgresql-0:
$ kubectl exec -it pg-postgresql-ha-postgresql-0 -- /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
------+-------------------------------+---------+----------------------+---------------------------------+----------+----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000 | pg-postgresql-ha-postgresql-0 | primary | * running | | default | 100 | 1 | user=repmgr password=JT5Ul0ddsR host=pg-postgresql-ha-postgresql-0.pg-postgresql-ha-postgresql-headless.postgres.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
1001 | pg-postgresql-ha-postgresql-1 | standby | ! running as primary | | default | 100 | 2 | user=repmgr password=JT5Ul0ddsR host=pg-postgresql-ha-postgresql-1.pg-postgresql-ha-postgresql-headless.postgres.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
1002 | pg-postgresql-ha-postgresql-2 | standby | running | ! pg-postgresql-ha-postgresql-1 | default | 100 | 1 | user=repmgr password=JT5Ul0ddsR host=pg-postgresql-ha-postgresql-2.pg-postgresql-ha-postgresql-headless.postgres.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
WARNING: following issues were detected
- node "pg-postgresql-ha-postgresql-1" (ID: 1001) is registered as standby but running as primary
- node "pg-postgresql-ha-postgresql-2" (ID: 1002) reports a different upstream (reported: "pg-postgresql-ha-postgresql-1", expected "pg-postgresql-ha-postgresql-0")
pg-postgresql-ha-postgresql-1:
$ kubectl exec -it pg-postgresql-ha-postgresql-1 -- /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
------+-------------------------------+---------+-----------+-------------------------------+----------+----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000 | pg-postgresql-ha-postgresql-0 | primary | ! running | | default | 100 | 1 | user=repmgr password=JT5Ul0ddsR host=pg-postgresql-ha-postgresql-0.pg-postgresql-ha-postgresql-headless.postgres.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
1001 | pg-postgresql-ha-postgresql-1 | primary | * running | | default | 100 | 2 | user=repmgr password=JT5Ul0ddsR host=pg-postgresql-ha-postgresql-1.pg-postgresql-ha-postgresql-headless.postgres.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
1002 | pg-postgresql-ha-postgresql-2 | standby | running | pg-postgresql-ha-postgresql-1 | default | 100 | 2 | user=repmgr password=JT5Ul0ddsR host=pg-postgresql-ha-postgresql-2.pg-postgresql-ha-postgresql-headless.postgres.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
WARNING: following issues were detected
- node "pg-postgresql-ha-postgresql-0" (ID: 1000) is running but the repmgr node record is inactive
pg-postgresql-ha-postgresql-2:
$ kubectl exec -it pg-postgresql-ha-postgresql-2 -- /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
------+-------------------------------+---------+-----------+-------------------------------+----------+----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000 | pg-postgresql-ha-postgresql-0 | primary | ! running | | default | 100 | 1 | user=repmgr password=JT5Ul0ddsR host=pg-postgresql-ha-postgresql-0.pg-postgresql-ha-postgresql-headless.postgres.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
1001 | pg-postgresql-ha-postgresql-1 | primary | * running | | default | 100 | 2 | user=repmgr password=JT5Ul0ddsR host=pg-postgresql-ha-postgresql-1.pg-postgresql-ha-postgresql-headless.postgres.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
1002 | pg-postgresql-ha-postgresql-2 | standby | running | pg-postgresql-ha-postgresql-1 | default | 100 | 1 | user=repmgr password=JT5Ul0ddsR host=pg-postgresql-ha-postgresql-2.pg-postgresql-ha-postgresql-headless.postgres.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
WARNING: following issues were detected
- node "pg-postgresql-ha-postgresql-0" (ID: 1000) is running but the repmgr node record is inactive
I used a kind cluster with 1 master node and 3 worker nodes but I think it will work regardless of cluster configuration
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
@joancafom what is the state of this issue? As you can see, the problem can be reproduced by several users.
With the current bug, bitnami/postgresql-ha is not usable in my opinion, because every day we run into the same issue that all postgres instances end up in the "primary" state. This in turn leads to data inconsistencies. We created a custom Pod which recognises the multiple-primary situation; as soon as more than one instance becomes primary, we restart the Postgres StatefulSet (roughly as sketched below).
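The watchdog does something along these lines; a naive sketch only. The namespace, release name, replica count and the POSTGRESQL_PASSWORD env var are placeholders/assumptions about our setup, while pg_is_in_recovery() is standard PostgreSQL.

#!/usr/bin/env bash
# Naive split-brain watchdog: ask every postgresql pod whether it is in
# recovery; more than one "false" answer means more than one primary.
NAMESPACE="postgres"
RELEASE="postgres-postgresql-ha-postgresql"
REPLICAS=3

while true; do
  primaries=0
  for i in $(seq 0 $((REPLICAS - 1))); do
    in_recovery=$(kubectl exec -n "$NAMESPACE" "${RELEASE}-${i}" -- bash -c \
      'PGPASSWORD="$POSTGRESQL_PASSWORD" psql -U postgres -tA -c "SELECT pg_is_in_recovery();"' 2>/dev/null)
    [ "$in_recovery" = "f" ] && primaries=$((primaries + 1))
  done
  if [ "$primaries" -gt 1 ]; then
    echo "$(date -u) detected ${primaries} primaries, restarting ${RELEASE}"
    kubectl rollout restart statefulset "$RELEASE" -n "$NAMESPACE"
    sleep 600   # give the cluster time to settle before checking again
  fi
  sleep 60
done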
@husira I'd mirror the sentiments of @seang96's comment above. Use a Postgres Kubernetes operator like Cloud Native PG or Zalando.
@QuinnBast do you have any experience with Cloud Native PG or Zalando? Which one do you recommend?
@QuinnBast thank you for your comment. We already use the Zalando Postgres Operator in some projects.
The reason we switched to the Bitnami postgresql-ha chart for some applications was that we didn't want operators, for simplicity's sake. We have a full GitOps approach and didn't want an operator which creates additional resources outside of our control. Bitnami postgresql-ha seemed to be the only solution we found. If the whole thing were stable we would be satisfied, but the number of bugs makes us doubt the quality of this solution.
@DorBreger The Zalando Postgres operator is good and runs extremely stably. It was certainly one of the first open-source Postgres operator projects; due to that history, this is noticeable in comparison with more "modern" or cloud-native operators such as CloudNativePG or StackGres.
There was a very interesting talk during KubeCon 2024 where different operators were compared: https://www.youtube.com/watch?v=l33pcnQ4cUQ
In the near future we will certainly look at more modern (respectively, more cloud-native) solutions like CloudNativePG or StackGres. There is also a SIG (special interest group) which focuses on DoK (Data on Kubernetes): https://dok.community/landscape/
If you are looking for Postgres operators, you will find solutions like CloudNativePG, StackGres or Percona on the landscape. Interestingly, the Zalando Postgres operator is not listed there.
I came across this post when trying to fix a split-brain issue. After lots of struggle, it turned out the easiest fix for the chicken-and-egg problem with the pods was to remove the liveness and readiness probes from the StatefulSet, which allows both pods to boot up and then negotiate the primary/secondary roles without k8s getting in the way.
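For reference, with this chart that should be achievable from values alone, since the probe blocks accept enabled: false (the commented-out line in the values pasted earlier in this thread hints at the same):

postgresql:
  livenessProbe:
    enabled: false
  readinessProbe:
    enabled: false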
the same trouble with me
postgresql-repmgr 12:04:25.80 INFO ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 12:04:25.81 DEBUG ==> Checking node '192.169.111.148:5432'...
postgresql-repmgr 12:04:25.87 DEBUG ==> Pretending primary role node - '192.169.111.147:5432'
postgresql-repmgr 12:04:25.87 DEBUG ==> Pretending primary set to '192.169.111.147:5432'!
postgresql-repmgr 12:04:25.88 DEBUG ==> Checking node '192.169.111.149:5432'...
postgresql-repmgr 12:04:25.95 DEBUG ==> Pretending primary role node - '192.169.111.149:5432'
postgresql-repmgr 12:04:25.95 WARN ==> Conflict of pretending primary role nodes (previously: '192.169.111.147:5432', now: '192.169.111.149:5432')
postgresql-repmgr 12:04:25.95 INFO ==> There are no nodes with primary role. Assuming the primary role...
I came across this post when trying to fix a split-brain issue. After lots of struggle, it turned out the easiest thing to do to fix the chicken-egg issue with the pods was to remove the liveness and readiness probes from the StatefulSet which allows both pods to boot up and then negotiate the primary/secondary without k8s getting in the way.
In my k8s cluster there are no probes on the pods, and split brain still happens.
I believe one of the reasons for this issue is that after the current master node fails, this information is not detected by other nodes. This leads to inconsistent data being recorded in repmgr.nodes, which in turn causes a split-brain scenario. Additionally, when the failed node recovers, it does not rejoin the cluster but instead becomes isolated.
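One way to see that inconsistency is to dump repmgr's own metadata table from each node and compare; after a split brain the active and upstream_node_id columns typically disagree between nodes. The pod names are taken from the reproduction above, and REPMGR_PASSWORD being set in the pod env is an assumption about the default chart setup.

for i in 0 1 2; do
  echo "--- pg-postgresql-ha-postgresql-${i} ---"
  kubectl exec "pg-postgresql-ha-postgresql-${i}" -- bash -c \
    'PGPASSWORD="$REPMGR_PASSWORD" psql -U repmgr -d repmgr -c \
     "SELECT node_id, node_name, type, active, upstream_node_id FROM repmgr.nodes ORDER BY node_id;"'
done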
Running the image with straight docker run (no k8s) shows that pg_rewind is fundamentally broken. There has been A LOT of scripting work put into this project, which I respect, but there is something fundamentally wrong. I've run it in production for over a year, and any time I have to make a change I cringe. The fact that I can't change the parameters on the slave and fail over to it with the new settings, like I could with AWS RDS, is troublesome.
I probably found a way to mitigate the split-brain issue, which is to set a delay before start:
postgresql:
  extraInitContainers:
    - name: sleep
      image: busybox
      command: ['sleep', '15']
So when another node is newly elected as primary, the old primary node has a better chance of learning about it after its restart. (Previously, the old primary node might restart before knowing that another node had already been elected as primary.)
I also noticed from the source code that if there are already multiple primary nodes running, the restarted old primary simply ignores this abnormal situation, believes there is no primary node at the moment, and still sets itself up as primary "normally" without any remedy, which is weird. Currently we may have to handle this situation manually, usually by deleting the wrong primary pods (roughly as sketched below).
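For what it's worth, the manual clean-up usually looks roughly like this sketch. It assumes the stale primary's data volume can be discarded and re-cloned from the surviving primary, so take a backup first; the pod and PVC names are placeholders following the usual data-<pod> naming (check with kubectl get pvc).

# 1. Decide which node is the legitimate primary (the one the standbys follow)
kubectl exec -it pg-postgresql-ha-postgresql-1 -- \
  /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh \
  repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show

# 2. Throw away the stale primary's data volume so it re-clones as a standby
#    on the next start (destructive; only if that data is disposable)
kubectl delete pvc data-pg-postgresql-ha-postgresql-0   # stays Terminating until the pod is gone
kubectl delete pod pg-postgresql-ha-postgresql-0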
Just caught the same situation in my cluster, which holds a ~20 GB Zabbix database:
postgresql-repmgr 20:51:51.53 INFO ==>
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
------+-----------------------------------------------+---------+----------------------+-----------------------------------------+----------+----------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000 | postgresql-postgresql-ha-postgresql-0 | primary | * running | | default | 100 | 3 | user=repmgr password=s6npv5buytk27edm host=postgresql-postgresql-ha-postgresql-0.postgresql-postgresql-ha-postgresql-headless.zabbix.svc.cti.local dbname=repmgr port=5432 connect_timeout=5
1001 | postgresql-postgresql-ha-postgresql-1 | standby | running | ! postgresql-postgresql-ha-postgresql-2 | default | 100 | 4 | user=repmgr password=s6npv5buytk27edm host=postgresql-postgresql-ha-postgresql-1.postgresql-postgresql-ha-postgresql-headless.zabbix.svc.cti.local dbname=repmgr port=5432 connect_timeout=5
1002 | postgresql-postgresql-ha-postgresql-2 | standby | ! running as primary | | default | 100 | 4 | user=repmgr password=s6npv5buytk27edm host=postgresql-postgresql-ha-postgresql-2.postgresql-postgresql-ha-postgresql-headless.zabbix.svc.cti.local dbname=repmgr port=5432 connect_timeout=5
2000 | postgresql-postgresql-ha-postgresql-witness-0 | witness | * running | ! postgresql-postgresql-ha-postgresql-2 | default | 0 | n/a | user=repmgr password=s6npv5buytk27edm host=postgresql-postgresql-ha-postgresql-witness-0.postgresql-postgresql-ha-postgresql-witness.zabbix.svc.cti.local dbname=repmgr port=5432 connect_timeout=5
WARNING: following issues were detected
- node "postgresql-postgresql-ha-postgresql-1" (ID: 1001) reports a different upstream (reported: "postgresql-postgresql-ha-postgresql-2", expected "postgresql-postgresql-ha-postgresql-0")
- node "postgresql-postgresql-ha-postgresql-2" (ID: 1002) is registered as standby but running as primary
- node "postgresql-postgresql-ha-postgresql-witness-0" (ID: 2000) reports a different upstream (reported: "postgresql-postgresql-ha-postgresql-2", expected "postgresql-postgresql-ha-postgresql-0")
I don't have an issue with only one primary running and the other two standbys staying standbys. But in the application, a write query followed by a read can experience missing data. Is this a split-brain situation? I already have syncReplication turned on and the mode set to ANY.
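To confirm whether synchronous replication is actually in effect, you can check the primary directly with standard PostgreSQL views; the pod name is a placeholder and POSTGRESQL_PASSWORD being in the pod env is an assumption.

kubectl exec -it postgresql-postgresql-ha-postgresql-0 -- bash -c \
  'PGPASSWORD="$POSTGRESQL_PASSWORD" psql -U postgres \
     -c "SHOW synchronous_standby_names;" \
     -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"'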
I think it's a bad design, or a "feature" of how repmgr handles switchover/failover when the master node is not properly fenced out (how is that even achieved?). In case of an accidental restart or a network issue, it comes back and you have multiple masters/split brain with limited ways to tackle it: https://github.com/EnterpriseDB/repmgr/issues/617. I think basically the same happens not only with automatic but also with manual switchover. It feels like this can't be used reliably in an automated way; perhaps only in manually handled setups (and don't forget to fence the broken primary as the first thing :) ).