[bitnami/postgresql-ha] Postgresql-HA ends up with multiple primaries on node changes
Name and Version
bitnami/postgresql-ha
What steps will reproduce the bug?
Extremely hard to replicate, but we've seen it often enough now.
Essentially, the root cause is that we rotate our cluster nodes weekly: we add one new node via Cluster API, then cordon and drain an old node before removing it.
Occasionally, about once a month, one of our databases ends up in a state where two postgresql pods from the postgresql-ha chart believe they are primary. This usually resolves by picking a pod and restarting it, after which the postgresql pods re-nominate the primary.
Are you using any custom parameters or values?
No response
What is the expected behavior?
Would expect postgresql-ha to not get in this state. Totally understand there is a customXProbe value exposed, but was hoping we could land on a better default.
What do you see instead?
I believe altering the liveness probe, or adding a startup probe, to ensure that *-postgresql-0 is primary and *-postgresql-X is secondary would be best.
I believe this is related and could fix #8062 as well.
Couple of ways to implement this:
- psql -U postgres -c "SELECT pg_is_in_recovery()" is true for the primary and false for all others
- psql -U postgres -c "SELECT pg_stat_receiver" returns contents if primary and there are secondaries. "SELECT pg_stat_wal_receiver" would return content on all secondaries.
For context, we always check via method 2; next time it occurs I'll be sure to check via method 1 to ensure both are good ways to verify. A rough probe sketch based on these checks is below.
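As a rough sketch of the probe idea only (not the chart's actual probe; the pod-ordinal convention and the wiring below are assumptions for illustration), a custom liveness probe could flag a pod whose recovery state does not match its expected role:

postgresql:
  livenessProbe:
    enabled: false
  customLivenessProbe:
    exec:
      command:
        - bash
        - -c
        - |
          # Sketch: expect pod ordinal 0 to report pg_is_in_recovery() = f (primary)
          # and every other ordinal to report t (standby).
          in_recovery=$(PGPASSWORD="$POSTGRES_PASSWORD" psql -w -U "$POSTGRES_USER" -d "$POSTGRES_DB" \
            -h 127.0.0.1 -p 5432 -tA -c "SELECT pg_is_in_recovery();") || exit 1
          ordinal="${HOSTNAME##*-}"
          if [[ "$ordinal" == "0" ]]; then
            [[ "$in_recovery" == "f" ]] || exit 1
          else
            [[ "$in_recovery" == "t" ]] || exit 1
          fi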
Additional information
No response
Hi @ProNibs, thanks for reporting this! I have created an internal task for looking into this. It could be related to other issues like #2610, #5589 or others. Unfortunately I cannot give an ETA for when we are expecting to look into this issue, as we have a lot of things on our plate currently.
Just an update -- we saw this again in production yesterday. One of our engineers caught pgpool running a failover command, so I think we have nailed down the root issue: pgpool loses connection to the -0 node, runs the failover command and elects a new leader. However, when kubernetes spins the -0 node back up, it is hardcoded to assume it is the primary and thus the double-primary state begins.
I plan to spend some cycles this weekend to see if I can replicate this with a NetworkPolicy blocking traffic to the -0 node and watch pgpool perform the failover command and mess with some settings to disable it.
Update: Using the base chart with postgresql.pgHbaTrustAll=true, I was able to replicate the issue with a Network Policy preventing ingress traffic to postgresql-0.
Couple of interesting things:
After pgpool could no longer connect to postgresql-0, pgpool and postgresql-2 both restarted. postgresql-1 was then elected as primary and streaming to postgresql-2. I'm assuming the restart was for postgresql-2 to recognize postgresql-1 as primary. pgpool would also stay in a non-ready state for a long time while trying to hit postgresql-0.
Deleting all of the pods at once results in nobody being primary: postgresql-1 reverts to being a secondary and pgpool hangs and gets caught in an indefinite crash loop because it never has a primary. Re-introducing postgresql-0 does not result in reverting to a good state, and restarting pods doesn't seem to work either. This conflicts with what we've done in production when experiencing these issues, but that could be because we are still doing writes once in the double-primary mode.
Upping pgpool replicas from 1 to 2 makes things even weirder. postgresql-0 stays primary for quite a bit longer until eventually postgresql-2 is put into an Error state and both pgpool pods restart. postgresql-1 becomes the new primary once again. Re-introducing postgresql-0 and waiting about a minute or so eventually causes both pgpool pods to detect that postgresql-0 is back and restart themselves. The pgpools will then happily serve 2 primary nodes after their reboot:
2022-09-17 22:52:10.282: main pid 1: LOG: find_primary_node_repeatedly: waiting for finding a primary node
2022-09-17 22:52:10.331: main pid 1: LOG: find_primary_node: primary node is 0
2022-09-17 22:52:10.331: main pid 1: LOG: find_primary_node: primary node is 1
For 2 replicas in the pgpool deployment, only after re-introducing postgresql-0, restarting both pgpool pods, and then restarting postgresql-1 to postgresql-n do we get back to the proper state.
For 1 replica in the pgpool deployment, there does not seem to be a way to restore the primary/secondary linking without completely removing PVCs and starting everything from scratch. psql -h postgresql-ha-app-pgpool -U postgres -c "SHOW pool_nodes;" never reaches a condition where pg_role goes back to primary-standby-standby, even if role goes back to that state for 0/1/2 in that order.
NetworkPolicy I was applying/removing to test this out.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: postgresql-ha
spec:
  podSelector:
    matchLabels: # Pick just postgresql-0 to apply
      statefulset.kubernetes.io/pod-name: postgresql-ha-app-postgresql-0
  policyTypes:
    - Ingress
  # Nothing below means deny-all
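For completeness, toggling the simulated outage is just applying and deleting that policy (the manifest file name here is illustrative):

kubectl apply -n postgresql-ha -f test-network-policy.yaml
kubectl delete -n postgresql-ha networkpolicy test-network-policy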
After reviewing pgpool's docs on deploying to Kubernetes, it seems the default pgpool.conf file is incorrect for Kubernetes. If you review section 8.5.4, the pgpool.conf file should have the following:
backend_flag0 = 'ALWAYS_PRIMARY|DISALLOW_TO_FAILOVER'
backend_flag1 = 'DISALLOW_TO_FAILOVER'
backend_flag2 = 'DISALLOW_TO_FAILOVER'
Currently, those are all "ALLOW_TO_FAILOVER" across the board.
Ref: https://www.pgpool.net/docs/pgpool-II-4.2.7/en/html/example-kubernetes.html
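As a rough sketch only (not a verified fix), those flags could be fed in through the chart's pgpool.configuration value instead of rebuilding the image:

pgpool:
  # sketch: mirror the backend_flag settings from pgpool's Kubernetes example
  configuration: |-
    backend_flag0 = 'ALWAYS_PRIMARY|DISALLOW_TO_FAILOVER'
    backend_flag1 = 'DISALLOW_TO_FAILOVER'
    backend_flag2 = 'DISALLOW_TO_FAILOVER'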
I believe it may be more appropriate to update the upstream bitnami pgpool image itself (I think it's this file), but I understand it could be more nuanced than that 🤔
However, in those same docs, it mentions that these can be overwritten with environment variables. I tried both with the environment variable settings as well as a copy+paste of the pgpool.conf file and it seems postgresql-repmgr will still self-elect a new leader if pgpool does not do it. See below for the logs from postgresql-1 and postgresql-2.
NOTICE: promoting standby to primary
DETAIL: promoting server "postgresql-ha-app-postgresql-1" (ID: 1001) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
2022-09-18 01:53:16.715 GMT [227] LOG: received promote request
INFO: checking if node 1000 is primary
DEBUG: connecting to: "user=repmgr password=X connect_timeout=5 dbname=repmgr host=postgresql-ha-app-postgresql-0.postgresql-ha-app-postgresql-headless.postgresql-ha.svc.cluster.local port=5432 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
connection to server at "postgresql-ha-app-postgresql-0.postgresql-ha-app-postgresql-headless.postgresql-ha.svc.cluster.local" (100.96.4.135), port 5432 failed: timeout expired
DETAIL: attempted to connect using:
user=repmgr password=X connect_timeout=5 dbname=repmgr host=postgresql-ha-app-postgresql-0.postgresql-ha-app-postgresql-headless.postgresql-ha.svc.cluster.local port=5432 fallback_application_name=repmgr options=-csearch_path=
INFO: checking if node 1001 is primary
DEBUG: connecting to: "user=repmgr password=X connect_timeout=5 dbname=repmgr host=postgresql-ha-app-postgresql-1.postgresql-ha-app-postgresql-headless.postgresql-ha.svc.cluster.local port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: current primary node is 1001
INFO: connected to node 1001, checking for current primary
DEBUG: get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 1001
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: follow target is primary node "postgresql-ha-app-postgresql-1" (ID: 1001)
Now to look if there are environment variables I can set related to repmgr...
I have found the config in repmgr that is allowing this in /opt/bitnami/repmgr/conf/repmgr.conf. However, repmgr does not take environment variables as configuration.
I ended up having to mount a configMap where I change /opt/bitnami/scripts/librepmgr.sh to have failover='manual'.
librepmgr.sh is defined here.
Once doing that, postgresql-0 going down results in pgpool failing its readiness check, and postgresql-1 and above will restart and then sit in a waiting state for postgresql-0 to be up again, so we avoid ever hitting a double-primary state! A rough sketch of the configMap mount is below.
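For anyone wanting to reproduce this, a rough sketch of the mount (assuming the chart exposes postgresql.extraVolumes and postgresql.extraVolumeMounts; the ConfigMap name is illustrative and it would hold the full librepmgr.sh with failover='manual'):

postgresql:
  extraVolumes:
    - name: librepmgr-override
      configMap:
        name: librepmgr-override # illustrative ConfigMap containing the patched librepmgr.sh
  extraVolumeMounts:
    - name: librepmgr-override
      mountPath: /opt/bitnami/scripts/librepmgr.sh
      subPath: librepmgr.sh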
Thanks for sharing! It is useful to know that failover='manual' avoids hitting this issue for now, but my understanding is that the chart should be using failover='automatic'; however, that does not seem to be working properly.
Unfortunately there has not been any progress in the internal task yet, but we will keep you posted if that changes.
As mentioned in the pgpool docs themselves, I'd like to point this out:
Pgpool-II's health check, automated failover, watchdog and online recovery features aren't required on Kubernetes. You need to only enable load balancing and connection pooling.
I don't think that in Kubernetes we should be allowing any failovers in pgpool nor in repmgr. Having failover='automatic' in repmgr requires stopping postgres and running repmgr CLI commands, which is a non-starter in Kubernetes as that'd just kill the pod.
Changes need to be made to pgpool in addition to repmgr to ensure neither attempts to do failovers.
Ref: https://www.pgpool.net/docs/pgpool-II-4.2.7/en/html/example-kubernetes.html
I see. You seem to be right indeed. Would you be up to contributing this fix as PRs? It would be good if it also allowed configuring the failover parameter value, in case the users wanted to set a different option in the container.
Still a GitHub noob, didn't mean to have it closed here.
Now that it's merged, I can make an MR.
Got a few high-level questions before I go edit the postgresql-ha chart: Do we want new values specifically for enabling pgpool vs repmgr failover? Only pgpool failover? Failover in general with specifically just pgpool?
Additionally, should I up the chart version? I ask because when I was doing testing, should postgresql-0 go down, all the other postgresql pods will go down and wait for the return of postgresql-0 -- postgresql-ha becomes more like "High Availability reads" as opposed to "High Availability" for writes
We are fighting with this issue of having two masters in a bitnami/postgresql-ha deployment too. The following two scenarios lead to this state in our case:
- a temporary network hiccup which takes longer than postgresql.repmgrReconnectAttempts * postgresql.repmgrReconnectInterval
- an occasional master PG pod restart
I do not think @ProNibs' solution should be the default behavior of bitnami/postgresql-ha. Pinning the PG-0 pod as master with backend_flag0 = 'ALWAYS_PRIMARY|DISALLOW_TO_FAILOVER' as suggested has consequences: one loses high availability when PG-0 is down. It makes PG-0 a single point of failure.
The document https://www.pgpool.net/docs/pgpool-II-4.2.7/en/html/example-kubernetes.html refers to k8s operators I have no experience with. It does not mention repmgr. Repmgr is capable of executing failover automatically and does not require PG-0 to be master. To me, it looks like the author of that document did not consider repmgr or any tool with similar capabilities at all.
I hope it can be solved by proper configuration. I am currently testing the following approach to the problem:
- repmgrFenceOldPrimary: true plus POSTGRESQL_CLUSTER_APP_NAME = <PG node name> - to protect against network hiccups. It has its downsides:
  - works well with a 3-node cluster only
  - cannot be used with any feature which requires POSTGRESQL_CLUSTER_APP_NAME to be a list of all app names in the cluster
- introduce an init container for PG which waits postgresql.repmgrReconnectAttempts * postgresql.repmgrReconnectInterval - it prevents the PG pod start time from being similar to postgresql.repmgrReconnectAttempts * postgresql.repmgrReconnectInterval, which could be a potential race condition (I have no proof for it in the code), and it does not harm high availability because some standby is promoted to master in the meantime (see the init container sketch after this list)
- PgPool reconfiguration which changes:
  - liveness probe settings - to prevent PgPool pod restarts when any PG pod is not available
  - health check config - even a quick PG pod restart is detected by PgPool and the failed node is marked as down before the liveness probe restarts the PgPool container
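A minimal sketch of the init-container idea (postgresql.initContainers and the 15-second value are assumptions for illustration, not verified chart settings):

postgresql:
  initContainers:
    - name: wait-repmgr-reconnect-window
      image: busybox
      # roughly repmgrReconnectAttempts * repmgrReconnectInterval seconds
      command: ["sleep", "15"]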
Any feedback on the above config changes is welcome.
I plan to create GitHub issues on the PgPool setup and on improvements to the repmgrFenceOldPrimary configuration, where I will provide more details on why it is needed and what is not working without it.
Additionally, should I up the chart version? I ask because when I was doing testing, should postgresql-0 go down, all the other postgresql pods will go down and wait for the return of postgresql-0 -- postgresql-ha becomes more like "High Availability reads" as opposed to "High Availability" for writes
@ProNibs I have a concern with this. If this is the case then there's not much difference between the regular PostgreSQL chart and the PostgreSQL HA chart. We should find an alternative that allows multiple writes.
works well with 3 node cluster only
@pcerny Could you elaborate a bit more on this?
liveness probe settings - to prevent PgPool pod restarts when any PG pod is not available
I agree we should avoid restarting pgpool if the PostgreSQL nodes go down.
Let me elaborate on the statement "works well with 3 node cluster only" in relation to the repmgrFenceOldPrimary: true setting.
First, our motivation for covering network hiccups: we also have deployments in AWS, and some of the "cluster with two primaries" events were triggered by the primary node not being reachable by the standby nodes for something between 20 and 40 seconds.
I found two ways to protect against a network split in the repmgr documentation. One is a witness node, which is not supported by bitnami/postgresql-ha. Another is Monitoring standby disconnections on the primary node, which is enabled by repmgrFenceOldPrimary: true.
A 3-node cluster with the repmgr option child_nodes_connected_min_count=1 behaves well because:
- When the master node is isolated, repmgr figures out it has no standby nodes and executes a PG restart. Eventually, when the network is back again, the PG restart makes sure the node will connect as a standby to the new master node.
- When a standby node is isolated, nothing happens. The master with one standby runs and serves requests, and the isolated standby connects back once the network hiccup is over. That is ok from the PG and repmgr perspective. But it matters on which side of the network split the PgPool instances are, so it is possible that a network hiccup causes downtime anyway. The main benefit for us: once the network comes back, the whole PG-HA deployment cures itself automatically.
As I write this comment I realize I was wrong. It works well for an odd number of PG nodes. For example, a 5-node cluster needs child_nodes_connected_min_count=2, and the scenarios are the same as for a 3-node cluster. But a 4-node cluster cannot be protected this way against a network hiccup because:
- repmgr either requires the master to have at least one standby alive - a split into 2 groups of 2 nodes results in 2 masters, each with one standby
- or repmgr requires the master to have at least two standbys alive - a split into 2 groups of 2 nodes results in collapse of the whole cluster
So I would like to update my statement: it works well with an odd number of PG nodes in the cluster. A minimal config sketch for the 3-node case is below.
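Purely to illustrate the threshold discussed above, expressed through the chart's repmgrConfiguration value (a sketch, not a verified configuration):

postgresql:
  repmgrFenceOldPrimary: true
  repmgrConfiguration: |
    # 3-node cluster (1 primary + 2 standbys); a 5-node cluster would use 2
    child_nodes_connected_min_count=1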
We are ending up in the same condition.
3x Postgres pods + 1x PgPool.
If we create a network policy on the currently primary PostgreSQL pod, blocking all in/out traffic to simulate its node going offline, the results are:
- PgPool fails and loops trying to connect to the old primary forever
- Another primary is elected by the remaining two postgres pods, but pgpool does not fall back to it
If we then remove the policy, pgpool is now able to connect to the old primary, but we end up with two primary nodes.
We tried both with fencing enabled and with PgPool fallback enabled, but nothing helped.
Right now this chart is not HA and it should be stated that it should not be used in production.
Great discussion here and why I was hesitant to immediately change anything yet. Also, good to hear other people having similar problems.
I have a feeling the best bet is to not have repmgr failover and only allow pgpool failover.
Reading through the repmgr docs, it seems to require stopping postgresql and running repmgrd to elect a new repmgr primary, which makes it difficult to instrument inside Kubernetes (stopping the process would stop the container).
With that thought, I think it'd be best to configure just one (pgpool or repmgr) to allow failover and figure out the best way to allow any postgresql pod to be the primary and have everything follow it. I would prefer to do it in pgpool considering the repmgrd items mentioned above, but the pgpool docs are kind of light on how to go about doing this.
We think we have found a combination of parameters that seems to solve the issue.
Note we run a 3-node postgresql cluster.
For postgresql we set:
- Fencing enabled
- A custom liveness probe: if it detects the string running as primary, which is only seen on the old primary, we fail the liveness check, causing the pod to restart.
postgresql:
  replicaCount: 3
  repmgrFenceOldPrimary: true
  livenessProbe:
    enabled: false
  customLivenessProbe:
    failureThreshold: 6
    initialDelaySeconds: 5
    periodSeconds: 5
    successThreshold: 1
    timeoutSeconds: 15
    exec:
      command:
        - bash
        - -c
        - |
          PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "$POSTGRES_USER" -d "$POSTGRES_DB" -h 127.0.0.1 -p 5432 -c "SELECT 1"
          if [[ $? -eq 0 ]]
          then
            count=$(/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show | grep -i "running as primary" | wc -l)
            if [[ $count -gt 0 ]]
            then
              echo "Split Brain Detected"
              exit 1
            else
              exit 0
            fi
          else
            exit 1
          fi
On pgpool instead we enable failover_on_backend_error and tune the healthcheck timeouts:
pgpool:
  # The parameters below are necessary to overcome the split-brain condition.
  configuration: |-
    failover_on_backend_error = on
    search_primary_node_timeout = 30
    health_check_max_retries='2'
    health_check_period='1'
Thank you for the report! I will forward this to the team for evaluation.
I'm just now getting to test this out myself, trying it with the configurations @matteogazzadi posted. It ensures HA of the postgresql nodes, but not of connections through pgpool, which is a similar outcome to what I had proposed, but it still allows direct connections to postgresql to function until all postgresql pods are back up.
I did alter the health check count command to be as follows: /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show --compact | grep -i "primary" | wc -l
Just to document what occurs for others: a network interruption to postgresql-0 results in repmgr electing postgresql-1 as primary after the timeout periods. While no traffic can go into postgresql-0, postgresql-0 can still send traffic out, so it recognized postgresql-1 as the primary and followed it.
However, pgpool becomes unavailable until postgresql-0 becomes available again (after the pgpool health check for it officially fails).
Returning postgresql-0 to availability has it immediately join the repmgr cluster properly, any reboots of postgresql-0 have it respect postgresql-1 as the new repmgr primary, and pgpool turns back to being ready.
With that in mind, @javsalgar where is the team on doing the evaluation?
We did not want to go with failover_on_backend_error = on in pgpool: it marks a node as down in pgpool immediately when a ping to the PG node fails. The health check functionality is preferred because it does several retries.
After some testing, our current configuration converged to the code below. The config options are commented.
The goal of this setup is to keep restarts of pgpool to a minimum (ideally none) and support:
- master PG POD restart
- network split
- rolling upgrade of PG-HA cluster
The setup works well in my localhost test environment based on k3s/k3d. We have also had it deployed on an AWS testing cluster for about 2 weeks. We have had no two-masters scenario since then, but the pgpool pod restarts from time to time, about once a day. I have not yet investigated why that is.
I would love to know your opinion on it, particularly on the liveness probe check, which is significantly extended in comparison to the bitnami/pg-ha built-in one.
- Postgres config

postgresql:
  # turn on fence old primary node - see https://repmgr.org/docs/repmgr.html#REPMGRD-PRIMARY-CHILD-DISCONNECTION
  # it works well for odd number of PG nodes in the cluster
  repmgrFenceOldPrimary: true
  # requirement of fencing
  # see https://repmgr.org/docs/repmgr.html#REPMGRD-PRIMARY-CHILD-DISCONNECTION-CAVEATS
  extraEnvVars:
    - name: POSTGRESQL_CLUSTER_APP_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
  # wait X seconds after PG is down to give pgpool enough time to mark the node as down. It means pgpool does
  # not try to contact the "down" node, i.e. SHOW pool_nodes will not be blocked in case of a failed primary.
  # X should be set to > pgpool CHECK_PERIOD + (CHECK_TIMEOUT + RETRY_DELAY) * MAX_RETRIES,
  # but it also takes some time to execute the pre-stop code
  lifecycleHooks:
    preStop:
      exec:
        command:
          - bash
          - -c
          - /pre-stop.sh && sleep 20
  # make sure PG pod has enough time to terminate gracefully
  terminationGracePeriodSeconds: 60
- PgPool config

pgpool:
  livenessProbe:
    enabled: false
  customLivenessProbe:
    periodSeconds: 5
    # prolong timeout so that pgpool healthcheck finishes in case of unreachable node
    # (long "show pool_nodes" and long test to add node back to up state)
    timeoutSeconds: 10
    # give repmgr enough time to setup new replication schema
    failureThreshold: 6
    exec:
      command:
        - bash
        - -c
        - |
          set -o errexit
          set -o nounset
          set -o pipefail
          # Load libraries
          . /opt/bitnami/scripts/libpgpool.sh
          # Load Pgpool env. variables
          eval "$(pgpool_env)"
          mod_pgpool_healthcheck() {
              info "Checking pgpool health..."
              local backends
              # it is enough to wait maximally 5 seconds for result, it actually blocks only if some backend is down
              # timeout should be in sync with liveness probe timeout and number of nodes which could be down together
              # NOTE: this command blocks indefinitely if only standby nodes are up - missing active primary node
              backends="$(PGCONNECT_TIMEOUT=5 PGPASSWORD="$PGPOOL_POSTGRES_PASSWORD" psql -U "$PGPOOL_POSTGRES_USERNAME" \
                  -d postgres -h "$PGPOOL_TMP_DIR" -p "$PGPOOL_PORT_NUMBER" -tA -c "SHOW pool_nodes;")" || backends="command failed"
              if [[ "$backends" != "command failed" ]]; then
                  # store backends to temporary file for evaluation in case of unreachable backends
                  echo "$backends" > /tmp/pgpool_healthcheck_last_state.txt
                  # look up backends that are marked offline and being up - attach only status=down and pg_status=up
                  # situation down|down means PG is not yet ready to be attached
                  for node in $(echo "${backends}" | grep "down|up" | tr -d ' '); do
                      IFS="|" read -ra node_info <<< "$node"
                      local node_id="${node_info[0]}"
                      local node_host="${node_info[1]}"
                      if [[ $(PGCONNECT_TIMEOUT=3 PGPASSWORD="${PGPOOL_POSTGRES_PASSWORD}" psql -U "${PGPOOL_POSTGRES_USERNAME}" \
                          -d postgres -h "${node_host}" -p "${PGPOOL_PORT_NUMBER}" -tA -c "SELECT 1" || true) == 1 ]]; then
                          # attach backend if it has come back online
                          pgpool_attach_node "${node_id}"
                      fi
                  done
              elif [[ -f "/tmp/pgpool_healthcheck_last_state.txt" ]]; then
                  # some backends are not reachable or even active master is missing
                  # support both down|up and down|down - primary can be down|down and pool_nodes blocked, so lets test on availability of old primary
                  for node in $(grep "down" /tmp/pgpool_healthcheck_last_state.txt | tr -d ' '); do
                      IFS="|" read -ra node_info <<< "$node"
                      local node_id="${node_info[0]}"
                      local node_host="${node_info[1]}"
                      local DOWN_NODE_RECOVERY="command failed"
                      DOWN_NODE_RECOVERY=$(PGCONNECT_TIMEOUT=3 PGPASSWORD="${PGPOOL_POSTGRES_PASSWORD}" psql -U "${PGPOOL_POSTGRES_USERNAME}" \
                          -d postgres -h "${node_host}" -p "${PGPOOL_PORT_NUMBER}" -tA -c "SELECT pg_is_in_recovery();") || DOWN_NODE_RECOVERY="command failed"
                      if [[ "$DOWN_NODE_RECOVERY" == "t" ]]; then
                          # node in recovery - attach backend if it has come back online
                          pgpool_attach_node "${node_id}"
                      elif [[ "$DOWN_NODE_RECOVERY" == "f" ]]; then
                          # node is running as primary - promote node in pgpool
                          PCPPASSFILE_PROMOTE=$(mktemp /tmp/pcppass-XXXXX-promote)
                          export PCPPASSFILE_PROMOTE
                          echo "localhost:9898:${PGPOOL_ADMIN_USERNAME}:${PGPOOL_ADMIN_PASSWORD}" >"${PCPPASSFILE_PROMOTE}"
                          pcp_promote_node -h localhost -U "${PGPOOL_ADMIN_USERNAME}" -p 9898 -n "${node_id}" -w
                          rm -rf "${PCPPASSFILE_PROMOTE}"
                      fi
                  done
                  # consider healthcheck failed - there is only limited time to recover from last persisted state
                  return 1
              else
                  # no last state persisted and backends command failed
                  return 1
              fi
          }
          mod_pgpool_healthcheck
  extraEnvVars:
    # make health check config more aggressive. It marks failed PG backend as down sooner. It should be later
    # than repmgr timeout, but it works well even if this condition is not met. Only downside is that client
    # connections could be terminated unnecessarily.
    # make sure CHECK_PERIOD > (CHECK_TIMEOUT + CHECK_RETRY_DELAY) * CHECK_MAX_RETRIES
    - name: PGPOOL_HEALTH_CHECK_PERIOD
      value: "13"
    - name: PGPOOL_HEALTH_CHECK_TIMEOUT
      value: "2"
    - name: PGPOOL_HEALTH_CHECK_MAX_RETRIES
      value: "3"
    - name: PGPOOL_HEALTH_CHECK_RETRY_DELAY
      value: "2"
  initdbScripts:
    # connect_timeout is used in general for connections from pgpool to PG backends, make it short otherwise
    # liveness probe healthcheck script is too long to execute ("show pool_nodes" takes connect_timeout + ~ 2 sec
    # for one unreachable node)
    01_set_timeouts.sh: |
      pgpool_set_property "connect_timeout" "2000"
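Working the formula from the preStop comment in the Postgres config above through the health check values set here: CHECK_PERIOD + (CHECK_TIMEOUT + RETRY_DELAY) * MAX_RETRIES = 13 + (2 + 2) * 3 = 25 seconds, which the sleep 20 plus the time spent executing /pre-stop.sh is meant to cover.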
Adding @rafariossaa to the thread
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
@javsalgar @rafariossaa do you plan to evaluate the above-mentioned solutions, pick one, and put it into bitnami/postgresql-ha? We would love to have the issue addressed upstream.
I can help with preparing a PR for the second one.
Hi, sorry for the delay. Let me finish a couple of tasks that I have on my side right now and I will take a look at this. I would like to try to reproduce the issue on my side and evaluate the proposals.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
@rafariossaa I have created PRs to bitnami pgpool and postgresql-ha repositories to make my suggestion presented above easier to evaluate.
With these two PRs applied, the only change to PG-HA chart is to enable fencing mode for PG:
postgresql:
  # turn on fence old primary node - see https://repmgr.org/docs/repmgr.html#REPMGRD-PRIMARY-CHILD-DISCONNECTION
  # it works well for odd number of PG nodes in the cluster
  repmgrFenceOldPrimary: true
  # requirement of fencing
  # see https://repmgr.org/docs/repmgr.html#REPMGRD-PRIMARY-CHILD-DISCONNECTION-CAVEATS
  extraEnvVars:
    - name: POSTGRESQL_CLUSTER_APP_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
Hi @pcerny, thanks for the PR, we will review and merge it.
Hi,
The chart now includes the option to deploy witness nodes, could you take a look at this and check if that helps with the issue?
Going back to the description when the issue was opened, I think that in that case it could also happen that when you delete one node of the cluster, more than one postgres node is deleted. Could this happen?
If that is the case, maybe it is needed to evict the pods before shutting down the node, or also to play with anti-affinity (see the sketch below). This would depend on how big the Kubernetes cluster is and on the PostgreSQL deployment.
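As an illustration only, a hard pod anti-affinity would keep the PostgreSQL replicas on separate Kubernetes nodes so that draining a single node never takes down more than one of them (podAntiAffinityPreset is my assumption of the relevant chart value):

postgresql:
  # spread the postgresql replicas across different Kubernetes nodes
  podAntiAffinityPreset: hard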
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
Hi guys, same problem here. As @matteogazzadi mentioned above, we used customLivenessProbe in order to recover from a split-brain scenario by restarting the pod; unfortunately, after a few days we saw a slow and steady increase of memory which produced a restart of the pod. We also tried enabling the witness node in a configuration with 2 postgres replicas (1 primary and 1 standby) and 1 witness node, but when applying any kind of isolation to the primary, the standby is correctly promoted to new primary because the witness can't communicate with it. This leads, when the old primary comes back, to the same split-brain behavior as without the witness; even with a witness node, no action is taken to demote the old primary node. Reading tons of documentation, we found a new solution based on a custom configuration of repmgrd. Reading this page: https://repmgr.org/docs/repmgr.html#REPMGRD-PRIMARY-CHILD-DISCONNECTION-EVENTS
child_nodes_disconnect_command: user-definable script to be executed when repmgrd determines that an insufficient number of child nodes are connected. By default the script is executed when no child nodes are connected, but the execution threshold can be modified by setting one of child_nodes_connected_min_count or child_nodes_disconnect_min_count (see below).
The child_nodes_disconnect_command script can be any user-defined script or program. It must be able to be executed by the system user under which the PostgreSQL server itself runs (usually postgres).
Note: if child_nodes_disconnect_command is not set, no action will be taken.
If specified, the following format placeholder will be substituted when executing child_nodes_disconnect_command:
%p - ID of the node executing the child_nodes_disconnect_command script.
The child_nodes_disconnect_command script will only be executed once while the criteria for its execution are met. If the criteria for its execution are no longer met (i.e. some child nodes have reconnected), it will be executed again if the criteria for its execution are met again.
The child_nodes_disconnect_command script will not be executed if repmgrd is paused.
We used this customizable command like this:
postgresql:
  repmgrFenceOldPrimary: true
  repmgrConfiguration: |
    child_nodes_disconnect_command='/bin/bash -c "while (! pg_isready -h {{ .Release.Name }}-db-postgresql-witness.{{ .Release.Namespace }}.svc.cluster.local -U postgres -t 300) do sleep 10; done; kill -TERM 1"'
    child_nodes_connected_include_witness='true'
  livenessProbe:
    enabled: true
As you can see, this script loops indefinitely until the witness node is reachable again. When that happens, it means the node has rejoined the cluster and is available again, but still acting as a primary, so we terminate the node in order to restart it and force it to come up as a standby node. This configuration seems to work so far, even with different kinds of node isolation and different clusters.
To avoid data loss or some kind of inconsistency, we're considering the use of postgresql.UsePgRewind.
Do you see any drawbacks in this approach that we can't see? I'd love to have your opinion on that; it would be appreciated.
Thank you all for your support