Replica database fails to start after its $PGDATA is removed
Overview
Created a Postgres cluster with 3 replicas. After the $PGDATA directory of one replica was removed, that replica failed to start.
Environment
Please provide the following details:
- Platform: Kubernetes
- Platform Version: v1.23.7
- PGO Image Tag: ubi8-5.6.0-0
- Postgres Version: 16
- Storage: hostpath
Steps to Reproduce
REPRO
Provide steps to get to the error condition:
- Run 'kubectl apply -k kustomize/postgres'.
- Run 'kubectl get pv | grep repo'.
- Run 'kubectl get pv pvc-814fd0cb-ff6b-42bb-94ef-d068c602b281 -o yaml'. The result shows the hostPath of the repo volume:
  .............
  hostPath:
    path: /var/lib/data/local-path-provisioner/pvc-814fd0cb-ff6b-42bb-94ef-d068c602b281_postgres-operator_hippo-repo1
  ..............
- Delete everything under the repo's hostPath:
  run 'cd /var/lib/data/local-path-provisioner/pvc-814fd0cb-ff6b-42bb-94ef-d068c602b281_postgres-operator_hippo-repo1'
  run 'rm -rf *'
- Run 'kubectl get pods -n postgres-operator'. Below is the result:
  NAME                      READY   STATUS      RESTARTS   AGE
  hippo-backup-xt6v-k445d   0/1     Completed   0          36m
  hippo-instance1-24sl-0    4/4     Running     0          21m
  hippo-instance1-9n7g-0    4/4     Running     0          14m
  hippo-instance1-l7h5-0    3/4     Running     0          19m
  hippo-repo-host-0         2/2     Running     0          36m
  pgo-7784d579df-glpz6      1/1     Running     0          47m
- Pick one of the replica pods, open a shell in it, and run 'patronictl list':
  run 'kubectl exec -it hippo-instance1-l7h5-0 -n postgres-operator -c database -- /bin/sh'
  Below is the result:
  [root@k8s-master postgres]# kubectl exec -it hippo-instance1-l7h5-0 -n postgres-operator -c database -- /bin/sh
  sh-4.4$ patronictl list
  + Cluster: hippo-ha (7477835052125798486) -------------------+---------+-----------+----+-----------+
  | Member                 | Host                              | Role    | State     | TL | Lag in MB |
  +------------------------+-----------------------------------+---------+-----------+----+-----------+
  | hippo-instance1-24sl-0 | hippo-instance1-24sl-0.hippo-pods | Leader  | running   |  5 |           |
  | hippo-instance1-9n7g-0 | hippo-instance1-9n7g-0.hippo-pods | Replica | streaming |  5 |         0 |
  | hippo-instance1-l7h5-0 | hippo-instance1-l7h5-0.hippo-pods | Replica | stopped   |    |   unknown |
  +------------------------+-----------------------------------+---------+-----------+----+-----------+
  sh-4.4$
- Go to the replica's PV and remove $PGDATA:
  1) Run 'kubectl get pv | grep l7h5'. Below is the result:
     pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5   1Gi   RWO   Delete   Bound   postgres-operator/hippo-instance1-l7h5-pgdata   hostpath   39m
  2) Run 'kubectl get pv pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5 -o yaml'. Below is the result:
     ................
     hostPath:
       path: /var/lib/data/local-path-provisioner/pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5_postgres-operator_hippo-instance1-l7h5-pgdata
  3) Run 'cd /var/lib/data/local-path-provisioner/pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5_postgres-operator_hippo-instance1-l7h5-pgdata'
  4) Run 'rm -rf pg16'
- Check the Postgres cluster status. Run 'patronictl list'. Below is the result:
  sh-4.4$ patronictl list
  + Cluster: hippo-ha (7477835052125798486) -------------------+---------+-----------+----+-----------+
  | Member                 | Host                              | Role    | State     | TL | Lag in MB |
  +------------------------+-----------------------------------+---------+-----------+----+-----------+
  | hippo-instance1-24sl-0 | hippo-instance1-24sl-0.hippo-pods | Leader  | running   |  5 |           |
  | hippo-instance1-9n7g-0 | hippo-instance1-9n7g-0.hippo-pods | Replica | streaming |  5 |         0 |
  | hippo-instance1-l7h5-0 | hippo-instance1-l7h5-0.hippo-pods | Replica | stopped   |    |   unknown |
  +------------------------+-----------------------------------+---------+-----------+----+-----------+
  sh-4.4$
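The status check in the last step can be scripted. Below is a minimal sketch, assuming the table format that 'patronictl list' prints in this report (pipe-separated columns with State in the fourth column). The function name unhealthy_members is hypothetical and not part of Patroni or PGO.

```shell
# Hypothetical helper: read the table printed by `patronictl list` on stdin
# and print "<member> <state>" for every member whose state is neither
# "running" nor "streaming".
unhealthy_members() {
  awk -F'|' '
    # Data rows start with "|"; border rows start with "+" and are skipped.
    /^\|/ && NF >= 5 {
      member = $2
      state = $5
      gsub(/^[ \t]+|[ \t]+$/, "", member)   # trim column padding
      gsub(/^[ \t]+|[ \t]+$/, "", state)
      if (member == "Member") next          # skip the header row
      if (state != "running" && state != "streaming") print member, state
    }'
}

# Usage (not executed here):
#   kubectl exec hippo-instance1-l7h5-0 -n postgres-operator -c database -- \
#     patronictl list | unhealthy_members
```

For the table shown in this report, the helper prints `hippo-instance1-l7h5-0 stopped`.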
EXPECTED
- The state of member 'hippo-instance1-l7h5-0' should be streaming.
ACTUAL
- The state of member 'hippo-instance1-l7h5-0' is stopped.
Logs
2025-03-04 06:47:35,702 INFO: Lock owner: hippo-instance1-24sl-0; I am hippo-instance1-l7h5-0
2025-03-04 06:47:35,720 INFO: trying to bootstrap from leader 'hippo-instance1-24sl-0'
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg16' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
WARN: repo1: [FileMissingError] unable to load info file '/pgbackrest/repo1/backup/db/backup.info' or '/pgbackrest/repo1/backup/db/backup.info.copy':
FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info' for read
FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info.copy' for read
HINT: backup.info cannot be opened and is required to perform a backup.
HINT: has a stanza-create been performed?
ERROR: [075]: no backup set found to restore
2025-03-04 06:47:35,755 ERROR: Error creating replica using method pgbackrest: 'bash' '-ceu' '--' 'install --directory --mode=0700 "${PGDATA?}" && exec "$@"' '-' 'pgbackrest' 'restore' '--delta' '--stanza=db' '--repo=1' '--link-map=pg_wal=/pgdata/pg16_wal' '--type=standby' exited with code=75
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:35,783 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:35,783 WARNING: Trying again in 5 seconds
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:40,817 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:40,818 ERROR: failed to bootstrap from leader 'hippo-instance1-24sl-0'
2025-03-04 06:47:40,818 INFO: Removing data directory: /pgdata/pg16
2025-03-04 06:47:45,702 INFO: Lock owner: hippo-instance1-24sl-0; I am hippo-instance1-l7h5-0
2025-03-04 06:47:45,702 INFO: trying to bootstrap from leader 'hippo-instance1-24sl-0'
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg16' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
WARN: repo1: [FileMissingError] unable to load info file '/pgbackrest/repo1/backup/db/backup.info' or '/pgbackrest/repo1/backup/db/backup.info.copy':
FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info' for read
FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info.copy' for read
HINT: backup.info cannot be opened and is required to perform a backup.
HINT: has a stanza-create been performed?
ERROR: [075]: no backup set found to restore
2025-03-04 06:47:45,728 ERROR: Error creating replica using method pgbackrest: 'bash' '-ceu' '--' 'install --directory --mode=0700 "${PGDATA?}" && exec "$@"' '-' 'pgbackrest' 'restore' '--delta' '--stanza=db' '--repo=1' '--link-map=pg_wal=/pgdata/pg16_wal' '--type=standby' exited with code=75
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:45,755 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:45,755 WARNING: Trying again in 5 seconds
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:50,796 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:50,796 ERROR: failed to bootstrap from leader 'hippo-instance1-24sl-0'
2025-03-04 06:47:50,797 INFO: Removing data directory: /pgdata/pg16
2025-03-04 06:47:55,703 INFO: Lock owner: hippo-instance1-24sl-0; I am hippo-instance1-l7h5-0
2025-03-04 06:47:55,704 INFO: trying to bootstrap from leader 'hippo-instance1-24sl-0'
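The log repeats the same two failures on every retry: the pgbackrest restore finds no backup set in the repo, and pg_basebackup refuses to write into a non-empty "/pgdata/pg16_wal". Below is a minimal sketch for collapsing such a log to its distinct tool errors, assuming the log arrives one message per line on stdin. The function name bootstrap_failure_causes is hypothetical.

```shell
# Hypothetical helper: collapse a Patroni log to the distinct errors reported
# by the replica-creation tools, with a count for each.
bootstrap_failure_causes() {
  # Keep pgBackRest "ERROR: [NNN]: ..." lines and pg_basebackup error lines,
  # then count how often each distinct message occurs.
  grep -E '^(ERROR: \[[0-9]+\]: |pg_basebackup: error: )' | sort | uniq -c
}

# Usage (not executed here):
#   kubectl logs hippo-instance1-l7h5-0 -n postgres-operator -c database \
#     | bootstrap_failure_causes
```

Patroni's own timestamped lines are filtered out on purpose, so only the underlying pgbackrest and pg_basebackup messages remain.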
Additional Information
- cat /etc/patroni/~postgres-operator_cluster.yaml
# Generated by postgres-operator. DO NOT EDIT.
# Your changes will not be saved.
ctl:
  cacert: /etc/patroni/~postgres-operator/patroni.ca-roots
  certfile: /etc/patroni/~postgres-operator/patroni.crt+key
  insecure: false
  keyfile: null
kubernetes:
  labels:
    postgres-operator.crunchydata.com/cluster: hippo
  namespace: postgres-operator
  role_label: postgres-operator.crunchydata.com/role
  scope_label: postgres-operator.crunchydata.com/patroni
  use_endpoints: true
postgresql:
  authentication:
    replication:
      sslcert: /tmp/replication/tls.crt
      sslkey: /tmp/replication/tls.key
      sslmode: verify-ca
      sslrootcert: /tmp/replication/ca.crt
      username: _crunchyrepl
    rewind:
      sslcert: /tmp/replication/tls.crt
      sslkey: /tmp/replication/tls.key
      sslmode: verify-ca
      sslrootcert: /tmp/replication/ca.crt
      username: _crunchyrepl
restapi:
  cafile: /etc/patroni/~postgres-operator/patroni.ca-roots
  certfile: /etc/patroni/~postgres-operator/patroni.crt+key
  keyfile: null
  verify_client: optional
scope: hippo-ha
watchdog:
  mode: "off"
- cat /etc/patroni/~postgres-operator_instance.yaml
# Generated by postgres-operator. DO NOT EDIT.
# Your changes will not be saved.
kubernetes: {}
postgresql:
  basebackup:
  - waldir=/pgdata/pg16_wal
  create_replica_methods:
  - pgbackrest
  - basebackup
  pgbackrest:
    command: '''bash'' ''-ceu'' ''--'' ''install --directory --mode=0700 "${PGDATA?}" && exec "$@"'' ''-'' ''pgbackrest'' ''restore'' ''--delta'' ''--stanza=db'' ''--repo=1'' ''--link-map=pg_wal=/pgdata/pg16_wal'' ''--type=standby'''
    keep_data: true
    no_master: true
    no_params: true
  pgpass: /tmp/.pgpass
  use_unix_socket: true
restapi: {}
tags: {}
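The instance configuration passes waldir=/pgdata/pg16_wal to the basebackup method, and the log shows pg_basebackup rejecting that directory because it exists and is not empty. Only "pg16" was removed in the reproduction steps, so the WAL directory was left in place. Below is a minimal sketch for checking that condition from a shell inside the database container. The function name dir_state is hypothetical.

```shell
# Hypothetical helper: report whether a directory is missing, empty, or
# not empty. pg_basebackup fails with "exists but is not empty" in the
# last case.
dir_state() {
  dir=$1
  if [ ! -d "$dir" ]; then
    echo "missing"
  elif [ -z "$(ls -A "$dir")" ]; then
    echo "empty"
  else
    echo "not empty"
  fi
}

# Usage inside the container (not executed here):
#   dir_state /pgdata/pg16_wal
#   dir_state /pgdata/pg16
```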