
Replica node database fails to start after removing its $PGDATA

Open lynlvcheng opened this issue 10 months ago • 0 comments

Overview

Create a Postgres cluster with 3 replicas. After the $PGDATA directory of one replica is removed, that replica fails to start.

Environment

Please provide the following details:

  • Platform: (Kubernetes)
  • Platform Version: (v1.23.7)
  • PGO Image Tag: (e.g. ubi8-5.6.0-0)
  • Postgres Version: (e.g. 16)
  • Storage: (hostpath)

Steps to Reproduce

REPRO

Provide steps to get to the error condition:

  1. run 'kubectl apply -k kustomize/postgres'

  2. run 'kubectl get pv | grep repo'

  3. run 'kubectl get pv pvc-814fd0cb-ff6b-42bb-94ef-d068c602b281 -o yaml'

    below is the result (abridged); the hostPath is:

        ...
        hostPath:
          path: /var/lib/data/local-path-provisioner/pvc-814fd0cb-ff6b-42bb-94ef-d068c602b281_postgres-operator_hippo-repo1
        ...

  4. delete everything under the repo's hostPath: run 'cd /var/lib/data/local-path-provisioner/pvc-814fd0cb-ff6b-42bb-94ef-d068c602b281_postgres-operator_hippo-repo1', then run 'rm -rf *'

  5. run 'kubectl get pods -n postgres-operator'

    below is the result:

        NAME                      READY   STATUS      RESTARTS   AGE
        hippo-backup-xt6v-k445d   0/1     Completed   0          36m
        hippo-instance1-24sl-0    4/4     Running     0          21m
        hippo-instance1-9n7g-0    4/4     Running     0          14m
        hippo-instance1-l7h5-0    3/4     Running     0          19m
        hippo-repo-host-0         2/2     Running     0          36m
        pgo-7784d579df-glpz6      1/1     Running     0          47m

  6. pick one replica node and open a shell in it: run 'kubectl exec -it hippo-instance1-l7h5-0 -n postgres-operator -c database -- /bin/sh', then run 'patronictl list'

    below is the result:

        [root@k8s-master postgres]# kubectl exec -it hippo-instance1-l7h5-0 -n postgres-operator -c database -- /bin/sh
        sh-4.4$ patronictl list
        + Cluster: hippo-ha (7477835052125798486) -------------------+---------+-----------+----+-----------+
        | Member                 | Host                              | Role    | State     | TL | Lag in MB |
        +------------------------+-----------------------------------+---------+-----------+----+-----------+
        | hippo-instance1-24sl-0 | hippo-instance1-24sl-0.hippo-pods | Leader  | running   |  5 |           |
        | hippo-instance1-9n7g-0 | hippo-instance1-9n7g-0.hippo-pods | Replica | streaming |  5 |         0 |
        | hippo-instance1-l7h5-0 | hippo-instance1-l7h5-0.hippo-pods | Replica | stopped   |    |   unknown |
        +------------------------+-----------------------------------+---------+-----------+----+-----------+
        sh-4.4$

  7. go to the replica's PV and remove $PGDATA

    1. run 'kubectl get pv | grep l7h5'

       below is the result:

        pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5   1Gi   RWO   Delete   Bound   postgres-operator/hippo-instance1-l7h5-pgdata   hostpath   39m

    2. run 'kubectl get pv pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5 -o yaml'

       below is the result (abridged):

        ...
        hostPath:
          path: /var/lib/data/local-path-provisioner/pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5_postgres-operator_hippo-instance1-l7h5-pgdata

    3. run 'cd /var/lib/data/local-path-provisioner/pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5_postgres-operator_hippo-instance1-l7h5-pgdata'

    4. run 'rm -rf pg16'

  8. check the Postgres cluster status: run 'patronictl list'

    below is the result:

        sh-4.4$ patronictl list
        + Cluster: hippo-ha (7477835052125798486) -------------------+---------+-----------+----+-----------+
        | Member                 | Host                              | Role    | State     | TL | Lag in MB |
        +------------------------+-----------------------------------+---------+-----------+----+-----------+
        | hippo-instance1-24sl-0 | hippo-instance1-24sl-0.hippo-pods | Leader  | running   |  5 |           |
        | hippo-instance1-9n7g-0 | hippo-instance1-9n7g-0.hippo-pods | Replica | streaming |  5 |         0 |
        | hippo-instance1-l7h5-0 | hippo-instance1-l7h5-0.hippo-pods | Replica | stopped   |    |   unknown |
        +------------------------+-----------------------------------+---------+-----------+----+-----------+
        sh-4.4$
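
The PV-to-directory lookup in steps 2, 3 and 7 is done by hand. A minimal Python sketch of the same lookup follows, assuming the PV list is obtained as JSON (for example from 'kubectl get pv -o json') and that the volumes are hostPath-backed as in this report. The helper name host_path_for_claim and the trimmed SAMPLE_PV_LIST are hypothetical and only illustrate the shape of the data:

```python
import json
from typing import Optional


def host_path_for_claim(pv_list_json: str, namespace: str, claim: str) -> Optional[str]:
    """Return spec.hostPath.path of the PV bound to namespace/claim, or None."""
    for pv in json.loads(pv_list_json).get("items", []):
        spec = pv.get("spec", {})
        ref = spec.get("claimRef") or {}
        if ref.get("namespace") == namespace and ref.get("name") == claim:
            # Only hostPath-backed PVs carry this field; other volume types yield None.
            return (spec.get("hostPath") or {}).get("path")
    return None


# Trimmed stand-in for the PV list JSON, built from the values shown in step 7.
SAMPLE_PV_LIST = json.dumps({
    "items": [{
        "metadata": {"name": "pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5"},
        "spec": {
            "claimRef": {
                "namespace": "postgres-operator",
                "name": "hippo-instance1-l7h5-pgdata",
            },
            "hostPath": {
                "path": "/var/lib/data/local-path-provisioner/"
                        "pvc-45cf8f0e-3bf7-498e-b89b-454ca3aed1e5"
                        "_postgres-operator_hippo-instance1-l7h5-pgdata",
            },
        },
    }]
})

print(host_path_for_claim(SAMPLE_PV_LIST, "postgres-operator", "hippo-instance1-l7h5-pgdata"))
```

Matching on the claim reference rather than grepping the PV name avoids picking up an unrelated volume whose name happens to contain the same substring.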

EXPECTED

  1. the state of member 'hippo-instance1-l7h5-0' should be streaming.

ACTUAL

  1. the state of member 'hippo-instance1-l7h5-0' is stopped.
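
The expected-versus-actual check can be automated. The sketch below parses the text table printed by 'patronictl list' and reports members that are not in a healthy state. It assumes that "running" (leader) and "streaming" (replica) are the only healthy states, which matches this report but may not cover every Patroni version. The function names are hypothetical:

```python
def parse_patronictl_list(text):
    """Parse the ASCII table printed by `patronictl list` into a list of dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Data and header rows start with "|"; border and title rows start with "+".
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, members = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in members]


def unhealthy_members(members, healthy=("running", "streaming")):
    """Return the names of members whose State is not in `healthy` (an assumption)."""
    return [m["Member"] for m in members if m["State"] not in healthy]


# Table copied from step 8 of this report.
SAMPLE_TABLE = """
+ Cluster: hippo-ha (7477835052125798486) -------------------+---------+-----------+----+-----------+
| Member                 | Host                              | Role    | State     | TL | Lag in MB |
+------------------------+-----------------------------------+---------+-----------+----+-----------+
| hippo-instance1-24sl-0 | hippo-instance1-24sl-0.hippo-pods | Leader  | running   |  5 |           |
| hippo-instance1-9n7g-0 | hippo-instance1-9n7g-0.hippo-pods | Replica | streaming |  5 |         0 |
| hippo-instance1-l7h5-0 | hippo-instance1-l7h5-0.hippo-pods | Replica | stopped   |    |   unknown |
+------------------------+-----------------------------------+---------+-----------+----+-----------+
"""

print(unhealthy_members(parse_patronictl_list(SAMPLE_TABLE)))
```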

Logs

2025-03-04 06:47:35,702 INFO: Lock owner: hippo-instance1-24sl-0; I am hippo-instance1-l7h5-0
2025-03-04 06:47:35,720 INFO: trying to bootstrap from leader 'hippo-instance1-24sl-0'
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg16' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
WARN: repo1: [FileMissingError] unable to load info file '/pgbackrest/repo1/backup/db/backup.info' or '/pgbackrest/repo1/backup/db/backup.info.copy':
      FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info' for read
      FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info.copy' for read
      HINT: backup.info cannot be opened and is required to perform a backup.
      HINT: has a stanza-create been performed?
ERROR: [075]: no backup set found to restore
2025-03-04 06:47:35,755 ERROR: Error creating replica using method pgbackrest: 'bash' '-ceu' '--' 'install --directory --mode=0700 "${PGDATA?}" && exec "$@"' '-' 'pgbackrest' 'restore' '--delta' '--stanza=db' '--repo=1' '--link-map=pg_wal=/pgdata/pg16_wal' '--type=standby' exited with code=75
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:35,783 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:35,783 WARNING: Trying again in 5 seconds
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:40,817 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:40,818 ERROR: failed to bootstrap from leader 'hippo-instance1-24sl-0'
2025-03-04 06:47:40,818 INFO: Removing data directory: /pgdata/pg16
2025-03-04 06:47:45,702 INFO: Lock owner: hippo-instance1-24sl-0; I am hippo-instance1-l7h5-0
2025-03-04 06:47:45,702 INFO: trying to bootstrap from leader 'hippo-instance1-24sl-0'
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg16' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
WARN: repo1: [FileMissingError] unable to load info file '/pgbackrest/repo1/backup/db/backup.info' or '/pgbackrest/repo1/backup/db/backup.info.copy':
      FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info' for read
      FileMissingError: raised from remote-0 tls protocol on 'hippo-repo-host-0.hippo-pods.postgres-operator.svc.cluster.local.': unable to open missing file '/pgbackrest/repo1/backup/db/backup.info.copy' for read
      HINT: backup.info cannot be opened and is required to perform a backup.
      HINT: has a stanza-create been performed?
ERROR: [075]: no backup set found to restore
2025-03-04 06:47:45,728 ERROR: Error creating replica using method pgbackrest: 'bash' '-ceu' '--' 'install --directory --mode=0700 "${PGDATA?}" && exec "$@"' '-' 'pgbackrest' 'restore' '--delta' '--stanza=db' '--repo=1' '--link-map=pg_wal=/pgdata/pg16_wal' '--type=standby' exited with code=75
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:45,755 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:45,755 WARNING: Trying again in 5 seconds
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
pg_basebackup: removing contents of data directory "/pgdata/pg16"
2025-03-04 06:47:50,796 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2025-03-04 06:47:50,796 ERROR: failed to bootstrap from leader 'hippo-instance1-24sl-0'
2025-03-04 06:47:50,797 INFO: Removing data directory: /pgdata/pg16
2025-03-04 06:47:55,703 INFO: Lock owner: hippo-instance1-24sl-0; I am hippo-instance1-l7h5-0
2025-03-04 06:47:55,704 INFO: trying to bootstrap from leader 'hippo-instance1-24sl-0'
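
The log shows both replica creation methods failing on every retry. A small Python sketch that scans a log excerpt for the two error signatures seen above can make that explicit. The regular expressions are copied from the log lines in this report; the explanations attached to them are this sketch's reading of those lines (repo contents deleted in step 4, WAL directory left in place in step 7), not a confirmed diagnosis. The names SIGNATURES and diagnose are hypothetical:

```python
import re

# Error signatures taken verbatim from the log in this report.
# The hint strings are an interpretation, not output of any tool.
SIGNATURES = {
    "pgbackrest": (
        re.compile(r"ERROR: \[075\]: no backup set found to restore"),
        "repo has no backup.info, so there is no backup to restore from",
    ),
    "basebackup": (
        re.compile(r'pg_basebackup: error: directory "([^"]+)" exists but is not empty'),
        "the WAL directory still contains files, so pg_basebackup refuses to write into it",
    ),
}


def diagnose(log_text):
    """Map each failing replica creation method found in log_text to a hint."""
    findings = {}
    for method, (pattern, hint) in SIGNATURES.items():
        if pattern.search(log_text):
            findings[method] = hint
    return findings


# Short excerpt of the log above.
SAMPLE_LOG = """\
2025-03-04 06:47:35,720 INFO: trying to bootstrap from leader 'hippo-instance1-24sl-0'
ERROR: [075]: no backup set found to restore
pg_basebackup: error: directory "/pgdata/pg16_wal" exists but is not empty
2025-03-04 06:47:35,783 ERROR: Error when fetching backup: pg_basebackup exited with code=1
"""

for method, hint in diagnose(SAMPLE_LOG).items():
    print(f"{method}: {hint}")
```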

Additional Information

  1. cat /etc/patroni/~postgres-operator_cluster.yaml

    # Generated by postgres-operator. DO NOT EDIT.
    # Your changes will not be saved.
    ctl:
      cacert: /etc/patroni/~postgres-operator/patroni.ca-roots
      certfile: /etc/patroni/~postgres-operator/patroni.crt+key
      insecure: false
      keyfile: null
    kubernetes:
      labels:
        postgres-operator.crunchydata.com/cluster: hippo
      namespace: postgres-operator
      role_label: postgres-operator.crunchydata.com/role
      scope_label: postgres-operator.crunchydata.com/patroni
      use_endpoints: true
    postgresql:
      authentication:
        replication:
          sslcert: /tmp/replication/tls.crt
          sslkey: /tmp/replication/tls.key
          sslmode: verify-ca
          sslrootcert: /tmp/replication/ca.crt
          username: _crunchyrepl
        rewind:
          sslcert: /tmp/replication/tls.crt
          sslkey: /tmp/replication/tls.key
          sslmode: verify-ca
          sslrootcert: /tmp/replication/ca.crt
          username: _crunchyrepl
    restapi:
      cafile: /etc/patroni/~postgres-operator/patroni.ca-roots
      certfile: /etc/patroni/~postgres-operator/patroni.crt+key
      keyfile: null
      verify_client: optional
    scope: hippo-ha
    watchdog:
      mode: "off"

  2. cat /etc/patroni/~postgres-operator_instance.yaml

    # Generated by postgres-operator. DO NOT EDIT.
    # Your changes will not be saved.
    kubernetes: {}
    postgresql:
      basebackup:
      - waldir=/pgdata/pg16_wal
      create_replica_methods:
      - pgbackrest
      - basebackup
      pgbackrest:
        command: '''bash'' ''-ceu'' ''--'' ''install --directory --mode=0700 "${PGDATA?}" && exec "$@"'' ''-'' ''pgbackrest'' ''restore'' ''--delta'' ''--stanza=db'' ''--repo=1'' ''--link-map=pg_wal=/pgdata/pg16_wal'' ''--type=standby'''
        keep_data: true
        no_master: true
        no_params: true
      pgpass: /tmp/.pgpass
      use_unix_socket: true
    restapi: {}
    tags: {}

lynlvcheng, Mar 04 '25 08:03