postgres-operator icon indicating copy to clipboard operation
postgres-operator copied to clipboard

postgresql cannot boot because of the database backup will not be usable without all the WAL segments

Open godzilla-s opened this issue 2 years ago • 3 comments

Please, answer some short questions which should help us to understand your problem / question better?

  • Which image of the operator are you using? registry.opensource.zalan.do/acid/postgres-operator:v1.8.2
  • Where do you run it - cloud or metal? Kubernetes or OpenShift? Bare Metal K8s
  • Are you running Postgres Operator in production? no
  • Type of issue? feature request

Some general remarks when posting a bug report:

  • Please, check the operator, pod (Patroni) and postgresql logs first. When copy-pasting many log lines please do it in a separate GitHub gist together with your Postgres CRD and configuration manifest.
  • If you feel this issue might be more related to the Spilo docker image or Patroni, consider opening issues in the respective repos.

Hi, I create a postgresql instance with follow config:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-test
spec:
  teamId: "acid"
  numberOfInstances: 2
  enableMasterLoadBalancer: true
  postgresql:
    version: "14"
    parameters:
      wal_level: "archive"
      # archive_command: 'sh /script/backup.sh'
  volume:
    size: 1Gi
  additionalVolumes:
    - name: backup-script
      mountPath: /tmp
      targetContainers:
        - all
      volumeSource:
        configMap:
          name: backup-script
  env:
    - name: AWS_ACCESS_KEY_ID
      value: minioadmin
    - name: AWS_SECRET_ACCESS_KEY
      value: minioadmin
    - name: PGUSER
      valueFrom:
        secretKeyRef:
          key: username
          name: postgres.acid-test.credentials.postgresql.acid.zalan.do
    - name: PGPASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: postgres.acid-test.credentials.postgresql.acid.zalan.do
    - name: AWS_ENDPOINT
      value: http://10.10.15.114:9001
    - name: AWS_S3_FORCE_PATH_STYLE
      value: "true"
    - name: USE_WALG_BACKUP
      value: "true"
    - name: WALG_S3_PREFIX
      value: s3://test
  resources:
    requests:
      cpu: 10m
      memory: 100Mi
    limits:
      cpu: 1000m
      memory: 1Gi

after commit to k8s , it can't create , like this:

zuvakin@linux:~/projects/k8s/postgresql$ kubectl get pod -n test
NAME                               READY   STATUS    RESTARTS   AGE
acid-test-1                        0/1     Running   0          9m1s
acid-test-0                        1/1     Running   0          9m11s

acid-test-1 can't boot nornally, and logs as below:

2022-07-26 07:51:21,425 - bootstrapping - INFO - Configuring standby-cluster
2022-07-26 07:51:21,425 - bootstrapping - INFO - Configuring patroni
2022-07-26 07:51:21,431 - bootstrapping - INFO - Writing to file /run/postgres.yml
2022-07-26 07:51:22,635 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2022-07-26 07:51:22,644 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-07-26 07:51:22,646 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:51:22,657 INFO: trying to bootstrap from leader 'acid-test-0'
2022-07-26 07:51:24,798 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:51:24,811 INFO: bootstrap from leader 'acid-test-0' in progress
Can not find any backups
2022-07-26 07:51:24,832 ERROR: Error creating replica using method wal_e: envdir /run/etc/wal-e.d/env bash /scripts/wale_restore.sh exited with code=1
2022-07-26 07:51:34,837 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:51:34,837 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:51:44,788 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:51:44,788 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:51:54,795 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:51:54,795 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:52:04,789 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:52:04,789 INFO: bootstrap from leader 'acid-test-0' in progress
1024+0 records in
1024+0 records out
16777216 bytes (17 MB, 16 MiB) copied, 0.0341451 s, 491 MB/s
NOTICE:  base backup done, waiting for required WAL segments to be archived
2022-07-26 07:52:14,832 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:52:14,832 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:52:24,795 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:52:24,795 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:52:34,793 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:52:34,793 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:52:44,795 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:52:44,796 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:52:54,787 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:52:54,787 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:53:04,791 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:53:04,791 INFO: bootstrap from leader 'acid-test-0' in progress
WARNING:  still waiting for all required WAL segments to be archived (60 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  You can safely cancel this backup, but the database backup will not be usable without all the WAL segments.
2022-07-26 07:53:14,835 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:53:14,836 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:53:24,788 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:53:24,788 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:53:34,832 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:53:34,832 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:53:44,792 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:53:44,793 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:53:54,825 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:53:54,825 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:54:04,783 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:54:04,783 INFO: bootstrap from leader 'acid-test-0' in progress
WARNING:  still waiting for all required WAL segments to be archived (120 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  You can safely cancel this backup, but the database backup will not be usable without all the WAL segments.
2022-07-26 07:54:14,799 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:54:14,799 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:54:24,831 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:54:24,831 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:54:34,800 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:54:34,800 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:54:44,798 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:54:44,798 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:54:54,828 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:54:54,828 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:55:04,798 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:55:04,798 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:55:14,801 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:55:14,801 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:55:24,824 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:55:24,824 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:55:34,781 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:55:34,781 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:55:44,784 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:55:44,785 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:55:54,828 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:55:54,828 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:56:04,795 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:56:04,795 INFO: bootstrap from leader 'acid-test-0' in progress
WARNING:  still waiting for all required WAL segments to be archived (240 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  You can safely cancel this backup, but the database backup will not be usable without all the WAL segments.
2022-07-26 07:56:14,789 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:56:14,789 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:56:24,825 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:56:24,825 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:56:34,797 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:56:34,797 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:56:44,793 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:56:44,793 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:56:54,827 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:56:54,827 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:57:04,793 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:57:04,793 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:57:14,794 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:57:14,794 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:57:24,790 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:57:24,791 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:57:34,836 INFO: Lock owner: acid-test-0; I am acid-test-1
2022-07-26 07:57:34,836 INFO: bootstrap from leader 'acid-test-0' in progress
2022-07-26 07:57:44,793 INFO: Lock owner: acid-test-0; I am acid-test-1

it appear ' Check that your archive_command is executing properly. You can safely cancel this backup, but the database backup will not be usable without all the WAL segments'

what's wrong with it , does the CR correct?

godzilla-s avatar Jul 26 '22 08:07 godzilla-s

It is not a problem of postgre-operator or spilo. They only operate with the configuration you provide. If something is broken - you have to figure out what and fix it.

CyberDem0n avatar Aug 01 '22 07:08 CyberDem0n

@godzilla-s Just stumbling over it while checking for a different backup-related issue: does the configuration style mentioned in this other issue answer help? https://github.com/zalando/postgres-operator/issues/1965#issuecomment-1185387027

philippfreyer avatar Aug 04 '22 09:08 philippfreyer

@godzilla-s you need to check logs of pod acid-test-0 This trouble is occured because acid-test-0 can't connect to S3 storage for make base backup and archive wal files. And acid-test-1 is waiting for all wal files from S3 storage. When I investigated this trouble I entered into pod acid-test-0 and run a script for making backup: envdir "/run/etc/wal-e.d/env" /scripts/postgres_backup.sh "/home/postgres/pgdata/pgroot/data" and checked an errors.

There is example of good logs of master pod

GRANT
GRANT
CREATE VIEW
REVOKE
GRANT
GRANT
RESET
2022-08-09 10:03:53,205 INFO: no action. I am (id567-pgsql-92f50bc3-1), the leader with the lock
2022-08-09 10:04:03,246 INFO: no action. I am (id567-pgsql-92f50bc3-1), the leader with the lock
2022-08-09 10:04:13,204 INFO: no action. I am (id567-pgsql-92f50bc3-1), the leader with the lock
2022-08-09 10:04:23,228 INFO: no action. I am (id567-pgsql-92f50bc3-1), the leader with the lock
2022-08-09 10:04:33,187 INFO: no action. I am (id567-pgsql-92f50bc3-1), the leader with the lock

2022-08-09 10:04:37.074 - /scripts/postgres_backup.sh - I was called as: /scripts/postgres_backup.sh /home/postgres/pgdata/pgroot/data
2022-08-09 10:04:37.148 - /scripts/postgres_backup.sh - producing a new backup
INFO: 2022/08/09 10:04:37.254145 Calling pg_start_backup()
INFO: 2022/08/09 10:04:37.779208 Starting a new tar bundle
INFO: 2022/08/09 10:04:37.779243 Walking ...
INFO: 2022/08/09 10:04:37.780051 Starting part 1 ...
2022-08-09 10:04:38.075 38 LOG Starting pgqd 3.3
2022-08-09 10:04:38.075 38 LOG auto-detecting dbs ...
INFO: 2022/08/09 10:04:39.741977 Packing ...
INFO: 2022/08/09 10:04:39.744801 Finished writing part 1.
INFO: 2022/08/09 10:04:39.856157 Starting part 2 ...
INFO: 2022/08/09 10:04:39.856185 /global/pg_control
INFO: 2022/08/09 10:04:39.858861 Finished writing part 2.
INFO: 2022/08/09 10:04:39.858872 Calling pg_stop_backup()
INFO: 2022/08/09 10:04:41.166761 Starting part 3 ...
INFO: 2022/08/09 10:04:41.180452 backup_label
INFO: 2022/08/09 10:04:41.180472 tablespace_map
INFO: 2022/08/09 10:04:41.180534 Finished writing part 3.
INFO: 2022/08/09 10:04:41.210225 Wrote backup with name base_00000007000000000000001D

2022-08-09 10:04:43,251 INFO: no action. I am (id567-pgsql-92f50bc3-1), the leader with the lock
2022-08-09 10:04:53,205 INFO: no action. I am (id567-pgsql-92f50bc3-1), the leader with the lock
2022-08-09 10:05:03,185 INFO: no action. I am (id567-pgsql-92f50bc3-1), the leader with the lock
2022-08-09 10:05:08.104 38 LOG {ticks: 0, maint: 0, retry: 0}
2022-08-09 10:05:13,227 INFO: no action. I am (id567-pgsql-92f50bc3-1), the leader with the lock

leader's pod botstrap is finished only after S3 backup is done

FactorT avatar Aug 09 '22 13:08 FactorT

something wrong with s3 server, thanks all

godzilla-s avatar Sep 19 '22 07:09 godzilla-s

it worked now

godzilla-s avatar Sep 19 '22 07:09 godzilla-s