
bootstrapFromBackup not working

Open ebrodie opened this issue 4 years ago • 9 comments

  1. I created a timescaledb-single cluster with backups enabled, it ran fine for several months.
  2. I destroyed the stack completely using tf destroy commands but kept the S3 backups in place.
  3. I created a new timescaledb-single cluster with the same name/settings, with bootstrapFromBackup enabled
  4. But the new environment is not being restored from the S3 backups

Any ideas?

ebrodie avatar Jan 19 '21 19:01 ebrodie

Here's the clause I've used in my yaml file:

bootstrapFromBackup:
  enabled: true
  repo1-path: /old_backups/default/tf-ids-timescale-preprd

ebrodie avatar Jan 19 '21 19:01 ebrodie

Could you share some more information, notably the timescaledb container logs the first time it is started.

Also, it would be very useful if you provide most of the relevant information that is asked for bug reports here: https://github.com/timescale/timescaledb-kubernetes/blob/master/.github/ISSUE_TEMPLATE/bug_report.md

feikesteenbergen avatar Jan 20 '21 10:01 feikesteenbergen

Bug/Issue Description: When attempting to use the "bootstrapFromBackup" clause to restore a new environment from its previous incarnation, no restore is performed. In my example, I have an EKS qatest environment with backups to S3. I destroyed the stack completely but kept the S3 bucket intact. I added the bootstrapFromBackup clause to the values.yaml file and rebuilt the same stack from scratch. After the rebuild, the previous backups were not restored; instead the stack just doesn't come up. It produces errors like "ERROR: [044]: PostgreSQL version 11, system-id 6920319053559844912 do not match stanza version 11, system-id 6883169372254236720"

Steps to reproduce the behavior, including helm install or helm upgrade commands:

  1. Create the stack with backups to S3 enabled
  2. Take backups to S3
  3. Destroy the stack (including PVC volumes and node group)
  4. Recreate the stack with the same parameters, except adding the "bootstrapFromBackup" clause in values.yaml (a rough sketch of equivalent Helm commands follows this list)
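
In practice the stack is created and destroyed through Terraform rather than by direct helm commands, so the following is only an illustrative sketch; the chart source and values file names are placeholders:

```bash
# 1st stack: backups to S3 enabled, no bootstrapFromBackup
helm install tf-ids-timescale-qatest <chart-repo>/timescaledb-single -f values.yaml

# destroy the stack completely (PVC volumes and node group included), keep the S3 bucket

# 2nd stack: same values plus the bootstrapFromBackup clause
helm install tf-ids-timescale-qatest <chart-repo>/timescaledb-single -f values-with-bootstrap.yaml
```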

Expected behavior: Expect the DB to be restored from the backups in S3 from the previous stack, but it doesn't restore.

values.yaml for the 1st stack before destroying it:

replicaCount: 3

replicaLoadBalancer:
  enabled: True
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "4000"
loadBalancer:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "4000"

patroni:
  bootstrap:
    method: restore_or_initdb
    restore_or_initdb:
      command: >
        /etc/timescaledb/scripts/restore_or_initdb.sh
        --encoding=UTF8
        --locale=C.UTF-8
        --wal-segsize=256
    dcs:
      synchronous_mode: true
      master_start_timeout: 0
      postgresql:
        use_slots: false
        parameters:
          archive_timeout: 1h
          checkpoint_timeout: 600s
          temp_file_limit: '200GB'
          synchronous_commit: remote_apply
          max_connections: '${max_connections}'
          wal_keep_segments: 200

persistentVolumes:
  data:
    size: '${ebs_vol_size}'
  wal:
    enabled: True
    size: '${wal_vol_size}'

timescaledbTune:
  enabled: true

backup:
  enabled: true
  pgBackRest:archive-push:
    process-max: 4
    archive-async: "y"
  pgBackRest:archive-get:
    process-max: 4
    archive-async: "y"
    archive-get-queue-max: 2GB
  jobs:
    - name: fl-dly
      type: full
      schedule: "18 0 * * *"
    - name: inc-hly
      type: incr
      schedule: "18 1-23 * * *"

values.yaml for the 2nd stack I expected to be restored from the S3 bucket:

replicaCount: 3

replicaLoadBalancer:
  enabled: True
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "4000"
loadBalancer:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "4000"

patroni:
  bootstrap:
    method: restore_or_initdb
    restore_or_initdb:
      command: >
        /etc/timescaledb/scripts/restore_or_initdb.sh
        --encoding=UTF8
        --locale=C.UTF-8
        --wal-segsize=256
    dcs:
      synchronous_mode: true
      master_start_timeout: 0
      postgresql:
        use_slots: false
        parameters:
          archive_timeout: 1h
          checkpoint_timeout: 600s
          temp_file_limit: '200GB'
          synchronous_commit: remote_apply
          max_connections: '${max_connections}'
          wal_keep_segments: 200

persistentVolumes:
  data:
    size: '${ebs_vol_size}'
  wal:
    enabled: True
    size: '${wal_vol_size}'

timescaledbTune:
  enabled: true

backup:
  enabled: true
  pgBackRest:archive-push:
    process-max: 4
    archive-async: "y"
  pgBackRest:archive-get:
    process-max: 4
    archive-async: "y"
    archive-get-queue-max: 2GB
  jobs:
    - name: fl-dly
      type: full
      schedule: "18 0 * * *"
    - name: inc-hly
      type: incr
      schedule: "18 1-23 * * *"

bootstrapFromBackup:
  enabled: true
  repo1-path: /old_backups/default/tf-ids-timescale-qatest

What version of the Chart are you using?

$ helm show chart timescaledb-single
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /Users/ebrodie/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /Users/ebrodie/.kube/config
apiVersion: v1
description: TimescaleDB HA Deployment.
home: https://github.com/timescale/timescaledb-kubernetes
maintainers:
- email: [email protected]
  name: TimescaleDB
name: timescaledb-single
sources:
- https://github.com/timescale/timescaledb-kubernetes
- https://github.com/timescale/timescaledb-docker-ha
- https://github.com/zalando/patroni
version: 0.7.1

What is your Kubernetes Environment (for example: GKE, EKS, minikube, microk8s):

EKS

Deployment: Please share some details of what is in your Kubernetes environment, for example:

$ kubectl get all,secret,configmap,endpoints,pvc -L role -l release=tf-ids-timescale-qatest
NAME                                        READY   STATUS    RESTARTS   AGE   ROLE
pod/tf-ids-timescale-qatest-timescaledb-0   2/2     Running   0          21h   master
pod/tf-ids-timescale-qatest-timescaledb-1   2/2     Running   0          21h   replica
pod/tf-ids-timescale-qatest-timescaledb-2   2/2     Running   0          21h   replica

NAME                                                 TYPE           CLUSTER-IP       EXTERNAL-IP                                                                        PORT(S)          AGE   ROLE
service/tf-ids-timescale-qatest                      LoadBalancer   172.20.177.253   internal-a9c244bd138094737b6271b30c3910c4-1702785447.us-west-2.elb.amazonaws.com   5432:31095/TCP   21h   master
service/tf-ids-timescale-qatest-config               ClusterIP      None             <none>                                                                             8008/TCP         21h
service/tf-ids-timescale-qatest-timescaledb-backup   ClusterIP      172.20.230.145   <none>                                                                             8081/TCP         21h

NAME                                                   READY   AGE   ROLE
statefulset.apps/tf-ids-timescale-qatest-timescaledb   3/3     21h

NAME                                                        SCHEDULE        SUSPEND   ACTIVE   LAST SCHEDULE   AGE   ROLE
cronjob.batch/tf-ids-timescale-qatest-timescaledb-fl-dly    18 0 * * *      False     0        19h             21h
cronjob.batch/tf-ids-timescale-qatest-timescaledb-inc-hly   18 1-23 * * *   False     0        7m37s           21h

NAME                                                       DATA   AGE   ROLE
configmap/tf-ids-timescale-qatest-timescaledb-patroni      1      21h
configmap/tf-ids-timescale-qatest-timescaledb-pgbackrest   1      21h
configmap/tf-ids-timescale-qatest-timescaledb-scripts      8      21h

NAME                                                   ENDPOINTS                                                  AGE   ROLE
endpoints/tf-ids-timescale-qatest                      10.167.36.194:5432                                         21h
endpoints/tf-ids-timescale-qatest-config               10.167.35.230:8008,10.167.36.194:8008,10.167.37.110:8008   21h
endpoints/tf-ids-timescale-qatest-sync                 <none>                                                     38d
endpoints/tf-ids-timescale-qatest-timescaledb-backup   10.167.36.194:8081                                         21h

NAME                                                                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE    ROLE
persistentvolumeclaim/storage-volume-tf-ids-timescale-qatest-timescaledb-0   Bound    pvc-3be16491-bfcb-4039-ac1f-c302464886f9   500Gi      RWO            gp2            100d
persistentvolumeclaim/storage-volume-tf-ids-timescale-qatest-timescaledb-1   Bound    pvc-9a4a0ffa-fa46-4227-934e-d804e6d4ac2d   500Gi      RWO            gp2            100d
persistentvolumeclaim/storage-volume-tf-ids-timescale-qatest-timescaledb-2   Bound    pvc-882b31fc-3745-4eab-820a-b2536b8a8c8a   500Gi      RWO            gp2            100d
persistentvolumeclaim/wal-volume-tf-ids-timescale-qatest-timescaledb-0       Bound    pvc-0317065d-6299-4ba3-836e-0e2d96c8aa02   190Gi      RWO            gp2            100d
persistentvolumeclaim/wal-volume-tf-ids-timescale-qatest-timescaledb-1       Bound    pvc-2e472cff-88a3-4305-a6c4-34612b8e5115   190Gi      RWO            gp2            100d
persistentvolumeclaim/wal-volume-tf-ids-timescale-qatest-timescaledb-2       Bound    pvc-234618c1-be9e-497c-9be1-c02ef2924e21   190Gi      RWO            gp2            100d

Logs: Please provide some details from the logs (see Troubleshooting).

After rebuilding the environment with bootstrapFromBackup enabled, these are the errors I receive in the pod logs:
2021-01-21 21:09:30 UTC [150]: [6009ed8a.96-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=[local]
2021-01-21 21:09:30 UTC [150]: [6009ed8a.96-2] postgres@postgres,app=[unknown] [00000] LOG:  connection authorized: user=postgres database=postgres
2021-01-21 21:09:31 UTC [150]: [6009ed8a.96-3] postgres@postgres,app=pgBackRest [stanza-create] [00000] LOG:  disconnection: session time: 0:00:00.503 user=postgres database=postgres host=[local]
ERROR: [044]: PostgreSQL version 11, system-id 6920319053559844912 do not match stanza version 11, system-id 6883169372254236720
       HINT: are you archiving to the correct stanza?
2021-01-21 21:09:40 UTC [95]: [6009ed55.5f-8] @,app= [00000] LOG:  archive command failed with exit code 44
2021-01-21 21:09:40 UTC [95]: [6009ed55.5f-9] @,app= [00000] DETAIL:  The failed archive command was: /etc/timescaledb/scripts/pgbackrest_archive.sh pg_wal/00000002.history
ERROR: [044]: PostgreSQL version 11, system-id 6920319053559844912 do not match stanza version 11, system-id 6883169372254236720
       HINT: are you archiving to the correct stanza?
2021-01-21 21:09:41 UTC [95]: [6009ed55.5f-10] @,app= [00000] LOG:  archive command failed with exit code 44
2021-01-21 21:09:41 UTC [95]: [6009ed55.5f-11] @,app= [00000] DETAIL:  The failed archive command was: /etc/timescaledb/scripts/pgbackrest_archive.sh pg_wal/00000002.history
ERROR: [044]: PostgreSQL version 11, system-id 6920319053559844912 do not match stanza version 11, system-id 6883169372254236720
       HINT: are you archiving to the correct stanza?
2021-01-21 21:09:42 UTC [95]: [6009ed55.5f-12] @,app= [00000] LOG:  archive command failed with exit code 44
2021-01-21 21:09:42 UTC [95]: [6009ed55.5f-13] @,app= [00000] DETAIL:  The failed archive command was: /etc/timescaledb/scripts/pgbackrest_archive.sh pg_wal/00000002.history
2021-01-21 21:09:42 UTC [95]: [6009ed55.5f-14] @,app= [01000] WARNING:  archiving write-ahead log file "00000002.history" failed too many times, will try again later
2021-01-21 21:09:46 UTC [172]: [6009ed9b.ac-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=[local]
2021-01-21 21:09:46 UTC [172]: [6009ed9b.ac-2] postgres@postgres,app=[unknown] [00000] LOG:  connection authorized: user=postgres database=postgres
2021-01-21 21:09:46 UTC [172]: [6009ed9b.ac-3] postgres@postgres,app=pg_isready [00000] LOG:  disconnection: session time: 0:00:00.001 user=postgres database=postgres host=[lo_

ebrodie avatar Jan 21 '21 21:01 ebrodie

Could you provide your values.yaml inside code tags? The above is very hard to read.

```yaml
your values:
   should look: like this
```

https://www.markdownguide.org/extended-syntax/#fenced-code-blocks

feikesteenbergen avatar Jan 23 '21 10:01 feikesteenbergen

added values.yaml into code tags.

ebrodie avatar Jan 25 '21 16:01 ebrodie

The logs seem to indicate that a new initdb has been performed:

version 11, system-id 6920319053559844912 do not match stanza version 11, system-id 6883169372254236720

That means:

  • the location of the backup already contains an older, previously initialized database cluster
  • the s3 bucket can be read from the newly spawned pods
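
One way to confirm which cluster the data directory now belongs to (a rough sketch, assuming exec access to the pod and the PostgreSQL binaries on PATH; paths, stanza name and container name are taken from the logs above):

```bash
# system identifier of the freshly initialized local cluster
kubectl exec tf-ids-timescale-qatest-timescaledb-0 -c timescaledb -- \
  pg_controldata /var/lib/postgresql/data | grep -i 'system identifier'

# backups recorded in the S3 repository (use --output=json to see the stored system-id)
kubectl exec tf-ids-timescale-qatest-timescaledb-0 -c timescaledb -- \
  pgbackrest --config=/etc/pgbackrest/pgbackrest.conf --stanza=poddb info
```

If the two identifiers differ, the data directory was initialized from scratch instead of being restored.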

To figure out why the restore wasn't done, we need to find the first lines of the first incarnation of the timescaledb container of the Pod.

Do you still have access to the first log lines of the first incarnation of the Pod? If not, could you retry and collect those logs?
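
For reference, a sketch of how to pull those lines (pod and container names are taken from the kubectl output earlier in this thread; --previous only helps while the earlier incarnation's logs are still retained by the kubelet):

```bash
# first lines of the current incarnation of the timescaledb container
kubectl logs pod/tf-ids-timescale-qatest-timescaledb-0 -c timescaledb | head -n 50

# first lines of the previous incarnation, if the container has restarted
kubectl logs pod/tf-ids-timescale-qatest-timescaledb-0 -c timescaledb --previous | head -n 50
```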

feikesteenbergen avatar Jan 26 '21 11:01 feikesteenbergen

I retried, reproducing the issue:

First incarnation log lines:

2021-01-29 00:55:15 - restore_or_initdb - Invoking initdb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "C.UTF-8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data ... ok
fixing permissions on existing directory /var/lib/postgresql/wal/pg_wal ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Etc/UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    pg_ctl -D /var/lib/postgresql/data -l logfile start

2021-01-29 00:55:16 UTC [58]: [60135cf4.3a-1] @,app= [00000] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-01-29 00:55:16 UTC [58]: [60135cf4.3a-2] @,app= [00000] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-01-29 00:55:16 UTC [60]: [60135cf4.3c-1] @,app= [00000] LOG:  database system was shut down at 2021-01-29 00:55:16 UTC
2021-01-29 00:55:16 UTC [61]: [60135cf4.3d-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=[local]

New incarnation I'd like restored:

2021-01-29 00:55:15 - restore_or_initdb - Invoking initdb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "C.UTF-8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data ... ok
fixing permissions on existing directory /var/lib/postgresql/wal/pg_wal ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Etc/UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    pg_ctl -D /var/lib/postgresql/data -l logfile start

2021-01-29 00:55:16 UTC [58]: [60135cf4.3a-1] @,app= [00000] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-01-29 00:55:16 UTC [58]: [60135cf4.3a-2] @,app= [00000] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-01-29 00:55:16 UTC [60]: [60135cf4.3c-1] @,app= [00000] LOG:  database system was shut down at 2021-01-29 00:55:16 UTC
2021-01-29 00:55:16 UTC [61]: [60135cf4.3d-1] [unknown]@[unknown],app=[unknown] [00000] LOG:  connection received: host=[local]
2021-01-29 00:55:16 UTC [61]: [60135cf4.3d-2] postgres@postgres,app=[unknown] [57P03] FATAL:  the database system is starting up

pgBackRest log of the 2nd incarnation:

ERROR: [028]: backup and archive info files exist but do not match the database
       HINT: is this the correct stanza?
       HINT: did an error occur during stanza-upgrade?
2021-01-29 03:31:57 - bootstrap - Creating pgBackrest stanza
INFO: stanza-create command begin 2.27: --config=/etc/pgbackrest/pgbackrest.conf --log-level-stderr=info --pg1-path=/var/lib/postgresql/data --pg1-port=5432 --pg1-socket-path=/var/run/postgresql --repo1-cipher-type=none --repo1-path=/default/tf-ids-timescale-qatest --repo1-s3-bucket=timescale-db-backup-qatest --repo1-s3-endpoint=s3.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-west-2 --repo1-type=s3 --stanza=poddb
ERROR: [028]: backup and archive info files exist but do not match the database
       HINT: is this the correct stanza?
       HINT: did an error occur during stanza-upgrade?
INFO: http statistics: objects 1, sessions 1, requests 6, retries 0, closes 0
INFO: stanza-create command end: aborted with exception [028]

ebrodie avatar Jan 29 '21 03:01 ebrodie

@feikesteenbergen We can't get this to work either! We have been running for quite some time with version 0.5.4 of the chart in a namespace called cluster-svc, with backups to S3 enabled. Now, we want to test restoring the database to a new deployment in a different namespace, so we updated to version 0.7.0 of the chart, with the following values.yaml:

bootstrapFromBackup:
    enabled: true
    repo1-path: /cluster-svc/timescaledb-single
    secretName: timescaledb-single-pgbackrest

According to the documentation, this would perform a restore (=clone) of the database to the new deployment. However, the pod's log shows:

/bin/bash: line 24: unset: `[global:archive-get]': not a valid identifier
2021-02-11 14:01:49.207 P00   INFO: restore command begin 2.31: --config=/etc/pgbackrest/pgbackrest.conf --delta --exec-id=23-263a932d --force --link-all --log-level-console=detail --pg1-path=/data/postgre/data --process-max=4 --repo1-cipher-type=none --repo1-path=/cluster-svc-test/timescaledb-test --repo1-s3-bucket=k8s-dev-database-backups --repo1-s3-endpoint=s3.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=eu-north-1 --repo1-type=s3 --stanza=poddb
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/data/postgre/data' to confirm that this is a valid $PGDATA directory.  --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
ERROR: [055]: unable to load info file '/cluster-svc-test/timescaledb-test/backup/poddb/backup.info' or '/cluster-svc-test/timescaledb-test/backup/poddb/backup.info.copy':
       FileMissingError: unable to open '/cluster-svc-test/timescaledb-test/backup/poddb/backup.info': No such file or directory
       FileMissingError: unable to open '/cluster-svc-test/timescaledb-test/backup/poddb/backup.info.copy': No such file or directory
       HINT: backup.info cannot be opened and is required to perform a backup.
       HINT: has a stanza-create been performed?

So, apparently, you cannot restore a database to a different deployment? The error above seems to indicate that a backup stanza must already have been created for the new deployment.
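
A quick way to check that assumption (a hedged sketch; it assumes the AWS CLI and read access to the bucket, with bucket and paths taken from the log above):

```bash
# stanza of the old deployment, which the restore is supposed to come from
aws s3 ls s3://k8s-dev-database-backups/cluster-svc/timescaledb-single/backup/poddb/

# path the restore command actually used, per the error message
aws s3 ls s3://k8s-dev-database-backups/cluster-svc-test/timescaledb-test/backup/poddb/
```

If backup.info only exists under the first path, the restore is simply being pointed at the wrong repo1-path.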

To try to get around this problem, we created a new deployment without bootstrap from backup enabled, just to make sure that the required stanza got created and the S3 folders existed. Then we deleted the deployment and the persistent volumes, and performed a new deployment with bootstrap from backup enabled, and then a restore was performed, as indicated by the pod's logs:

2021-02-11 17:47:29.775 P00   INFO: restore command begin 2.31: --config=/etc/pgbackrest/pgbackrest.conf --delta --exec-id=23-b0ee05ed --force --link-all --log-level-console=detail --pg1-path=/data/postgre/data --process-max=4 --repo1-cipher-type=none --repo1-path=/cluster-svc-test/timescaledb-test --repo1-s3-bucket=k8s-dev-database-backups --repo1-s3-endpoint=s3.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=eu-north-1 --repo1-type=s3 --stanza=poddb
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/data/postgre/data' to confirm that this is a valid $PGDATA directory.  --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
2021-02-11 17:47:29.889 P00   INFO: restore backup set 20210211-173153F

However, the problem is that what was restored was not the /cluster-svc/timescaledb-single backup, but the backup from the new deployment /cluster-svc-test/timescaledb-test that was created just to be able to perform a restore. It seems as if this value is coming from the pgBackRest config map. What do we need to change to get the restore to work?
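
A diagnostic sketch for that suspicion (configmap name pattern taken from the kubectl listing earlier in this thread; release name and namespace are placeholders):

```bash
kubectl get configmap <release>-timescaledb-pgbackrest -n <namespace> -o yaml \
  | grep -E 'repo1-path|repo1-s3-bucket'
```

If the repo1-path rendered there is the new deployment's path rather than the one given under bootstrapFromBackup, that would explain why the wrong backup set is restored.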

landerss1 avatar Feb 11 '21 18:02 landerss1

I've bumped into the same issue

4nte avatar Dec 09 '21 13:12 4nte