bootstrapFromBackup not working
- I created a timescaledb-single cluster with backups enabled, it ran fine for several months.
- I destroyed the stack completely using tf destroy commands but kept the S3 backups in place.
- I created a new timescaledb-single cluster with the same name/settings, with bootstrapFromBackup enabled.
- But the new environment is not being restored from the S3 backups.
Any ideas?
Here's the clause I've used in my values.yaml file:
```yaml
bootstrapFromBackup:
  enabled: true
  repo1-path: /old_backups/default/tf-ids-timescale-preprd
```
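A sketch of how the presence of the old backups under that prefix can be double-checked (the bucket name is a placeholder here, and `poddb` is the chart's default stanza name):

```bash
# List the backup sets pgBackRest wrote under the old repo1-path
# (replace <bucket> with the S3 bucket used by the original stack).
aws s3 ls s3://<bucket>/old_backups/default/tf-ids-timescale-preprd/backup/poddb/
```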
Could you share some more information, notably the timescaledb container logs from the first time it is started?
Also, it would be very useful if you could provide the relevant information that is asked for in bug reports here: https://github.com/timescale/timescaledb-kubernetes/blob/master/.github/ISSUE_TEMPLATE/bug_report.md
Bug/Issue Description: When attempting to use the "bootstrapFromBackup" clause to restore a new environment from its previous incarnation, no restore is performed. In my example, I have an EKS qatest environment with backups to S3. I destroyed the stack completely but kept the S3 bucket intact. I added the bootstrapFromBackup clause to the values.yaml file and rebuilt the same stack from scratch. After the rebuild, the previous backups were not restored; instead, the stack just doesn't come up and produces errors like "ERROR: [044]: PostgreSQL version 11, system-id 6920319053559844912 do not match stanza version 11, system-id 6883169372254236720"
Steps to reproduce the behavior, including helm install or helm upgrade commands (a sketch of the equivalent helm commands follows the list):
- Create a stack with backups to S3 enabled
- Take backups to S3
- Destroy the stack (including PVC volumes and node group)
- Recreate the stack with the same parameters, except adding the "bootstrapFromBackup" clause in values.yaml
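Since the stack is actually managed via Terraform, the following is only a hedged sketch of the equivalent helm-level commands (release name, chart reference and label selector are assumptions taken from the rest of this report):

```bash
# 1st stack, with backups to S3 enabled
helm install tf-ids-timescale-qatest timescale/timescaledb-single -f values.yaml

# Destroy the stack, including the PVC volumes
helm uninstall tf-ids-timescale-qatest
kubectl delete pvc -l release=tf-ids-timescale-qatest

# Recreate the stack with the same values plus the bootstrapFromBackup clause
helm install tf-ids-timescale-qatest timescale/timescaledb-single -f values.yaml
```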
Expected behavior: the db should be restored from the backups in S3 taken by the previous stack, but no restore is performed.
values.yaml for the 1st stack before destroying it:
```yaml
replicaCount: 3
replicaLoadBalancer:
  enabled: True
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "4000"
loadBalancer:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "4000"
patroni:
  bootstrap:
    method: restore_or_initdb
    restore_or_initdb:
      command: >
        /etc/timescaledb/scripts/restore_or_initdb.sh
        --encoding=UTF8
        --locale=C.UTF-8
        --wal-segsize=256
    dcs:
      synchronous_mode: true
      master_start_timeout: 0
      postgresql:
        use_slots: false
        parameters:
          archive_timeout: 1h
          checkpoint_timeout: 600s
          temp_file_limit: '200GB'
          synchronous_commit: remote_apply
          max_connections: '${max_connections}'
          wal_keep_segments: 200
persistentVolumes:
  data:
    size: '${ebs_vol_size}'
  wal:
    enabled: True
    size: '${wal_vol_size}'
timescaledbTune:
  enabled: true
backup:
  enabled: true
  pgBackRest:archive-push:
    process-max: 4
    archive-async: "y"
  pgBackRest:archive-get:
    process-max: 4
    archive-async: "y"
    archive-get-queue-max: 2GB
  jobs:
    - name: fl-dly
      type: full
      schedule: "18 0 * * *"
    - name: inc-hly
      type: incr
      schedule: "18 1-23 * * *"
```
values.yaml for the 2nd stack I expected to be restored from the S3 bucket:
```yaml
replicaCount: 3
replicaLoadBalancer:
  enabled: True
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "4000"
loadBalancer:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "4000"
patroni:
  bootstrap:
    method: restore_or_initdb
    restore_or_initdb:
      command: >
        /etc/timescaledb/scripts/restore_or_initdb.sh
        --encoding=UTF8
        --locale=C.UTF-8
        --wal-segsize=256
    dcs:
      synchronous_mode: true
      master_start_timeout: 0
      postgresql:
        use_slots: false
        parameters:
          archive_timeout: 1h
          checkpoint_timeout: 600s
          temp_file_limit: '200GB'
          synchronous_commit: remote_apply
          max_connections: '${max_connections}'
          wal_keep_segments: 200
persistentVolumes:
  data:
    size: '${ebs_vol_size}'
  wal:
    enabled: True
    size: '${wal_vol_size}'
timescaledbTune:
  enabled: true
backup:
  enabled: true
  pgBackRest:archive-push:
    process-max: 4
    archive-async: "y"
  pgBackRest:archive-get:
    process-max: 4
    archive-async: "y"
    archive-get-queue-max: 2GB
  jobs:
    - name: fl-dly
      type: full
      schedule: "18 0 * * *"
    - name: inc-hly
      type: incr
      schedule: "18 1-23 * * *"
bootstrapFromBackup:
  enabled: true
  repo1-path: /old_backups/default/tf-ids-timescale-qatest
```
What version of the Chart are you using?
```
$ helm show chart timescaledb-single
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /Users/ebrodie/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /Users/ebrodie/.kube/config
apiVersion: v1
description: TimescaleDB HA Deployment.
home: https://github.com/timescale/timescaledb-kubernetes
maintainers:
- email: [email protected]
  name: TimescaleDB
name: timescaledb-single
sources:
- https://github.com/timescale/timescaledb-kubernetes
- https://github.com/timescale/timescaledb-docker-ha
- https://github.com/zalando/patroni
version: 0.7.1
```
What is your Kubernetes Environment (for example: GKE, EKS, minikube, microk8s):
EKS
Deployment (details of what is in the Kubernetes environment):
$ kubectl get all,secret,configmap,endpoints,pvc -L role -l release=tf-ids-timescale-qatest
NAME READY STATUS RESTARTS AGE ROLE
pod/tf-ids-timescale-qatest-timescaledb-0 2/2 Running 0 21h master
pod/tf-ids-timescale-qatest-timescaledb-1 2/2 Running 0 21h replica
pod/tf-ids-timescale-qatest-timescaledb-2 2/2 Running 0 21h replica
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE ROLE
service/tf-ids-timescale-qatest LoadBalancer 172.20.177.253 internal-a9c244bd138094737b6271b30c3910c4-1702785447.us-west-2.elb.amazonaws.com 5432:31095/TCP 21h master
service/tf-ids-timescale-qatest-config ClusterIP None <none> 8008/TCP 21h
service/tf-ids-timescale-qatest-timescaledb-backup ClusterIP 172.20.230.145 <none> 8081/TCP 21h
NAME READY AGE ROLE
statefulset.apps/tf-ids-timescale-qatest-timescaledb 3/3 21h
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE ROLE
cronjob.batch/tf-ids-timescale-qatest-timescaledb-fl-dly 18 0 * * * False 0 19h 21h
cronjob.batch/tf-ids-timescale-qatest-timescaledb-inc-hly 18 1-23 * * * False 0 7m37s 21h
NAME DATA AGE ROLE
configmap/tf-ids-timescale-qatest-timescaledb-patroni 1 21h
configmap/tf-ids-timescale-qatest-timescaledb-pgbackrest 1 21h
configmap/tf-ids-timescale-qatest-timescaledb-scripts 8 21h
NAME ENDPOINTS AGE ROLE
endpoints/tf-ids-timescale-qatest 10.167.36.194:5432 21h
endpoints/tf-ids-timescale-qatest-config 10.167.35.230:8008,10.167.36.194:8008,10.167.37.110:8008 21h
endpoints/tf-ids-timescale-qatest-sync <none> 38d
endpoints/tf-ids-timescale-qatest-timescaledb-backup 10.167.36.194:8081 21h
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE ROLE
persistentvolumeclaim/storage-volume-tf-ids-timescale-qatest-timescaledb-0 Bound pvc-3be16491-bfcb-4039-ac1f-c302464886f9 500Gi RWO gp2 100d
persistentvolumeclaim/storage-volume-tf-ids-timescale-qatest-timescaledb-1 Bound pvc-9a4a0ffa-fa46-4227-934e-d804e6d4ac2d 500Gi RWO gp2 100d
persistentvolumeclaim/storage-volume-tf-ids-timescale-qatest-timescaledb-2 Bound pvc-882b31fc-3745-4eab-820a-b2536b8a8c8a 500Gi RWO gp2 100d
persistentvolumeclaim/wal-volume-tf-ids-timescale-qatest-timescaledb-0 Bound pvc-0317065d-6299-4ba3-836e-0e2d96c8aa02 190Gi RWO gp2 100d
persistentvolumeclaim/wal-volume-tf-ids-timescale-qatest-timescaledb-1 Bound pvc-2e472cff-88a3-4305-a6c4-34612b8e5115 190Gi RWO gp2 100d
persistentvolumeclaim/wal-volume-tf-ids-timescale-qatest-timescaledb-2 Bound pvc-234618c1-be9e-497c-9be1-c02ef2924e21 190Gi RWO gp2 100d
Logs:
After rebuilding the environment with bootstrapFromBackup enabled, these are the errors I receive in the pod logs:
2021-01-21 21:09:30 UTC [150]: [6009ed8a.96-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=[local]
2021-01-21 21:09:30 UTC [150]: [6009ed8a.96-2] postgres@postgres,app=[unknown] [00000] LOG: connection authorized: user=postgres database=postgres
2021-01-21 21:09:31 UTC [150]: [6009ed8a.96-3] postgres@postgres,app=pgBackRest [stanza-create] [00000] LOG: disconnection: session time: 0:00:00.503 user=postgres database=postgres host=[local]
ERROR: [044]: PostgreSQL version 11, system-id 6920319053559844912 do not match stanza version 11, system-id 6883169372254236720
HINT: are you archiving to the correct stanza?
2021-01-21 21:09:40 UTC [95]: [6009ed55.5f-8] @,app= [00000] LOG: archive command failed with exit code 44
2021-01-21 21:09:40 UTC [95]: [6009ed55.5f-9] @,app= [00000] DETAIL: The failed archive command was: /etc/timescaledb/scripts/pgbackrest_archive.sh pg_wal/00000002.history
ERROR: [044]: PostgreSQL version 11, system-id 6920319053559844912 do not match stanza version 11, system-id 6883169372254236720
HINT: are you archiving to the correct stanza?
2021-01-21 21:09:41 UTC [95]: [6009ed55.5f-10] @,app= [00000] LOG: archive command failed with exit code 44
2021-01-21 21:09:41 UTC [95]: [6009ed55.5f-11] @,app= [00000] DETAIL: The failed archive command was: /etc/timescaledb/scripts/pgbackrest_archive.sh pg_wal/00000002.history
ERROR: [044]: PostgreSQL version 11, system-id 6920319053559844912 do not match stanza version 11, system-id 6883169372254236720
HINT: are you archiving to the correct stanza?
2021-01-21 21:09:42 UTC [95]: [6009ed55.5f-12] @,app= [00000] LOG: archive command failed with exit code 44
2021-01-21 21:09:42 UTC [95]: [6009ed55.5f-13] @,app= [00000] DETAIL: The failed archive command was: /etc/timescaledb/scripts/pgbackrest_archive.sh pg_wal/00000002.history
2021-01-21 21:09:42 UTC [95]: [6009ed55.5f-14] @,app= [01000] WARNING: archiving write-ahead log file "00000002.history" failed too many times, will try again later
2021-01-21 21:09:46 UTC [172]: [6009ed9b.ac-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=[local]
2021-01-21 21:09:46 UTC [172]: [6009ed9b.ac-2] postgres@postgres,app=[unknown] [00000] LOG: connection authorized: user=postgres database=postgres
2021-01-21 21:09:46 UTC [172]: [6009ed9b.ac-3] postgres@postgres,app=pg_isready [00000] LOG: disconnection: session time: 0:00:00.001 user=postgres database=postgres host=[local]
Could you provide your values.yaml inside code tags? The above is very hard to read.
```yaml
your values:
should look: like this
```
https://www.markdownguide.org/extended-syntax/#fenced-code-blocks
Added values.yaml into code tags.
The logs seem to indicate that a new initdb has been performed:
version 11, system-id 6920319053559844912 do not match stanza version 11, system-id 6883169372254236720
That means:
- the location of the backup already contains an older, previously initialized database cluster
- the s3 bucket can be read from the newly spawned pods
To figure out why the restore wasn't done, we need to find the first lines of the first incarnation of the timescaledb container of the Pod.
Do you still have access to the first log lines of the first incarnation of the Pod? If not, could you retry and collect those logs?
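Independent of those logs, here is a hedged sketch of how the two system-ids from that error can be cross-checked (the container name `timescaledb`, the stanza name `poddb` and the pod name are taken from the chart defaults and the kubectl output above):

```bash
# System identifier recorded in the pgBackRest repository (stanza side)
kubectl exec -ti tf-ids-timescale-qatest-timescaledb-0 -c timescaledb -- \
  pgbackrest --stanza=poddb info

# System identifier of the currently running cluster (database side)
kubectl exec -ti tf-ids-timescale-qatest-timescaledb-0 -c timescaledb -- \
  psql -U postgres -Atc "SELECT system_identifier FROM pg_control_system()"
```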
I retried, reproducing the issue:
First incarnation log lines:
2021-01-29 00:55:15 - restore_or_initdb - Invoking initdb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "C.UTF-8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/postgresql/data ... ok
fixing permissions on existing directory /var/lib/postgresql/wal/pg_wal ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Etc/UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
Success. You can now start the database server using:
pg_ctl -D /var/lib/postgresql/data -l logfile start
2021-01-29 00:55:16 UTC [58]: [60135cf4.3a-1] @,app= [00000] LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-01-29 00:55:16 UTC [58]: [60135cf4.3a-2] @,app= [00000] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-01-29 00:55:16 UTC [60]: [60135cf4.3c-1] @,app= [00000] LOG: database system was shut down at 2021-01-29 00:55:16 UTC
2021-01-29 00:55:16 UTC [61]: [60135cf4.3d-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=[local]
New incarnation, which I'd like to be restored from the backup:
2021-01-29 00:55:15 - restore_or_initdb - Invoking initdb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "C.UTF-8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/postgresql/data ... ok
fixing permissions on existing directory /var/lib/postgresql/wal/pg_wal ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Etc/UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
Success. You can now start the database server using:
pg_ctl -D /var/lib/postgresql/data -l logfile start
2021-01-29 00:55:16 UTC [58]: [60135cf4.3a-1] @,app= [00000] LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-01-29 00:55:16 UTC [58]: [60135cf4.3a-2] @,app= [00000] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-01-29 00:55:16 UTC [60]: [60135cf4.3c-1] @,app= [00000] LOG: database system was shut down at 2021-01-29 00:55:16 UTC
2021-01-29 00:55:16 UTC [61]: [60135cf4.3d-1] [unknown]@[unknown],app=[unknown] [00000] LOG: connection received: host=[local]
2021-01-29 00:55:16 UTC [61]: [60135cf4.3d-2] postgres@postgres,app=[unknown] [57P03] FATAL: the database system is starting up
pgBackRest log of the 2nd incarnation:
ERROR: [028]: backup and archive info files exist but do not match the database
HINT: is this the correct stanza?
HINT: did an error occur during stanza-upgrade?
2021-01-29 03:31:57 - bootstrap - Creating pgBackrest stanza
INFO: stanza-create command begin 2.27: --config=/etc/pgbackrest/pgbackrest.conf --log-level-stderr=info --pg1-path=/var/lib/postgresql/data --pg1-port=5432 --pg1-socket-path=/var/run/postgresql --repo1-cipher-type=none --repo1-path=/default/tf-ids-timescale-qatest --repo1-s3-bucket=timescale-db-backup-qatest --repo1-s3-endpoint=s3.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-west-2 --repo1-type=s3 --stanza=poddb
ERROR: [028]: backup and archive info files exist but do not match the database
HINT: is this the correct stanza?
HINT: did an error occur during stanza-upgrade?
INFO: http statistics: objects 1, sessions 1, requests 6, retries 0, closes 0
INFO: stanza-create command end: aborted with exception [028]
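For comparison, the stanza-create above runs against repo1-path /default/tf-ids-timescale-qatest, while the bootstrapFromBackup clause points at /old_backups/default/tf-ids-timescale-qatest. A sketch of listing both prefixes (bucket name taken from the command above; this is only an illustrative check):

```bash
# Prefix the new stack is creating its stanza in / archiving to
aws s3 ls s3://timescale-db-backup-qatest/default/tf-ids-timescale-qatest/backup/poddb/

# Prefix holding the old backups that bootstrapFromBackup points at
aws s3 ls s3://timescale-db-backup-qatest/old_backups/default/tf-ids-timescale-qatest/backup/poddb/
```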
@feikesteenbergen We can't get this to work either! We have been running for quite some time with version 0.5.4 of the chart in a namespace called cluster-svc, with backups to S3 enabled. Now, we want to test restoring the database to a new deployment in a different namespace, so we updated to version 0.7.0 of the chart, with the following values.yaml:
```yaml
bootstrapFromBackup:
  enabled: true
  repo1-path: /cluster-svc/timescaledb-single
  secretName: timescaledb-single-pgbackrest
```
According to the documentation, this would perform a restore (i.e. a clone) of the database to the new deployment. However, the pod logs show:
/bin/bash: line 24: unset: `[global:archive-get]': not a valid identifier
2021-02-11 14:01:49.207 P00 INFO: restore command begin 2.31: --config=/etc/pgbackrest/pgbackrest.conf --delta --exec-id=23-263a932d --force --link-all --log-level-console=detail --pg1-path=/data/postgre/data --process-max=4 --repo1-cipher-type=none --repo1-path=/cluster-svc-test/timescaledb-test --repo1-s3-bucket=k8s-dev-database-backups --repo1-s3-endpoint=s3.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=eu-north-1 --repo1-type=s3 --stanza=poddb
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/data/postgre/data' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
ERROR: [055]: unable to load info file '/cluster-svc-test/timescaledb-test/backup/poddb/backup.info' or '/cluster-svc-test/timescaledb-test/backup/poddb/backup.info.copy':
FileMissingError: unable to open '/cluster-svc-test/timescaledb-test/backup/poddb/backup.info': No such file or directory
FileMissingError: unable to open '/cluster-svc-test/timescaledb-test/backup/poddb/backup.info.copy': No such file or directory
HINT: backup.info cannot be opened and is required to perform a backup.
HINT: has a stanza-create been performed?
So, apparently, you cannot restore a database to a different deployment, as the error above indicates that there must be a previously created backup stanza for the new deployment?
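A quick way to see which prefix actually contains a backup.info (bucket name and paths taken from the log output above; an illustrative check, not something the chart runs) is to list the bucket directly:

```bash
# The repo1-path the restore command actually used (the new deployment's own stanza)
aws s3 ls s3://k8s-dev-database-backups/cluster-svc-test/timescaledb-test/backup/poddb/

# The repo1-path we expected it to restore from (the old deployment's stanza)
aws s3 ls s3://k8s-dev-database-backups/cluster-svc/timescaledb-single/backup/poddb/
```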
To try to get around this problem, we created a new deployment without bootstrap from backup enabled, just to make sure that the required stanza got created and the S3 folders existed. Then we deleted the deployment and the persistent volumes, and performed a new deployment with bootstrap from backup enabled. This time a restore was performed, as indicated by the pod logs:
2021-02-11 17:47:29.775 P00 INFO: restore command begin 2.31: --config=/etc/pgbackrest/pgbackrest.conf --delta --exec-id=23-b0ee05ed --force --link-all --log-level-console=detail --pg1-path=/data/postgre/data --process-max=4 --repo1-cipher-type=none --repo1-path=/cluster-svc-test/timescaledb-test --repo1-s3-bucket=k8s-dev-database-backups --repo1-s3-endpoint=s3.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=eu-north-1 --repo1-type=s3 --stanza=poddb
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/data/postgre/data' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
2021-02-11 17:47:29.889 P00 INFO: restore backup set 20210211-173153F
However, the problem is that what was restored was not the /cluster-svc/timescaledb-single backup, but the backup from the new deployment /cluster-svc-test/timescaledb-test that was created just to be able to perform a restore at all. It seems as if this value is coming from the pgBackRest config map. What do we need to change to get the restore to work?
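One way to confirm where that value comes from is to dump the pgBackRest configmap the chart renders for the pods (the name below follows the naming pattern from the earlier kubectl output; adjust release and namespace accordingly):

```bash
# Show the pgBackRest configuration rendered by the chart and look for the repo1-path entry
kubectl -n <namespace> get configmap <release>-timescaledb-pgbackrest -o yaml
```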
I've bumped into the same issue.