
Backups fail due to stanza creation error in 5.0.0

Open flotothemoon opened this issue 3 years ago • 6 comments

Overview

Setting up backups in my cluster fails with "unable to create stanza" errors.

Environment

  • Platform: Kubernetes
  • Platform Version: 1.19
  • PGO Image Tag: ubi8-5.0.0-0
  • Postgres Version: 13
  • Storage: gp2

Steps to Reproduce

REPRO

Follow the examples in the tutorial to set up the operator and a cluster on Kubernetes.

EXPECTED

  1. Backups and backup repositories are configured correctly and backups run successfully

ACTUAL

  1. Backups fail due to an "unable to create stanza" error (see below for details).

Logs

The exact error I get in the operator logs is:

time="2021-07-21T19:14:18Z" level=error msg="unable to create stanza" error="command terminated with exit code 47: ERROR: [047]: unable to create path '/pgbackrest/default': 
[30] Read-only file system\n" file="internal/controller/postgrescluster/pgbackrest.go:2324" func="postgrescluster.(*Reconciler).reconcileStanzaCreate" name=main-db namespace=default reconciler=pgBackRest reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.0-0

My cluster configuration is almost exactly the one from the tutorial, but here it is for reference (as a Gist since it's an unsupported file format).

Additional Information

I had a slightly different configuration (with a repo1) at first, but removed it when this issue started appearing. Not sure if that's relevant, but backups don't work either way.

It's possible this is an issue not directly related to PGO, but I have no idea.

flotothemoon avatar Jul 21 '21 19:07 flotothemoon

Update: This issue occurs even with a fresh cluster given the above configuration:

time="2021-07-22T07:25:30Z" level=error msg="unable to create stanza" error="command terminated with exit code 47: ERROR: [047]: unable to create path '/pgbackrest/default': 
[30] Read-only file system\n" file="internal/controller/postgrescluster/pgbackrest.go:2324" func="postgrescluster.(*Reconciler).reconcileStanzaCreate" name=main-db namespace=default reconciler=pgBackRest reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.0-0

However, when I change my repo path from /pgbackrest/default/main-fh/repo2 (as specified in the examples) to /backrestrepo/default/main-fh/repo2 (i.e. with backrestrepo instead of pgbackrest as the base of the repository path), it seems to work.
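
For reference, the relevant bit of the spec after the change looks roughly like this (a trimmed sketch, not the full manifest; everything else follows the tutorial, and the field names are the ones from the v5 docs):

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: main-db
  namespace: default
spec:
  backups:
    pgbackrest:
      global:
        # base path changed from the default pgbackrest prefix to backrestrepo
        repo2-path: /backrestrepo/default/main-fh/repo2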

This seems weird to me, like I'm doing something else wrong. Any clues?

flotothemoon avatar Jul 22 '21 07:07 flotothemoon

Had you originally created a cluster with that original path for pgBackRest and had it succeed? Did the issue occur when you tried creating a new cluster with the same path?

jkatz avatar Jul 23 '21 15:07 jkatz

Had you originally created a cluster with that original path for pgBackRest and had it succeed? Did the issue occur when you tried creating a new cluster with the same path?

It worked once, if I recall correctly, but stopped working after some time. I then changed the repo paths and it worked for a while. I also tried creating a new, separate cluster with the original path, and it didn't work.

Now, after some time, I'm seeing issues with the adapted path (using backrestrepo) as well. These are the logs for the backup job on my repo1:

time="2021-07-24T10:11:35Z" level=info msg="crunchy-pgbackrest starts"
time="2021-07-24T10:11:35Z" level=info msg="debug flag set to false"
time="2021-07-24T10:11:35Z" level=info msg="backrest backup command requested"
time="2021-07-24T10:11:35Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=incr]"
time="2021-07-24T10:12:42Z" level=info msg="output=[]"
time="2021-07-24T10:12:42Z" level=info msg="stderr=[WARN: a timeline switch has occurred since the 20210723-010006F_20210723-120003I backup, enabling delta checksum\n      HINT: this is normal after restoring from backup or promoting a standby.\nWARN: resumable backup 20210723-010006F_20210724-020005I of same type exists -- remove invalid files and resume\nERROR: [082]: WAL segment 000000030000000100000048 was not archived before the 60000ms timeout\n       HINT: check the archive_command to ensure that all options are correct (especially --stanza).\n       HINT: check the PostgreSQL server log for errors.\n       HINT: run the 'start' command if the stanza was previously stopped.\n]"
time="2021-07-24T10:12:42Z" level=fatal msg="command terminated with exit code 82"

And these are the logs for the backup job on my repo2:

ERROR: [104]: archive-push command encountered error(s):
       repo2: [FileMissingError] unable to load info file '/backrestrepo/default/main-fh/repo2/archive/db/archive.info' or '/backrestrepo/default/main-fh/repo2/archive/db/archive.info.copy':
       FileMissingError: raised from remote-0 protocol on 'main-db-repo-host-0.main-db-pods.default.svc.cluster.local.': unable to open missing file '/backrestrepo/default/main-fh/repo2/archive/db/archive.info' for read
       FileMissingError: raised from remote-0 protocol on 'main-db-repo-host-0.main-db-pods.default.svc.cluster.local.': unable to open missing file '/backrestrepo/default/main-fh/repo2/archive/db/archive.info.copy' for read
       HINT: archive.info cannot be opened but is required to push/get WAL segments.
       HINT: is archive_command configured correctly in postgresql.conf?
       HINT: has a stanza-create been performed?
       HINT: use --no-archive-check to disable archive checks during backup if you have an alternate archiving scheme.
2021-07-24 00:00:10.063 UTC [143] LOG:  archive command failed with exit code 104

Curiously, now that I've removed paths from both repos (S3 and K8s volume), it seems to work again. Let's see how long that lasts. Any idea what happened so I can make sure this doesn't happen again?

flotothemoon avatar Jul 24 '21 10:07 flotothemoon

Curiously, now that I've removed paths from both repos (S3 and K8s volume), it seems to work again. Let's see how long that lasts. Any idea what happened so I can make sure this doesn't happen again?

If you're backing up to the same storage system and there is no reclaim policy that deletes the old files, ensure the files are deleted, or rename the path.

If you changed paths mid-flight, you may need to create a new stanza. That can be done with the pgbackrest stanza-create command.

jkatz avatar Aug 05 '21 23:08 jkatz

I'm seeing this issue even after I emptied the S3 bucket that I'm using as my repo. Where and how would I run the "pgbackrest stanza-create" command? Running it in the database cluster master doesn't work:

kubectl exec -n crunchy-db1 -it -c database \
  $(kubectl get pods -n crunchy-db1 \
    --selector='postgres-operator.crunchydata.com/cluster=crunchy-db1,postgres-operator.crunchydata.com/role=master' \
    -o name) -- pgbackrest stanza-create --stanza=db
ERROR: [047]: unable to create path '/var/lib/pgbackrest/archive': [30] Read-only file system
command terminated with exit code 47

edit: the repo host pod is probably the right place, but:

kubectl exec -ti -n crunchy-db1 crunchy-db1-repo-host-0 -- pgbackrest stanza-create --stanza=db
ERROR: [041]: unable to get info for path/file '/var/lib/pgbackrest/archive/db/archive.info': [13] Permission denied
command terminated with exit code 41

stenwt avatar Aug 17 '21 15:08 stenwt

@stenwt A virtual eternity has passed, but I found this thread while having the same issue. Just to let you know: the solution was in fact to start a shell in the primary DB pod and run pgbackrest stanza-create --stanza=db. I got no errors, and it created the missing path in my S3 bucket.
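
Something like this should do it (namespace and cluster name are placeholders; the label selector and container name are the same ones @stenwt used above):

# find the primary pod and run stanza-create in its database container
PRIMARY=$(kubectl get pods -n my-namespace \
  --selector='postgres-operator.crunchydata.com/cluster=my-cluster,postgres-operator.crunchydata.com/role=master' \
  -o name)
kubectl exec -n my-namespace -it -c database "$PRIMARY" -- \
  pgbackrest stanza-create --stanza=db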

After that, backups started working and, more importantly, it started clearing WAL files.

I have no idea why this happened. I had in fact deployed three identical clusters (testing, staging, and prod), and this only happened on staging. For some reason the initial backup job wouldn't start, and it gave this error:

ERROR: [055]: unable to load info file '/zorgo-staging/postgres-operator/cip/backup/db/backup.info' or '/zorgo-staging/postgres-operator/cip/backup/db/backup.info.copy':
       FileMissingError: unable to open missing file '/zorgo-staging/postgres-operator/cip/backup/db/backup.info' for read
       FileMissingError: unable to open missing file '/zorgo-staging/postgres-operator/cip/backup/db/backup.info.copy' for read
       HINT: backup.info cannot be opened and is required to perform a backup.
       HINT: has a stanza-create been performed?

So somewhere it just forgot to do stanza-create for some reason. The only cause I can imagine is that my S3 creds were formatted wrong when the clusters were first created, so they were unusable; they contained an @ sign by mistake.

I solved that easily in testing and prod by just removing the @ sign, but apparently that wasn't enough for staging. So maybe that explains why stanza-create was never run again on staging.
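
For context, the creds I'm talking about are the ones in the Secret referenced by spec.backups.pgbackrest.configuration; per the docs it's an s3.conf file, roughly like this (names and values here are made up):

apiVersion: v1
kind: Secret
metadata:
  name: pgo-s3-creds
type: Opaque
stringData:
  s3.conf: |
    [global]
    # a stray character (like an @) in either value makes the repo unusable
    repo1-s3-key=AKIAEXAMPLEKEY
    repo1-s3-key-secret=examplesecretkey123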

stemid avatar Feb 21 '22 11:02 stemid

I'm running into this same issue with PGO 5.2.1 and a current pg14 (or 12).

I get ERROR: [047]: unable to create path '/var/lib/pgbackrest/archive': [30] Read-only file system both from the Kubernetes events emitted by PGO and when I try to run @stemid 's command from the primary.

Since that directory is read-only in other, working clusters, I'm forced to conclude that I'm missing some config or env that would... probably point it over to /backrestrepo/? (Or maybe /pgdata/pgbackrest.)

jaredkipe avatar Dec 15 '22 20:12 jaredkipe

I found the issue: my pgo-s3-creds config had two 'repos' in it...

[global]
repo1-something: something
repo2-something: something

This was because I was restoring from S3: the old DB was configured with S3 as repo2, and my dataSource used it...

Anyway, long story short: I didn't have any config for repo2 in the cluster definition, and this must either abort the config generation itself, or else the config it creates isn't picked up (or even commented on) by pgbackrest.

I bet a lot of the other issues I'm seeing are also from similarly misconfigured pieces... (I mean, someone could edit this secret and just make backups stop dead.)
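
In other words, the repos referenced by the secret should line up with the repos actually defined in the cluster spec, roughly like this (a sketch with placeholder names; field layout per my reading of the v5 docs):

spec:
  backups:
    pgbackrest:
      configuration:
        - secret:
            name: pgo-s3-creds   # contains repo2-s3-key / repo2-s3-key-secret
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 1Gi
        - name: repo2            # must be defined here if the secret configures repo2
          s3:
            bucket: my-bucket
            endpoint: s3.amazonaws.com
            region: us-east-1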

jaredkipe avatar Dec 15 '22 23:12 jaredkipe

Seems to me like most of the issues here have to do with S3 rather than the operator, so I'm going to close this issue, but feel free to reopen or open a new issue if necessary.

benjaminjb avatar Jan 03 '23 16:01 benjaminjb

(Though that said, PGO v5.0.x was at an inflection point around the time of the original post with some changes to the spec.backups.pgbackrest object, which makes it a little harder to do archeology on this particular problem. So in case it was not an issue with the S3 credentials, for the OP, I would be curious to see if this problem persisted in other, later versions of PGO.)

benjaminjb avatar Jan 03 '23 17:01 benjaminjb