postgres-operator-examples
data directory /pgdata/pg12 removed after a crash
Everything was fine until my Postgres cluster crashed and all data was lost. According to the log, Postgres tried to restore a backup from the S3 bucket and could not find the backup.info or backup.info.copy file.
Here is the log:
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg12' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
WARN: repo1: [FileMissingError] unable to load info file '/pgbackrest/prj-golden/pg-golden/repo1/backup/db/backup.info' or '/pgbackrest/prj-golden/pg-golden/repo1/backup/db/backup.info.copy':
FileMissingError: unable to open missing file '/pgbackrest/prj-golden/pg-golden/repo1/backup/db/backup.info' for read
FileMissingError: unable to open missing file '/pgbackrest/prj-golden/pg-golden/repo1/backup/db/backup.info.copy' for read
HINT: backup.info cannot be opened and is required to perform a backup.
HINT: has a stanza-create been performed?
ERROR: [075]: no backup set found to restore
2022-06-04 00:25:29,096 ERROR: Error creating replica using method pgbackrest: 'bash' '-ceu' '--' 'install --directory --mode=0700 "${PGDATA?}" && exec "$@"' '-' 'pgbackrest' 'restore' '--delta' '--stanza=db' '--repo=1' '--link-map=pg_wal=/pgdata/pg12_wal' exited with code=75
2022-06-04 00:25:29,096 ERROR: failed to bootstrap (without leader)
2022-06-04 00:25:29,096 INFO: Removing data directory: /pgdata/pg12
I'd like to understand why the data directory was removed and how to prevent this next time.
PGO version : 5
I haven't seen this behavior before. Can you give more info about (a) your environment (including kubernetes version) and (b) the particular postgres cluster (such as the yaml)?
ETA: Wait, perhaps this was done by Patroni (which manages which postgres is the leader): https://github.com/zalando/patroni/blob/v2.0.2/patroni/ha.py#L249
So why did this cluster fail to bootstrap according to Patroni? Did this cluster have more than one replica? I would be curious to hear about the postgres cluster's spec and images.
Thanks for your comment @benjaminjb
Right, I think the problem was with Patroni. I found that my second replica had crashed, and when Patroni triggered the failover, the ex-master removed its data, but the crashed replica couldn't take over as leader, so everything was erased.
I think it should check that the replica is up and synced before starting the failover process.
@khalMeg can you provide the spec you're using to perform the restore? Based on your description I assume you are trying to create a new cluster by restoring from a backup created before the prior cluster crashed, but if there are any details I am missing please let me know.
Additionally, can you confirm that backups were completing successfully in the S3 bucket you are trying to restore from prior to the restore?
And lastly, can you provide the specific version of PGO v5 that you are using?
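For the backup question above, one way to check is to run `pgbackrest info --stanza=db --output=json` in a container that has the pgBackRest configuration, then look at what it reports for the stanza. Below is a minimal Python sketch of that check. The function names `summarize_stanza` and `fetch_info` are hypothetical, and the JSON layout is an assumption: a list of stanza objects with `name`, `status`, and `backup` keys, with backups listed oldest first. Please verify this against the output of your pgBackRest version.

```python
import json
import subprocess


def summarize_stanza(info_json: str, stanza: str = "db") -> dict:
    """Summarize one stanza from `pgbackrest info --output=json` text.

    Assumes the output is a JSON list of stanza objects with
    "name", "status" and "backup" keys (backups oldest first).
    """
    for entry in json.loads(info_json):
        if entry.get("name") == stanza:
            backups = entry.get("backup", [])
            status = entry.get("status", {})
            return {
                "found": True,
                "status": status.get("message"),
                "backup_count": len(backups),
                "latest": backups[-1].get("label") if backups else None,
            }
    # The stanza is not listed at all in the info output.
    return {"found": False, "status": None, "backup_count": 0, "latest": None}


def fetch_info(stanza: str = "db") -> str:
    """Run pgbackrest; only works where pgbackrest and its config exist."""
    result = subprocess.run(
        ["pgbackrest", "info", f"--stanza={stanza}", "--output=json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

If `backup_count` comes back as 0, or the stanza is not found, that would be consistent with the `no backup set found to restore` error in the log above.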
As mentioned by @benjaminjb, Patroni is simply removing the data directory after a failed bootstrap attempt. This is expected, since Patroni views PGDATA as invalid as a result of the failed restore, and therefore removes the invalid content for another restore attempt.
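To make that sequence concrete, here is a rough Python sketch of the pattern: create the data directory, try the restore, and clean up if it fails. This is only an illustration of the behavior described above, not Patroni's actual code. `bootstrap_replica` and the `restore` callback are hypothetical names.

```python
import shutil
from pathlib import Path
from typing import Callable


def bootstrap_replica(data_dir: Path, restore: Callable[[Path], bool]) -> bool:
    """Illustrative only: try a restore, wipe data_dir if it fails."""
    # Mirrors the `install --directory --mode=0700 "${PGDATA?}"` step in the log.
    data_dir.mkdir(mode=0o700, parents=True, exist_ok=True)

    if restore(data_dir):
        return True

    # A failed restore may leave partial files behind, so the directory is
    # treated as invalid and removed before the next attempt.
    shutil.rmtree(data_dir, ignore_errors=True)
    return False
```

In the log above, the restore step is the `pgbackrest restore` command that exited with code 75, and the cleanup step corresponds to the `Removing data directory: /pgdata/pg12` line.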