If did pitr resotre with wrong timestamp by mistake, no way to recover DB anymore
Hi there, I got an interesting behavior and couldn't find a solution online. Appreciate any guide
Environment: CrunchyData postgres operator 4.7.4 + openshift 4.6.48
Use case: I have one full backup generated at 2022-01-13 00:17:06 +0000 UTC
bash-4.4$ pgo -n testsystem show backup pgotest
cluster: pgotest
storage type: posix
stanza: db
status: ok
cipher: none
db (current)
wal archive min/max (13-1)
full backup: 20220113-001706F
timestamp start/stop: 2022-01-13 00:17:06 +0000 UTC / 2022-01-13 00:17:36 +0000 UTC
wal start/stop: 000000010000000000000002 / 000000010000000000000002
database size: 31.0MiB, backup size: 31.0MiB
repository size: 3.8MiB, repository backup size: 3.8MiB
backup reference list:
bash-4.4$
I understand that I should not use a pitr timestamp prior to the oldest full backup. However, think about the case that somebody made a mistake: Somebody ran the following command:
pgo -n testsystem restore pgotest --pitr-target="2022-01-12 00:00:00.000000+00" --backup-opts="--type=time"
Since this pitr-target timestamp is older than the full backup, no doubt it will fail.
The thing is: I couldn't find a proper way to fix this kind of mistake. i.e. I can't bring the DB back to normal anymore (at least I can't find the proper way)
What I did was that: I ran the restore command again with a proper pitr-target time:
pgo -n testsystem restore pgotest --pitr-target="2022-01-13 01:00:00.000000+00" --backup-opts="--type=time"
Interestingly I got the following error logs in pgotest-bootstrap:
# oc logs pgotest-bootstrap-ct2sg
NWRAP_ERROR(1) - nwrap_files_cache_reload: Unable to open '/tmp/nss_wrapper/postgres/passwd' readonly -1:No such file or directory
NWRAP_ERROR(1) - nwrap_files_getpwuid: Error loading passwd file
Thu Jan 13 00:57:08 UTC 2022 WARN: Detected an earlier failed attempt to initialize
Thu Jan 13 00:57:08 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
Thu Jan 13 00:57:08 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
Thu Jan 13 00:57:18 UTC 2022 WARN: Detected an earlier failed attempt to initialize
Thu Jan 13 00:57:18 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
Thu Jan 13 00:57:18 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
Thu Jan 13 00:57:28 UTC 2022 WARN: Detected an earlier failed attempt to initialize
Thu Jan 13 00:57:28 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
Thu Jan 13 00:57:28 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
Thu Jan 13 00:57:38 UTC 2022 WARN: Detected an earlier failed attempt to initialize
Thu Jan 13 00:57:38 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
Thu Jan 13 00:57:38 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
Thu Jan 13 00:57:48 UTC 2022 WARN: Detected an earlier failed attempt to initialize
Thu Jan 13 00:57:48 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
Thu Jan 13 00:57:48 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
The NWRAP_ERROR can be ignored. It also exists in the successful restoration operation as well.
Looks like I need to do something to stop or delete the previous failed attempt first but I couldn't find how to do so. Or what I should do to restore the DB properly.
Removing '/pgdata/pgotest.initializing' won't help. In fact the logs indicate that I need to correct the issue first and then remove the "pgotest.initializing" and try again.
Many thanks in advance.
@stanleyyzhu when you run into this scenario you can do the following:
- First,
kubectl execinto the "bootstrap" Pod:
kubectl exec -it mycluster-bootstrap-tk99r -c database -- bash
- Then, remove the "initializing" file:
rm /pgdata/mycluster.initializing
- And finally, rename the
PGDATAdirectory (since it will have been renamed with a timestamp following the failure, e.g. topgotest_2022-01-13-00-56-07in your example). Note that this step isn't necessary if you want to restore into an emptyPGDATAdirectory (renaming enables a pgBackRest delta restore):
mv /pgdata/mycluster_2022-01-27-01-33-51 mycluster
At this point you can exit from the container, and then re-attempt the pgo restore using the proper/correct timestamp.
Wow~ Sorry andrewlecuyer for my late response, and thank you for sharing the guide/steps with me. I'll give it a try and get back to you here. Really appreciate your help.
[Feb 02] Sorry @andrewlecuyer, my test bed crashed... I'm trying to get it restored. Once the test environment is back to normal, I'll try your proposal and get back to you.
Thank you @andrewlecuyer, It works! Our testbed was just restored and I tested with the steps you provided. It works :) Thank you so much for helping me on this
@stanleyyzhu I don't understand. If you apply a restore older than a full backup but earlier than an incremential backup, it should not fail. I've just did a restore a PITR restore based on time, older than a full backup but earlier than an incremential backup.
I wonder if you have incremential backups configured or not?
@bzdgn , sorry for the late reply. In my case, there isn't any incremental backup older than the earliest initial full backup.
The case is: If somebody (the "somebody" is me, who was doing this test) made a mistake, picked up a timestamp earlier than the initial backup, the recovery would fail no doubt. After that, I wanted to fix the problem so I selected a timestamp later than a full backup, I got the error mentioned in this ticket ("Detected an earlier failed attempt to initialize"). Then I needed to use the method that andrewlecuyer mentioned to delete the previous attempt.
The backup file I have:
bash-4.4$ pgo -n testsystem show backup pgotest
cluster: pgotest
storage type: posix
stanza: db
status: ok
cipher: none
db (current)
wal archive min/max (13-1)
full backup: 20220113-001706F
timestamp start/stop: 2022-01-13 00:17:06 +0000 UTC / 2022-01-13 00:17:36 +0000 UTC
wal start/stop: 000000010000000000000002 / 000000010000000000000002
database size: 31.0MiB, backup size: 31.0MiB
repository size: 3.8MiB, repository backup size: 3.8MiB
backup reference list:
bash-4.4$
The mistake that somebody made:
pgo -n testsystem restore pgotest --pitr-target="2022-01-12 00:00:00.000000+00" --backup-opts="--type=time"
You can see that the --pitr-target selected by the above command (Jan 12), is earlier than the oldest backup (Jan 13, which is the initial full backup when creating DB).
Then the command I used to fix this:
pgo -n testsystem restore pgotest --pitr-target="2022-01-13 01:00:00.000000+00" --backup-opts="--type=time"
This time I selected --pitr-target Jan 13 01:00:00, which was later than the full backup timestamp (started at 00:17:06 and stopped at 00:17:36 on Jan 13)
And then the error I saw in the log:
# oc logs pgotest-bootstrap-ct2sg
NWRAP_ERROR(1) - nwrap_files_cache_reload: Unable to open '/tmp/nss_wrapper/postgres/passwd' readonly -1:No such file or directory
NWRAP_ERROR(1) - nwrap_files_getpwuid: Error loading passwd file
Thu Jan 13 00:57:08 UTC 2022 WARN: Detected an earlier failed attempt to initialize
Thu Jan 13 00:57:08 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
Thu Jan 13 00:57:08 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
Thu Jan 13 00:57:18 UTC 2022 WARN: Detected an earlier failed attempt to initialize
Thu Jan 13 00:57:18 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
Thu Jan 13 00:57:18 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
Thu Jan 13 00:57:28 UTC 2022 WARN: Detected an earlier failed attempt to initialize
Thu Jan 13 00:57:28 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
Thu Jan 13 00:57:28 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
Thu Jan 13 00:57:38 UTC 2022 WARN: Detected an earlier failed attempt to initialize
Thu Jan 13 00:57:38 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
Thu Jan 13 00:57:38 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
Since this issue was resolved per https://github.com/CrunchyData/postgres-operator/issues/2979#issuecomment-1038626940, I am proceeding with closing this issue.