postgres-operator icon indicating copy to clipboard operation
postgres-operator copied to clipboard

If did pitr resotre with wrong timestamp by mistake, no way to recover DB anymore

Open stanleyyzhu opened this issue 2 years ago • 5 comments

Hi there, I got an interesting behavior and couldn't find a solution online. Appreciate any guide

Environment: CrunchyData postgres operator 4.7.4 + openshift 4.6.48

Use case: I have one full backup generated at 2022-01-13 00:17:06 +0000 UTC

bash-4.4$ pgo -n testsystem show backup pgotest

cluster: pgotest
storage type: posix

stanza: db
    status: ok
    cipher: none

    db (current)
        wal archive min/max (13-1)

        full backup: 20220113-001706F
            timestamp start/stop: 2022-01-13 00:17:06 +0000 UTC / 2022-01-13 00:17:36 +0000 UTC
            wal start/stop: 000000010000000000000002 / 000000010000000000000002
            database size: 31.0MiB, backup size: 31.0MiB
            repository size: 3.8MiB, repository backup size: 3.8MiB
            backup reference list:

bash-4.4$

I understand that I should not use a pitr timestamp prior to the oldest full backup. However, think about the case that somebody made a mistake: Somebody ran the following command: pgo -n testsystem restore pgotest --pitr-target="2022-01-12 00:00:00.000000+00" --backup-opts="--type=time"

Since this pitr-target timestamp is older than the full backup, no doubt it will fail.

The thing is: I couldn't find a proper way to fix this kind of mistake. i.e. I can't bring the DB back to normal anymore (at least I can't find the proper way)

What I did was that: I ran the restore command again with a proper pitr-target time: pgo -n testsystem restore pgotest --pitr-target="2022-01-13 01:00:00.000000+00" --backup-opts="--type=time"

Interestingly I got the following error logs in pgotest-bootstrap:

# oc logs pgotest-bootstrap-ct2sg
 NWRAP_ERROR(1) - nwrap_files_cache_reload: Unable to open '/tmp/nss_wrapper/postgres/passwd' readonly -1:No such file or directory
 NWRAP_ERROR(1) - nwrap_files_getpwuid: Error loading passwd file
 Thu Jan 13 00:57:08 UTC 2022 WARN: Detected an earlier failed attempt to initialize
 Thu Jan 13 00:57:08 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
 Thu Jan 13 00:57:08 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
 Thu Jan 13 00:57:18 UTC 2022 WARN: Detected an earlier failed attempt to initialize
 Thu Jan 13 00:57:18 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
 Thu Jan 13 00:57:18 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
 Thu Jan 13 00:57:28 UTC 2022 WARN: Detected an earlier failed attempt to initialize
 Thu Jan 13 00:57:28 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
 Thu Jan 13 00:57:28 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
 Thu Jan 13 00:57:38 UTC 2022 WARN: Detected an earlier failed attempt to initialize
 Thu Jan 13 00:57:38 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
 Thu Jan 13 00:57:38 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
 Thu Jan 13 00:57:48 UTC 2022 WARN: Detected an earlier failed attempt to initialize
 Thu Jan 13 00:57:48 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
 Thu Jan 13 00:57:48 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07

The NWRAP_ERROR can be ignored. It also exists in the successful restoration operation as well.

Looks like I need to do something to stop or delete the previous failed attempt first but I couldn't find how to do so. Or what I should do to restore the DB properly.

Removing '/pgdata/pgotest.initializing' won't help. In fact the logs indicate that I need to correct the issue first and then remove the "pgotest.initializing" and try again.

Many thanks in advance.

stanleyyzhu avatar Jan 13 '22 01:01 stanleyyzhu

@stanleyyzhu when you run into this scenario you can do the following:

  1. First, kubectl exec into the "bootstrap" Pod:
kubectl exec -it mycluster-bootstrap-tk99r -c database -- bash
  1. Then, remove the "initializing" file:
rm /pgdata/mycluster.initializing
  1. And finally, rename the PGDATA directory (since it will have been renamed with a timestamp following the failure, e.g. to pgotest_2022-01-13-00-56-07 in your example). Note that this step isn't necessary if you want to restore into an empty PGDATA directory (renaming enables a pgBackRest delta restore):
mv /pgdata/mycluster_2022-01-27-01-33-51 mycluster

At this point you can exit from the container, and then re-attempt the pgo restore using the proper/correct timestamp.

andrewlecuyer avatar Jan 27 '22 01:01 andrewlecuyer

Wow~ Sorry andrewlecuyer for my late response, and thank you for sharing the guide/steps with me. I'll give it a try and get back to you here. Really appreciate your help.

[Feb 02] Sorry @andrewlecuyer, my test bed crashed... I'm trying to get it restored. Once the test environment is back to normal, I'll try your proposal and get back to you.

stanleyyzhu avatar Jan 31 '22 23:01 stanleyyzhu

Thank you @andrewlecuyer, It works! Our testbed was just restored and I tested with the steps you provided. It works :) Thank you so much for helping me on this

stanleyyzhu avatar Feb 14 '22 04:02 stanleyyzhu

@stanleyyzhu I don't understand. If you apply a restore older than a full backup but earlier than an incremential backup, it should not fail. I've just did a restore a PITR restore based on time, older than a full backup but earlier than an incremential backup.

I wonder if you have incremential backups configured or not?

bzdgn avatar Aug 24 '22 11:08 bzdgn

@bzdgn , sorry for the late reply. In my case, there isn't any incremental backup older than the earliest initial full backup.

The case is: If somebody (the "somebody" is me, who was doing this test) made a mistake, picked up a timestamp earlier than the initial backup, the recovery would fail no doubt. After that, I wanted to fix the problem so I selected a timestamp later than a full backup, I got the error mentioned in this ticket ("Detected an earlier failed attempt to initialize"). Then I needed to use the method that andrewlecuyer mentioned to delete the previous attempt.

The backup file I have:

bash-4.4$ pgo -n testsystem show backup pgotest

cluster: pgotest
storage type: posix

stanza: db
    status: ok
    cipher: none

    db (current)
        wal archive min/max (13-1)

        full backup: 20220113-001706F
            timestamp start/stop: 2022-01-13 00:17:06 +0000 UTC / 2022-01-13 00:17:36 +0000 UTC
            wal start/stop: 000000010000000000000002 / 000000010000000000000002
            database size: 31.0MiB, backup size: 31.0MiB
            repository size: 3.8MiB, repository backup size: 3.8MiB
            backup reference list:

bash-4.4$

The mistake that somebody made: pgo -n testsystem restore pgotest --pitr-target="2022-01-12 00:00:00.000000+00" --backup-opts="--type=time" You can see that the --pitr-target selected by the above command (Jan 12), is earlier than the oldest backup (Jan 13, which is the initial full backup when creating DB).

Then the command I used to fix this: pgo -n testsystem restore pgotest --pitr-target="2022-01-13 01:00:00.000000+00" --backup-opts="--type=time" This time I selected --pitr-target Jan 13 01:00:00, which was later than the full backup timestamp (started at 00:17:06 and stopped at 00:17:36 on Jan 13)

And then the error I saw in the log:

# oc logs pgotest-bootstrap-ct2sg
 NWRAP_ERROR(1) - nwrap_files_cache_reload: Unable to open '/tmp/nss_wrapper/postgres/passwd' readonly -1:No such file or directory
 NWRAP_ERROR(1) - nwrap_files_getpwuid: Error loading passwd file
 Thu Jan 13 00:57:08 UTC 2022 WARN: Detected an earlier failed attempt to initialize
 Thu Jan 13 00:57:08 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
 Thu Jan 13 00:57:08 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
 Thu Jan 13 00:57:18 UTC 2022 WARN: Detected an earlier failed attempt to initialize
 Thu Jan 13 00:57:18 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
 Thu Jan 13 00:57:18 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
 Thu Jan 13 00:57:28 UTC 2022 WARN: Detected an earlier failed attempt to initialize
 Thu Jan 13 00:57:28 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
 Thu Jan 13 00:57:28 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07
 Thu Jan 13 00:57:38 UTC 2022 WARN: Detected an earlier failed attempt to initialize
 Thu Jan 13 00:57:38 UTC 2022 INFO: Correct the issue, remove '/pgdata/pgotest.initializing', and try again
 Thu Jan 13 00:57:38 UTC 2022 INFO: Your data might be in: /pgdata/pgotest_2022-01-13-00-56-07

stanleyyzhu avatar Aug 30 '22 13:08 stanleyyzhu

Since this issue was resolved per https://github.com/CrunchyData/postgres-operator/issues/2979#issuecomment-1038626940, I am proceeding with closing this issue.

andrewlecuyer avatar Mar 22 '23 13:03 andrewlecuyer