postgres-operator icon indicating copy to clipboard operation
postgres-operator copied to clipboard

Postgres Restore failed

Open highgo-sunrk opened this issue 10 months ago • 1 comments

Environment

Please provide the following details:

  • Platform: Kubernetes
  • Platform Version: 1.23.7
  • PGO Image Tag:ubi8-5.7.3
  • Postgres Version 14
  • Storage: nfs

Steps to Reproduce

I performed two backups using pgBackRest. The first backup was automatically completed after the cluster was created. I executed the restore operation by modifying the CR file. pgbackrest info

stanza: db
    status: ok
    cipher: none

    db (current)
        wal archive min/max (14): 000000010000000000000001/00000002000000000000000E

        full backup: 20250310-023907F
            timestamp start/stop: 2025-03-10 02:39:07+00 / 2025-03-10 02:40:15+00
            wal start/stop: 000000010000000000000003 / 000000010000000000000004
            database size: 52.3MB, database backup size: 52.3MB
            repo1: backup set size: 6.7MB, backup size: 6.7MB

        full backup: 20250310-030103F
            timestamp start/stop: 2025-03-10 03:01:03+00 / 2025-03-10 03:01:46+00
            wal start/stop: 000000010000000000000006 / 000000010000000000000007
            database size: 52.4MB, database backup size: 52.4MB
            repo1: backup set size: 6.7MB, backup size: 6.7MB

        incr backup: 20250310-030103F_20250310-030518I
            timestamp start/stop: 2025-03-10 03:05:18+00 / 2025-03-10 03:05:25+00
            wal start/stop: 00000001000000000000000A / 00000001000000000000000B
            database size: 52.4MB, database backup size: 4.0MB
            repo1: backup set size: 6.7MB, backup size: 447.0KB
            backup reference list: 20250310-030103F

The first restore was to the backup set 20250310-030103F_20250310-030518I, which completed normally and the data was intact. I then restored again to the first backup set 20250310-023907F, which also restored normally. However, when I tried to restore to the backup set 20250310-030103F_20250310-030518I once more, an error occurred. Upon checking the logs of the restore pod, I found the following error message.

 kg logs pod/v9test-pgbackrest-restore-g6gq4 -f 
+ pgbackrest restore --type=time '--target=2025-03-10 07:04:46' --stanza=db --pg1-path=/pgdata/pg14 --repo=1 --delta --target-action=promote --link-map=pg_wal=/pgdata/pg14_wal
2025-03-10 07:29:53.448 GMT [20] LOG:  starting PostgreSQL 14.13 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.4.0, 64-bit
2025-03-10 07:29:53.463 GMT [20] LOG:  listening on IPv6 address "::1", port 5866
2025-03-10 07:29:53.463 GMT [20] LOG:  listening on IPv4 address "127.0.0.1", port 5866
2025-03-10 07:29:53.470 GMT [20] LOG:  listening on IPv6 address "::1", port 1521
2025-03-10 07:29:53.470 GMT [20] LOG:  listening on IPv4 address "127.0.0.1", port 1521
2025-03-10 07:29:53.472 GMT [20] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5866"
2025-03-10 07:29:53.485 GMT [20] LOG:  listening on Unix socket "/tmp/.s.PGSQL.1521"
2025-03-10 07:29:53.530 GMT [20] LOG:  This is an trial version of hgdb-enterprise-9.0.1, valid until 2099年10月25日. After-sales service hotline: 400-708-8006. Online technical support platform:https://support.highgo.com/.
2025-03-10 07:29:53.541 GMT [21] LOG:  database system was interrupted; last known up at 2025-03-10 07:03:31 GMT
2025-03-10 07:30:07.674 GMT [21] LOG:  restored log file "00000003.history" from archive
2025-03-10 07:30:07.742 GMT [21] LOG:  restored log file "00000004.history" from archive
2025-03-10 07:30:07.820 GMT [21] LOG:  restored log file "00000005.history" from archive
2025-03-10 07:30:07.893 GMT [21] LOG:  starting point-in-time recovery to 2025-03-10 07:04:46+00
2025-03-10 07:30:07.961 GMT [21] LOG:  restored log file "00000005.history" from archive
2025-03-10 07:30:08.333 GMT [21] LOG:  restored log file "00000004000000000000000D" from archive
2025-03-10 07:30:08.667 GMT [21] LOG:  restored log file "00000004000000000000000C" from archive
2025-03-10 07:30:08.709 GMT [21] FATAL:  requested timeline 5 is not a child of this server's history
2025-03-10 07:30:08.709 GMT [21] DETAIL:  Latest checkpoint is at 0/D000060 on timeline 2, but in the history of the requested timeline, the server forked off from that timeline at 0/F005068.
2025-03-10 07:30:08.712 GMT [20] LOG:  startup process (PID 21) exited with exit code 1
2025-03-10 07:30:08.712 GMT [20] LOG:  aborting startup due to startup process failure
2025-03-10 07:30:08.716 GMT [20] LOG:  database system is shut down
pg_ctl: could not start server
Examine the log output.
2025-03-10 07:30:09.013 GMT [39] LOG:  starting PostgreSQL 14.13 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.4.0, 64-bit
2025-03-10 07:30:09.014 GMT [39] LOG:  listening on IPv6 address "::1", port 5866
2025-03-10 07:30:09.014 GMT [39] LOG:  listening on IPv4 address "127.0.0.1", port 5866
2025-03-10 07:30:09.022 GMT [39] LOG:  listening on IPv6 address "::1", port 1521
2025-03-10 07:30:09.022 GMT [39] LOG:  listening on IPv4 address "127.0.0.1", port 1521
2025-03-10 07:30:09.024 GMT [39] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5866"
2025-03-10 07:30:09.032 GMT [39] LOG:  listening on Unix socket "/tmp/.s.PGSQL.1521"
2025-03-10 07:30:09.074 GMT [39] LOG:  This is an trial version of hgdb-enterprise-9.0.1, valid until 2099年10月25日. After-sales service hotline: 400-708-8006. Online technical support platform:https://support.highgo.com/.
2025-03-10 07:30:09.086 GMT [40] LOG:  database system was interrupted; last known up at 2025-03-10 07:03:31 GMT
2025-03-10 07:30:29.056 GMT [40] LOG:  restored log file "00000003.history" from archive
2025-03-10 07:30:29.114 GMT [40] LOG:  restored log file "00000004.history" from archive
2025-03-10 07:30:29.172 GMT [40] LOG:  restored log file "00000005.history" from archive
2025-03-10 07:30:29.257 GMT [40] LOG:  starting point-in-time recovery to 2025-03-10 07:04:46+00
2025-03-10 07:30:29.453 GMT [40] LOG:  restored log file "00000005.history" from archive
2025-03-10 07:30:29.849 GMT [40] LOG:  restored log file "00000004000000000000000D" from archive
2025-03-10 07:30:30.218 GMT [40] LOG:  restored log file "00000004000000000000000C" from archive
2025-03-10 07:30:30.255 GMT [40] FATAL:  requested timeline 5 is not a child of this server's history
2025-03-10 07:30:30.255 GMT [40] DETAIL:  Latest checkpoint is at 0/D000060 on timeline 2, but in the history of the requested timeline, the server forked off from that timeline at 0/F005068.
2025-03-10 07:30:30.258 GMT [39] LOG:  startup process (PID 40) exited with exit code 1
2025-03-10 07:30:30.258 GMT [39] LOG:  aborting startup due to startup process failure
2025-03-10 07:30:30.262 GMT [39] LOG:  database system is shut down
pg_ctl: could not start server
Examine the log output.

I modified the operator's code and added the following lines:

if foundTarget {
	opts = append(opts, "--target-action=promote", "--target-timeline=current")
}

With the modified operator, the primary node can start normally, but the secondary node encounters the following error.

2025-03-10 07:42:35,011 WARNING: Postgresql is not running.
2025-03-10 07:42:35,012 INFO: Lock owner: v9test-highgo-vgwk-0; I am v9test-highgo-d4nx-0
2025-03-10 07:42:35,018 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202107181
  Database system identifier: 7480068508330737760
  Database cluster state: shut down in recovery
  pg_control last modified: Mon Mar 10 07:21:41 2025
  Latest checkpoint location: 0/D000060
  Latest checkpoint's REDO location: 0/C000550
  Latest checkpoint's REDO WAL file: 00000004000000000000000C
  Latest checkpoint's TimeLineID: 4
  Latest checkpoint's PrevTimeLineID: 4
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:818
  Latest checkpoint's NextOID: 40959
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 784
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 818
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Mon Mar 10 07:20:41 2025
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/D000110
  Min recovery ending loc's timeline: 4
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: logical
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_prepared_foreign_transactions setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: 0bb02f977b2417c0296e95e2684c288621ab04cd0ed34e291a08ac247de78475
  database mode: 1
  case conversion mode: 1

2025-03-10 07:42:35,019 INFO: Lock owner: v9test-highgo-vgwk-0; I am v9test-highgo-d4nx-0
2025-03-10 07:42:35,024 INFO: starting as a secondary
2025-03-10 07:42:36,053 INFO: postmaster pid=114
/tmp/postgres:5866 - no response
2025-03-10 07:42:36.174 UTC [114] LOG:  pgaudit extension initialized
2025-03-10 07:42:36.364 UTC [114] LOG:  starting PostgreSQL 14.13 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.4.0, 64-bit
2025-03-10 07:42:36.367 UTC [114] LOG:  listening on IPv4 address "0.0.0.0", port 5866
2025-03-10 07:42:36.367 UTC [114] LOG:  listening on IPv6 address "::", port 5866
2025-03-10 07:42:36.425 UTC [114] LOG:  listening on IPv6 address "::1", port 1521
2025-03-10 07:42:36.425 UTC [114] LOG:  listening on IPv4 address "127.0.0.1", port 1521
2025-03-10 07:42:36.445 UTC [114] LOG:  listening on Unix socket "/tmp/postgres/.s.PGSQL.5866"
2025-03-10 07:42:36.483 UTC [114] LOG:  listening on Unix socket "/tmp/postgres/.s.PGSQL.1521"
2025-03-10 07:42:36.788 UTC [114] LOG:  This is an trial version of hgdb-enterprise-9.0.1, valid until 2099年10月25日. After-sales service hotline: 400-708-8006. Online technical support platform:https://support.highgo.com/.
2025-03-10 07:42:36.852 UTC [117] LOG:  database system was shut down in recovery at 2025-03-10 07:21:41 UTC
2025-03-10 07:42:37.245 UTC [120] FATAL:  the database system is starting up
/tmp/postgres:5866 - rejecting connections
2025-03-10 07:42:37.258 UTC [122] FATAL:  the database system is starting up
/tmp/postgres:5866 - rejecting connections
2025-03-10 07:42:37.354 UTC [117] LOG:  restored log file "00000005.history" from archive
2025-03-10 07:42:37.652 UTC [117] LOG:  restored log file "00000006.history" from archive
2025-03-10 07:42:37.894 UTC [117] LOG:  entering standby mode
2025-03-10 07:42:38.053 UTC [117] LOG:  restored log file "00000006.history" from archive
2025-03-10 07:42:38.304 UTC [129] FATAL:  the database system is starting up
/tmp/postgres:5866 - rejecting connections
2025-03-10 07:42:38.618 UTC [117] LOG:  restored log file "00000002000000000000000D" from archive
2025-03-10 07:42:38.852 UTC [117] FATAL:  requested timeline 6 does not contain minimum recovery point 0/D000110 on timeline 4
2025-03-10 07:42:38.865 UTC [114] LOG:  startup process (PID 117) exited with exit code 1
2025-03-10 07:42:38.865 UTC [114] LOG:  aborting startup due to startup process failure
2025-03-10 07:42:38.885 UTC [114] LOG:  database system is shut down
/tmp/postgres:5866 - no response

Has anyone encountered this issue before? Is my code modification correct? Please help me with this.

highgo-sunrk avatar Mar 10 '25 10:03 highgo-sunrk


I’m not sure whether this logic originates from PostgreSQL itself or from pgBackRest.

However, as seen in the logs:

2025-03-10 07:30:30.255 GMT [40] FATAL:  requested timeline 5 is not a child of this server's history
requested timeline 6 does not contain minimum recovery point 0/D000110 on timeline 4
2025-03-10 07:42:38.865 UTC [114] LOG:  startup process (PID 117) exited with exit code 1

Each time you perform a restore or failover, PostgreSQL generates a new xxx.history file. For example:

Latest checkpoint is at 0/D000060 on timeline 2, but in the history of the requested timeline, the server forked off from that timeline at 0/F005068.

In this case, you may need to explicitly specify the correct timeline during recovery using:

--target-timeline=2

Alternatively, you might consider removing the conflicting 00000006.history and 00000005.history files to prevent timeline mismatch errors during recovery.


ethanHealthy avatar Jun 28 '25 07:06 ethanHealthy