pghoard icon indicating copy to clipboard operation
pghoard copied to clipboard

Unable to restore from backup

Open shamil opened this issue 7 years ago • 9 comments

I've tried multiple times to accomplish a basic scenario.

  1. Start pghoard, which initially takes the first base backup
  2. Restore that backup to new instance

I'm using S3 storage to save the backups.

Every time I do restore I get errors about miising files in the backup storage

2017-07-04 09:17:07,531	TransferAgent	Thread-9	WARNING	'default/timeline/0000001C.history' not found from storage
2017-07-04 09:17:07,531	TransferAgent	Thread-9	INFO	'DOWNLOAD' FAILED transfer of key: 'default/timeline/0000001C.history', size: 0, took 0.028s
2017-07-04 09:17:07,531	TransferAgent	Thread-8	WARNING	'default/timeline/0000001D.history' not found from storage
2017-07-04 09:17:07,531	TransferAgent	Thread-8	INFO	'DOWNLOAD' FAILED transfer of key: 'default/timeline/0000001D.history', size: 0, took 0.028s
2017-07-04 09:17:07,535	TransferAgent	Thread-12	WARNING	'default/timeline/0000001E.history' not found from storage
2017-07-04 09:17:07,535	TransferAgent	Thread-12	INFO	'DOWNLOAD' FAILED transfer of key: 'default/timeline/0000001E.history', size: 0, took 0.030s
2017-07-04 09:17:07,535	TransferAgent	Thread-10	WARNING	'default/timeline/0000001A.history' not found from storage
2017-07-04 09:17:07,535	TransferAgent	Thread-10	INFO	'DOWNLOAD' FAILED transfer of key: 'default/timeline/0000001A.history', size: 0, took 0.030s
2017-07-04 09:17:07,599	TransferAgent	Thread-11	WARNING	'default/timeline/0000001B.history' not found from storage
2017-07-04 09:17:07,600	TransferAgent	Thread-11	INFO	'DOWNLOAD' FAILED transfer of key: 'default/timeline/0000001B.history', size: 0, took 0.098s
127.0.0.1 - - [04/Jul/2017 09:17:07] "GET /default/archive/0000001A.history HTTP/1.1" 404 -
2017-07-04 09:17:07,677	TransferAgent	Thread-8	WARNING	'default/timeline/00000019.history' not found from storage
2017-07-04 09:17:07,677	TransferAgent	Thread-8	INFO	'DOWNLOAD' FAILED transfer of key: 'default/timeline/00000019.history', size: 0, took 0.008s
2017-07-04 09:17:07,678	TransferAgent	Thread-9	WARNING	'default/timeline/0000001A.history' not found from storage
2017-07-04 09:17:07,678	TransferAgent	Thread-9	INFO	'DOWNLOAD' FAILED transfer of key: 'default/timeline/0000001A.history', size: 0, took 0.009s
127.0.0.1 - - [04/Jul/2017 09:17:07] "GET /default/archive/00000019.history HTTP/1.1" 404 -
2017-07-04 09:17:07,758	TransferAgent	Thread-12	WARNING	'default/xlog/0000001900000000000000F0' not found from storage
2017-07-04 09:17:07,758	TransferAgent	Thread-12	INFO	'DOWNLOAD' FAILED transfer of key: 'default/xlog/0000001900000000000000F0', size: 0, took 0.012s
2017-07-04 09:17:07,760	TransferAgent	Thread-9	WARNING	'default/xlog/0000001900000000000000EF' not found from storage
2017-07-04 09:17:07,760	TransferAgent	Thread-9	INFO	'DOWNLOAD' FAILED transfer of key: 'default/xlog/0000001900000000000000EF', size: 0, took 0.012s
2017-07-04 09:17:07,801	TransferAgent	Thread-11	WARNING	'default/xlog/0000001900000000000000F2' not found from storage
2017-07-04 09:17:07,802	TransferAgent	Thread-11	INFO	'DOWNLOAD' FAILED transfer of key: 'default/xlog/0000001900000000000000F2', size: 0, took 0.055s
2017-07-04 09:17:07,810	TransferAgent	Thread-8	INFO	'DOWNLOAD' transfer of key: 'default/xlog/0000001900000000000000F3', size: 797270, took 0.062s
2017-07-04 09:17:07,817	TransferAgent	Thread-10	WARNING	'default/xlog/0000001900000000000000F1' not found from storage
2017-07-04 09:17:07,818	TransferAgent	Thread-10	INFO	'DOWNLOAD' FAILED transfer of key: 'default/xlog/0000001900000000000000F1', size: 0, took 0.071s
127.0.0.1 - - [04/Jul/2017 09:17:07] "GET /default/archive/0000001900000000000000EF HTTP/1.1" 404 -

The thing is that those history/xlog files don't exist in the backup, and pghoard never uploaded them. I just see the base backup archive in the S3. What could be the reason.

I tried multiple combination of restore, including --restore-to-master

shamil avatar Jul 04 '17 11:07 shamil

The download FAILED is expected, PostgreSQL itself asks for nonexistent files when it's restoring data. (it doesn't follow the timelines exactly based on .history files but instead asks for possible files that may exist)

I assume the actual restore is working though right?

Ormod avatar Jul 04 '17 11:07 Ormod

@Ormod nope, the server fails to start

2017-07-04 11:21:06 UTC [8914-1] pghoard@[unknown] FATAL:  the database system is starting up
2017-07-04 11:21:06 UTC [8915-1] pghoard@[unknown] FATAL:  the database system is starting up
/usr/local/bin/pghoard_postgres_command: ERROR: '00000001000000000000000A' not found from archive
/usr/local/bin/pghoard_postgres_command: ERROR: '00000002.history' not found from archive
2017-07-04 11:21:11 UTC [8921-1] pghoard@[unknown] FATAL:  the database system is starting up
2017-07-04 11:21:11 UTC [8922-1] pghoard@[unknown] FATAL:  the database system is starting up
/usr/local/bin/pghoard_postgres_command: ERROR: '00000001000000000000000A' not found from archive
/usr/local/bin/pghoard_postgres_command: ERROR: '00000002.history' not found from archive
2017-07-04 11:21:16 UTC [8932-1] pghoard@[unknown] FATAL:  the database system is starting up
2017-07-04 11:21:16 UTC [8933-1] pghoard@[unknown] FATAL:  the database system is starting up
/usr/local/bin/pghoard_postgres_command: ERROR: '00000001000000000000000A' not found from archive
/usr/local/bin/pghoard_postgres_command: ERROR: '00000002.history' not found from archive

shamil avatar Jul 04 '17 11:07 shamil

In order to debug this we would need the whole log from both the archiving and restoring ends.

Ormod avatar Jul 04 '17 11:07 Ormod

@Ormod here are the logs, after each step (https://gist.github.com/shamil/6386042b222906c338382e8b2340da3c)

Thanks

shamil avatar Jul 04 '17 12:07 shamil

interestingly enough I tried similar scenario with wal-e and everything worked out of the box. Long term wal-e is not an option for me, because it doesn't support backups from slave like pghoard, so if we can get this fixed that will be great!

shamil avatar Jul 04 '17 13:07 shamil

I'm assuming you're closing your pghoard immediately after that basebackup, not giving it time to upload it's first WAL file. (hence it's not found when restoring)

PGHoard will at a minimum need one WAL file to be uploaded in addition to the basebackup.

Ormod avatar Jul 04 '17 13:07 Ormod

Tried that, even invoked select pg_switch_xlog(); to trigger wal switch. Similar problem.

shamil avatar Jul 04 '17 13:07 shamil

Well the log doesn't indicate any problems. You might want to turn log_level to DEBUG and see if you can spot why it's not uploading the WAL file.

Ormod avatar Jul 04 '17 13:07 Ormod

In the base_label file I see that START WAL LOCATION is set to 0000001E0000000200000052 but pghoard started to upload WAL files from 0000001E0000000200000054 which basically means we have 2 missing WAL files in the backup storage (S3 in my case)

Why would this happen?

base_label file

START WAL LOCATION: 2/52000060 (file 0000001E0000000200000052)
CHECKPOINT LOCATION: 2/52000098
BACKUP METHOD: streamed
BACKUP FROM: standby
START TIME: 2017-07-04 14:51:11 UTC
LABEL: pghoard_base_backup

shamil avatar Jul 04 '17 15:07 shamil