pg_probackup
pg_probackup copied to clipboard
ERROR: Thread [2]: WAL segment "..." is absent
При выполеннии инкрементальной копии (PAGE) возникает ошибка из заголовка
ERROR: Thread [2]: WAL segment "/archive/postgresql/wal/servbd/000000010000119C000000FF" is absent ERROR: Thread [1]: Interrupted during WAL reading
При этом "ошибочный" файл в каталоге журналов присутствует
$ sudo -u postgres find /archive/postgresql/wal/servbd/ -name '000000010000119C000000FF*' /archive/postgresql/wal/servbd/000000010000119C000000FF.gz
А в журнале ПГ
INFO: pg_probackup archive-push WAL file: 000000010000119C000000FF, threads: 1/1, batch: 1/1, compression: zlib
INFO: pg_probackup archive-push completed successfully, pushed: 1, skipped: 0, time elapsed: 87ms
Вывод pg_probackup show
$ sudo -u postgres -i pg_probackup show --instance $(hostname -s) | head
==================================================================================================================================================
Instance Version ID Recovery Time Mode WAL Mode TLI Time Data WAL Zratio Start LSN Stop LSN Status
==================================================================================================================================================
servbd 11 R06POQ ---- PAGE ARCHIVE 1/1 1m:19s 0 0 1.00 11A4/10482018 0/0 ERROR
servbd 11 R007G1 2021-09-25 22:57:08+03 FULL ARCHIVE 1/0 57m:37s 74GB 80MB 1.83 119B/F000028 119B/148AAA10 OK
servbd 11 QZN8S1 2021-09-18 22:57:05+03 FULL ARCHIVE 1/0 57m:37s 68GB 80MB 1.88 1184/32000028 1184/3765F7A8 OK
servbd 11 QZC4S1 2021-09-12 22:00:37+03 PAGE ARCHIVE 1/1 38s 70MB 16MB 2.51 1177/2F000028 1177/300C0120 OK
servbd 11 QZAA41 2021-09-11 22:53:32+03 FULL ARCHIVE 1/0 54m:3s 65GB 64MB 1.90 1176/B9000028 1176/BD75A4F8 OK
servbd 11 QZ8FG1 2021-09-10 22:01:39+03 PAGE ARCHIVE 1/1 1m:41s 930MB 16MB 2.61 1176/9000028 1176/A1E1880 OK
servbd 11 QZ6KS1 2021-09-09 22:01:59+03 PAGE ARCHIVE 1/1 2m:1s 1222MB 16MB 2.37 1173/D2000028 1173/D336A1B0 OK
Вывод pg_probackup show --archive
ARCHIVE INSTANCE 'servbd'
=================================================================================================================================
TLI Parent TLI Switchpoint Min Segno Max Segno N segments Size Zratio N backups Status
=================================================================================================================================
1 0 0/0 000000010000113600000056 00000001000011A400000017 13040 82GB 2.48 26 DEGRADED
Посмотрел по другим инстансам, по другим проектам, везде статус DEGRADED
. Это нормально?
Вывод pg_probackup show-config
# Backup instance information
pgdata = /var/lib/pgpro/1c-11/data
system-identifier = 6926224934093253540
xlog-seg-size = 16777216
# Connection parameters
pgdatabase = backup
pghost = localhost
pguser = backup
# Replica parameters
replica-timeout = 5min
# Archive parameters
archive-timeout = 5min
# Logging parameters
log-level-console = INFO
log-level-file = OFF
log-filename = pg_probackup.log
log-rotation-size = 0TB
log-rotation-age = 0d
# Retention parameters
retention-redundancy = 2
retention-window = 30
wal-depth = 7
# Compression parameters
compress-algorithm = zlib
compress-level = 1
# Remote access parame
Пока собирал данные запустил создание копии с логами VERBOSE
, и что странно, копия сделалась ...
Тут же запустил создание копии скриптом (который делает копии и по другим серверам - успешно), опять ошибка ...
Попробую выяснить в чем отличие, по результатам дополню информацией
Дополняю, если делаю бекап с опцией -j 3
он выполняется успешно, а если делаю с -j 2
, то возникает вышеуказанная ошибка
"Хвост" журнала VERBOSE
выглядит вот так
LOG: Thread [1]: Opening compressed WAL segment "/archive/postgresql/wal/servbd/000000010000119C000000FC.gz"
LOG: Thread [1]: Opening compressed WAL segment "/archive/postgresql/wal/servbd/000000010000119C000000FD.gz"
VERBOSE: Thread [1]: Switched to LSN 119C/FD000040
VERBOSE: Thread [2]: Need to switch to the next WAL segment, page LSN 119C/FD000000, record being read LSN 119C/FCFFFF98
LOG: Thread [2]: Opening compressed WAL segment "/archive/postgresql/wal/servbd/000000010000119C000000FD.gz"
LOG: Thread [2]: Opening compressed WAL segment "/archive/postgresql/wal/servbd/000000010000119C000000FE.gz"
VERBOSE: Thread [2]: Need to switch to the next WAL segment, page LSN 119C/FF000000, record being read LSN 119C/FCFFFF98
LOG: Thread [2]: Opening compressed WAL segment "/archive/postgresql/wal/servbd/000000010000119C000000FF.gz"
VERBOSE: Thread [1]: Need to switch to the next WAL segment, page LSN 119C/FE000000, record being read LSN 119C/FDE87AA8
LOG: Thread [1]: Opening compressed WAL segment "/archive/postgresql/wal/servbd/000000010000119C000000FE.gz"
VERBOSE: Thread [2]: Need to switch to the next WAL segment, page LSN 119D/0, record being read LSN 119C/FCFFFF98
LOG: Thread [2]: Opening compressed WAL segment "/archive/postgresql/wal/servbd/000000010000119D00000000.gz"
VERBOSE: Thread [2]: Switched to LSN 119D/E900
LOG: Thread [2]: Opening compressed WAL segment "/archive/postgresql/wal/servbd/000000010000119C000000FF.gz"
VERBOSE: Thread [1]: Need to switch to the next WAL segment, page LSN 119C/FF000000, record being read LSN 119C/FDE87AA8
LOG: Thread [1]: Opening compressed WAL segment "/archive/postgresql/wal/servbd/000000010000119C000000FF.gz"
VERBOSE: Thread [2]: Need to switch to the next WAL segment, page LSN 119D/0, record being read LSN 119D/E900
WARNING: Thread [2]: Could not read WAL record at 119C/FF000000
ERROR: Thread [2]: WAL segment "/archive/postgresql/wal/servbd/000000010000119C000000FF" is absent
ERROR: Thread [1]: Interrupted during WAL reading
ERROR: Pagemap extraction failed, time elasped: 30 sec
WARNING: backup in progress, stop backup
VERBOSE: (query) SET client_min_messages = warning;
VERBOSE: (query) SET datestyle = 'ISO, DMY';
VERBOSE: (query) SELECT pg_catalog.txid_snapshot_xmax(pg_catalog.txid_current_snapshot()), current_timestamp(0)::timestamptz, lsn, labelfile, spcmapfile FROM pg_catalog.pg_stop_backup(false, false)
WARNING: Backup R06U8R is running, setting its status to ERROR
На сколько я понимаю, два потока "подрались" за один файл!?
-j 2
и -j 4
падают с ошибкой
-j 3
и -j 5
нормально создают копию
Валидация выглядит следующим образом
user@servbd:~$ sudo -u postgres -i pg_probackup validate --instance $(hostname -s) --recovery-target-time="2021-09-28 00:00" -j 2
WARNING: Skipping backup R06WYQ, because it has non-valid status: ERROR
INFO: Validating backup R007G1
INFO: Backup R007G1 data files are valid
WARNING: Thread [2]: Could not read WAL record at 119C/FF000000
ERROR: Thread [2]: WAL segment "/archive/postgresql/wal/servbd/000000010000119C000000FF" is absent
ERROR: Thread [1]: Interrupted during WAL reading
WARNING: Recovery can be done up to time 2021-09-27 11:08:08+03, xid 309913654 and LSN 119D/E900
ERROR: Not enough WAL records to time 2021-09-28 00:00:00+03
user@servbd:~$ sudo -u postgres -i pg_probackup validate --instance $(hostname -s) --recovery-target-time="2021-09-28 00:00" -j 3
WARNING: Skipping backup R06WYQ, because it has non-valid status: ERROR
INFO: Validating backup R007G1
INFO: Backup R007G1 data files are valid
INFO: Backup validation completed successfully on time 2021-09-28 00:00:00+03, xid 312157471 and LSN 119F/C6DF0C20
INFO: Backup R007G1 is valid.
INFO: Validate of backup R007G1 completed.
user@servbd:~$ sudo -u postgres -i pg_probackup validate --instance $(hostname -s) --recovery-target-time="2021-09-28 00:00" -j 4
WARNING: Skipping backup R06WYQ, because it has non-valid status: ERROR
INFO: Validating backup R007G1
INFO: Backup R007G1 data files are valid
INFO: Backup validation completed successfully on time 2021-09-28 00:00:00+03, xid 312157471 and LSN 119F/C6DF0C20
INFO: Backup R007G1 is valid.
INFO: Validate of backup R007G1 completed.
user@servbd:~$ sudo -u postgres -i pg_probackup validate --instance $(hostname -s) --recovery-target-time="2021-09-28 00:00" -j 5
WARNING: Skipping backup R06WYQ, because it has non-valid status: ERROR
INFO: Validating backup R007G1
INFO: Backup R007G1 data files are valid
INFO: Backup validation completed successfully on time 2021-09-28 00:00:00+03, xid 312157471 and LSN 119F/C6DF0C20
INFO: Backup R007G1 is valid.
INFO: Validate of backup R007G1 completed.
user@servbd:~$ sudo -u postgres -i pg_probackup validate --instance $(hostname -s) --recovery-target-time="2021-09-28 00:00" -j 2
WARNING: Skipping backup R06WYQ, because it has non-valid status: ERROR
INFO: Validating backup R007G1
INFO: Backup R007G1 data files are valid
WARNING: Thread [2]: Could not read WAL record at 119C/FF000000
ERROR: Thread [2]: WAL segment "/archive/postgresql/wal/servbd/000000010000119C000000FF" is absent
ERROR: Thread [1]: Interrupted during WAL reading
WARNING: Recovery can be done up to time 2021-09-27 11:08:08+03, xid 309913654 and LSN 119D/E900
ERROR: Not enough WAL records to time 2021-09-28 00:00:00+03
Запуск осуществлялся один за одним ...
Добрый день!
Посмотрел по другим инстансам, по другим проектам, везде статус DEGRADED. Это нормально?
Это следствие wal-depth = 7
. Это нормально, но выглядит коряво, в планах это поправить.
На сколько я понимаю, два потока "подрались" за один файл!?
Очень странно, постараюсь разобраться. Это баг, конечно
Вести "с полей" следующие
$ pg_probackup-10 version
pg_probackup 2.5.4 (PostgreSQL 10.19)
Копия в 3 потока, которая раньше решала ...
$ sudo -u probackup -i pg_probackup backup -b PAGE --compress --instance srvdb -j 3
INFO: Backup start, pg_probackup version: 2.5.4, instance: srvdb, backup ID: R92YEP, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compress
-algorithm: zlib, compress-level: 1
WARNING: This PostgreSQL instance was initialized without data block checksums. pg_probackup have no way to detect data block corruption without them.
Reinitialize PGDATA with option '--data-checksums'.
INFO: wait for pg_start_backup()
WARNING: Backup R92XXT has status: ERROR. Cannot be a parent.
WARNING: Backup R92XQV has status: ERROR. Cannot be a parent.
WARNING: Backup R92X1B has status: ERROR. Cannot be a parent.
INFO: Parent backup: R907C1
INFO: PGDATA size: 536GB
INFO: Extracting pagemap of changed blocks
WARNING: Thread [1]: Could not read WAL record at 12C8/2A000000
ERROR: Thread [1]: WAL segment "/data/probackup/wal/srvdb/00000003000012C80000002A" is absent
ERROR: Thread [3]: Interrupted during WAL reading
ERROR: Thread [2]: Interrupted during WAL reading
ERROR: Pagemap extraction failed, time elasped: 37 sec
WARNING: backup in progress, stop backup
WARNING: Backup R92YEP is running, setting its status to ERROR
Копия в один поток
$ sudo -u probackup -i pg_probackup backup -b PAGE --compress --instance srvdb
INFO: Backup start, pg_probackup version: 2.5.4, instance: srvdb, backup ID: R92YHE, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compress
-algorithm: zlib, compress-level: 1
WARNING: This PostgreSQL instance was initialized without data block checksums. pg_probackup have no way to detect data block corruption without them.
Reinitialize PGDATA with option '--data-checksums'.
INFO: wait for pg_start_backup()
WARNING: Backup R92YEP has status: ERROR. Cannot be a parent.
WARNING: Backup R92XXT has status: ERROR. Cannot be a parent.
WARNING: Backup R92XQV has status: ERROR. Cannot be a parent.
WARNING: Backup R92X1B has status: ERROR. Cannot be a parent.
INFO: Parent backup: R907C1
INFO: PGDATA size: 536GB
INFO: Extracting pagemap of changed blocks
WARNING: Thread [1]: Could not read WAL record at 12C8/29000000
ERROR: Thread [1]: WAL segment "/data/probackup/wal/srvdb/00000003000012C800000029" is absent
ERROR: Pagemap extraction failed, time elasped: 96 sec
WARNING: backup in progress, stop backup
WARNING: Backup R92YHE is running, setting its status to ERROR
Занятно, но "отсутсвует" совсем другой файл wal. В два потока ругается на тот же файл, что при копировании в один поток А вот в 5 потоков "проглотил"
$ sudo -u probackup -i pg_probackup backup -b PAGE --compress --instance srvdb -j 5
INFO: Backup start, pg_probackup version: 2.5.4, instance: srvdb, backup ID: R92YKA, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compress-algorithm: zlib, compress-level: 1
WARNING: This PostgreSQL instance was initialized without data block checksums. pg_probackup have no way to detect data block corruption without them. Reinitialize PGDATA with option '--data-checksums'.
INFO: wait for pg_start_backup()
WARNING: Backup R92YHE has status: ERROR. Cannot be a parent.
WARNING: Backup R92YEP has status: ERROR. Cannot be a parent.
WARNING: Backup R92XXT has status: ERROR. Cannot be a parent.
WARNING: Backup R92XQV has status: ERROR. Cannot be a parent.
WARNING: Backup R92X1B has status: ERROR. Cannot be a parent.
INFO: Parent backup: R907C1
INFO: PGDATA size: 536GB
INFO: Extracting pagemap of changed blocks
INFO: Pagemap successfully extracted, time elapsed: 50 sec
INFO: Start transferring data files
INFO: Data files are transferred, time elapsed: 2m:50s
INFO: wait for pg_stop_backup()
INFO: pg_stop backup() successfully executed
INFO: Syncing backup files to disk
INFO: Backup files are synced, time elapsed: 3s
INFO: Validating backup R92YKA
INFO: Backup R92YKA data files are valid
INFO: Backup R92YKA resident size: 3041MB
INFO: Backup R92YKA completed
Некоторое количество журналов с probackup 2.5.5
$ sudo -u probackup -i pg_probackup-11 backup -b PAGE --instance srvdb --compress
INFO: Backup start, pg_probackup version: 2.5.5, instance: srvdb, backup ID: RSO9SZ, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compres
s-algorithm: zlib, compress-level: 1
INFO: wait for pg_start_backup()
INFO: Parent backup: RSLM26
INFO: Wait for WAL segment /data/backups/wal/srvdb/000000020000066400000021 to be archived
INFO: PGDATA size: 263GB
INFO: Extracting pagemap of changed blocks
WARNING: Thread [1]: Could not read WAL record at 654/DB000000
ERROR: Thread [1]: WAL segment "/data/backups/wal/srvdb/0000000200000654000000DB" is absent
ERROR: Pagemap extraction failed, time elasped: 156 sec
WARNING: backup in progress, stop backup
WARNING: Backup RSO9SZ is running, setting its status to ERROR
$ sudo -u probackup -i pg_probackup-11 backup -b PAGE --instance srvdb --compress -j 3
INFO: Backup start, pg_probackup version: 2.5.5, instance: srvdb, backup ID: RSOACF, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compres
s-algorithm: zlib, compress-level: 1
INFO: wait for pg_start_backup()
INFO: Parent backup: RSLM26
INFO: Wait for WAL segment /data/backups/wal/srvdb/000000020000066400000026 to be archived
INFO: PGDATA size: 263GB
INFO: Extracting pagemap of changed blocks
WARNING: Thread [3]: Could not read WAL record at 654/DC000000
ERROR: Thread [3]: WAL segment "/data/backups/wal/srvdb/0000000200000654000000DC" is absent
ERROR: Thread [2]: Interrupted during WAL reading
ERROR: Thread [1]: Interrupted during WAL reading
ERROR: Pagemap extraction failed, time elasped: 144 sec
WARNING: backup in progress, stop backup
WARNING: Backup RSOACF is running, setting its status to ERROR
$ sudo -u probackup -i pg_probackup-11 backup -b PAGE --instance srvdb --compress -j 4
INFO: Backup start, pg_probackup version: 2.5.5, instance: srvdb, backup ID: RSODEH, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compres
s-algorithm: zlib, compress-level: 1
INFO: wait for pg_start_backup()
INFO: Parent backup: RSLM26
INFO: Wait for WAL segment /data/backups/wal/srvdb/000000020000066400000065 to be archived
INFO: PGDATA size: 263GB
INFO: Extracting pagemap of changed blocks
WARNING: Thread [4]: Could not read WAL record at 654/DD000000
ERROR: Thread [4]: WAL segment "/data/backups/wal/srvdb/0000000200000654000000DD" is absent
ERROR: Thread [2]: Interrupted during WAL reading
ERROR: Thread [3]: Interrupted during WAL reading
ERROR: Thread [1]: Interrupted during WAL reading
ERROR: Pagemap extraction failed, time elasped: 169 sec
WARNING: backup in progress, stop backup
WARNING: Backup RSODEH is running, setting its status to ERROR
$ sudo -u probackup -i pg_probackup-11 backup -b PAGE --instance srvdb --compress -j 5
INFO: Backup start, pg_probackup version: 2.5.5, instance: srvdb, backup ID: RSODKY, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compres
s-algorithm: zlib, compress-level: 1
INFO: wait for pg_start_backup()
WARNING: Backup RSODEH has status: ERROR. Cannot be a parent.
INFO: Parent backup: RSLM26
INFO: Wait for WAL segment /data/backups/wal/srvdb/000000020000066400000069 to be archived
INFO: PGDATA size: 263GB
INFO: Extracting pagemap of changed blocks
WARNING: Thread [4]: Could not read WAL record at 654/DF000000
ERROR: Thread [4]: WAL segment "/data/backups/wal/srvdb/0000000200000654000000DF" is absent
ERROR: Thread [1]: Interrupted during WAL reading
ERROR: Thread [2]: Interrupted during WAL reading
ERROR: Thread [5]: Interrupted during WAL reading
ERROR: Thread [3]: Interrupted during WAL reading
ERROR: Pagemap extraction failed, time elasped: 171 sec
WARNING: backup in progress, stop backup
WARNING: Backup RSODKY is running, setting its status to ERROR
Обновился до probackup 2.5.11
$ sudo -u probackup -i pg_probackup-11 backup -b PAGE --instance srvdb --compress -j 2
INFO: Backup start, pg_probackup version: 2.5.11, instance: srvdb, backup ID: RSOEEI, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compress-algorithm: zlib, compress-level: 1
INFO: This PostgreSQL instance was initialized with data block checksums. Data block corruption will be detected
INFO: Database backup start
INFO: wait for pg_start_backup()
INFO: Parent backup: RSLM26
INFO: Wait for WAL segment /data/backups/wal/srvdb/000000020000066400000083 to be archived
INFO: PGDATA size: 263GB
INFO: Current Start LSN: 664/83000AE8, TLI: 2
INFO: Parent Start LSN: 650/7C0AB2C0, TLI: 2
INFO: Extracting pagemap of changed blocks
WARNING: Thread [2]: Could not read WAL record at 654/DB000000
ERROR: Thread [2]: WAL segment "/data/backups/wal/srvdb/0000000200000654000000DB" is absent
ERROR: Thread [1]: Interrupted during WAL reading
ERROR: Pagemap extraction failed, time elasped: 148 sec
WARNING: backup in progress, stop backup
WARNING: Backup RSOEEI is running, setting its status to ERROR
$ sudo -u probackup -i pg_probackup-11 backup -b PAGE --instance srvdb --compress -j 3
INFO: Backup start, pg_probackup version: 2.5.11, instance: srvdb, backup ID: RSOEJB, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compress-algorithm: zlib, compress-level: 1
INFO: This PostgreSQL instance was initialized with data block checksums. Data block corruption will be detected
INFO: Database backup start
INFO: wait for pg_start_backup()
WARNING: Backup RSOEEI has status: ERROR. Cannot be a parent.
INFO: Parent backup: RSLM26
INFO: Wait for WAL segment /data/backups/wal/srvdb/00000002000006640000008C to be archived
INFO: PGDATA size: 263GB
INFO: Current Start LSN: 664/8C000028, TLI: 2
INFO: Parent Start LSN: 650/7C0AB2C0, TLI: 2
INFO: Extracting pagemap of changed blocks
WARNING: Thread [1]: Could not read WAL record at 654/DC000000
ERROR: Thread [1]: WAL segment "/data/backups/wal/srvdb/0000000200000654000000DC" is absent
ERROR: Thread [3]: Interrupted during WAL reading
ERROR: Thread [2]: Interrupted during WAL reading
ERROR: Pagemap extraction failed, time elasped: 156 sec
WARNING: backup in progress, stop backup
WARNING: Backup RSOEJB is running, setting its status to ERROR
$ sudo -u probackup -i pg_probackup-11 backup -b PAGE --instance srvdb --compress -j 4
INFO: Backup start, pg_probackup version: 2.5.11, instance: srvdb, backup ID: RSOETY, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compre
ss-algorithm: zlib, compress-level: 1
INFO: This PostgreSQL instance was initialized with data block checksums. Data block corruption will be detected
INFO: Database backup start
INFO: wait for pg_start_backup()
INFO: Parent backup: RSLM26
INFO: Wait for WAL segment /data/backups/wal/srvdb/00000002000006640000009B to be archived
INFO: PGDATA size: 263GB
INFO: Current Start LSN: 664/9B004798, TLI: 2
INFO: Parent Start LSN: 650/7C0AB2C0, TLI: 2
INFO: Extracting pagemap of changed blocks
WARNING: Thread [1]: Could not read WAL record at 654/DE000000
ERROR: Thread [1]: WAL segment "/data/backups/wal/srvdb/0000000200000654000000DE" is absent
WARNING: Thread [4]: Could not read WAL record at 654/DD000000
ERROR: Thread [4]: WAL segment "/data/backups/wal/srvdb/0000000200000654000000DD" is absent
ERROR: Thread [2]: Interrupted during WAL reading
ERROR: Thread [3]: Interrupted during WAL reading
ERROR: Pagemap extraction failed, time elasped: 153 sec
WARNING: backup in progress, stop backup
WARNING: Backup RSOETY is running, setting its status to ERROR
$ sudo -u probackup -i pg_probackup-11 backup -b PAGE --instance srvdb --compress -j 5
INFO: Backup start, pg_probackup version: 2.5.11, instance: srvdb, backup ID: RSOF2B, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compre
ss-algorithm: zlib, compress-level: 1
INFO: This PostgreSQL instance was initialized with data block checksums. Data block corruption will be detected
INFO: Database backup start
INFO: wait for pg_start_backup()
INFO: Parent backup: RSLM26
INFO: Wait for WAL segment /data/backups/wal/srvdb/00000002000006640000009E to be archived
INFO: PGDATA size: 263GB
INFO: Current Start LSN: 664/9E000028, TLI: 2
INFO: Parent Start LSN: 650/7C0AB2C0, TLI: 2
INFO: Extracting pagemap of changed blocks
WARNING: Thread [4]: Could not read WAL record at 654/DE000000
ERROR: Thread [4]: WAL segment "/data/backups/wal/srvdb/0000000200000654000000DE" is absent
ERROR: Thread [3]: Interrupted during WAL reading
ERROR: Thread [5]: Interrupted during WAL reading
ERROR: Thread [2]: Interrupted during WAL reading
ERROR: Thread [1]: Interrupted during WAL reading
ERROR: Pagemap extraction failed, time elasped: 168 sec
WARNING: backup in progress, stop backup
WARNING: Backup RSOF2B is running, setting its status to ERROR
Наверное интересной выглядит зависимость "проблемного" файла журнала от количества потоков (что в 2.5.5, что в 2.5.11)
На текущий момент с данным конкретным хостом, полагаю, единственный выход - это сделать FULL
Однако, так как проблема встречается совершенно случайно на разных проектах и четкого сценария воспрозведения нет, а здесь воспроизводится стабильно, то может быть собрать какую-то дополнительную диагностическую информацию? Вопрос, какую?
Уж очень бы хотелось закрыть данную проблему! ;)
Готово к релизу. 2.5.12
На 2.5.12-1 проблема сохраняется
INFO: Backup start, pg_probackup version: 2.5.12, instance: srvdb, backup ID: RT9NLH, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compress-algorithm: zlib, compress-level: 1 WARNING: This PostgreSQL instance was initialized without data block checksums. pg_probackup have no way to detect data block corruption without them. Reinitialize PGDATA with option '--data-checksums'. INFO: Database backup start INFO: wait for pg_start_backup() INFO: Parent backup: RT7SMT INFO: PGDATA size: 418GB INFO: Current Start LSN: 1157/BC000028, TLI: 1 INFO: Parent Start LSN: 114F/F8003418, TLI: 1 INFO: Extracting pagemap of changed blocks WARNING: Thread [2]: Could not read WAL record at 1154/3C000000 ERROR: Thread [2]: WAL segment "/data/backups/wal/srvdb/00000001000011540000003C" is absent ERROR: Thread [1]: Interrupted during WAL reading ERROR: Pagemap extraction failed, time elasped: 332 sec WARNING: A backup is in progress, stopping it. WARNING: Backup RT9NLH is running, setting its status to ERROR
Так а файл /data/backups/wal/srvdb/00000001000011540000003C или /data/backups/wal/srvdb/00000001000011540000003C.gz существует?
Если нет, то по какой причине? Что писал archive-push в логах постгресса?
Если у вас есть автозачистка wal-ов, как она настроена? Пользуетесь ли вы для этого pg_probackup? Если да, то с какими опциями? Если нет, то чем и как?
Так а файл /data/backups/wal/srvdb/00000001000011540000003C или /data/backups/wal/srvdb/00000001000011540000003C.gz существует?
Если вернуться к стартовому посту, то там показываю, что файл присутстсвует в хранилище и способ его доставки в хранилище так же виден из лога ПГ
Сейчас попробую собрать "материал" на текущей "проблемной" ситуации
Прошу прощения за "кипиш", похоже ложная тревога!
Описываю текущую ситуацию:
- В каталоге с журналами был файл 00000001000011540000003C.gz.part
- В журнале PG присутствуют такие записи
2023-04-17 12:38:46 MSK [969]: [1-1]: INFO: pg_probackup archive-push WAL file: 00000001000011540000003C, threads: 1/1, batch: 1/1, compression: zlib 2023-04-17 12:38:46 MSK [969]: [1-1]: INFO: pg_probackup archive-push completed successfully, pushed: 1, skipped: 0, time elapsed: 403ms ERROR: Agent error: ssh: connect to host srvbackup port 2022: No route to host
2023-04-17 12:40:22.393 MSK [4268] СООБЩЕНИЕ: команда архивации завершилась ошибкой с кодом 1 2023-04-17 12:40:22.393 MSK [4268] ПОДРОБНОСТИ: Команда архивации с ошибкой: /usr/bin/pg_probackup-$(cat PG_VERSION) archive-push -B /data/backups --instance $(hostname -s) --wal-file-name=00000001000011540000003D --remote-user probackup --remote-host srvbackup --remote-port 2022 --compress ERROR: Agent error: ssh: connect to host srvbackup port 2022: No route to host
В этот момент произошел массовый рестар ВМ в ЦОД-е
Итог, фалик 00000001000011540000003C.gz.part переименовал в 00000001000011540000003C.gz, так как судя по логу он вроде как должен быть валидный, запустил создание резервной копии, все отработало штатно!
Получается так, что на стороне ПГ файл журнала уже пометился как заархивированный, и в каталоге кластера он был удален, а на файловом сервере еще не был переименован из-за нарушения связи с агентом pg_probackup?
Выглядит вроде как "баг" :-/
Итог, фалик 00000001000011540000003C.gz.part переименовал в 00000001000011540000003C.gz, так как судя по логу он вроде как должен быть валидный
Есть вероятность, что да: мы не проверяем успешность ренейма по ssh :-( Видимо, в этом и есть причина.
На всякий случай, запустите валидацию валов (pg_probackup validate ...)
На всякий случай, запустите валидацию валов (pg_probackup validate ...)
Валидацию прошел успешно
$ sudo -u probackup -i pg_probackup validate --instance srvdb --recovery-target-time=2023-04-18 15:30:00+03
INFO: Validating parents for backup RT7SMT
INFO: Validating backup RT5YJQ
INFO: Backup RT5YJQ data files are valid
INFO: Validating backup RT7SMT
INFO: Backup RT7SMT data files are valid
INFO: Backup validation completed successfully on time 2023-04-18 00:00:00+03, xid 1526978130 and LSN 1158/555DE498
INFO: Backup RT7SMT is valid.
INFO: Validate of backup RT7SMT completed.
Ошибся правда с указанием времени, забыл кавычки, но "проблемный" журнал был в 12:38 17.04.2023 исходя из лога ПГ, так что вроде все норм
И все таки оно "не работает" :-( Другой проект был на 2.5.8
INFO: Backup start, pg_probackup version: 2.5.8, instance: srvdb1, backup ID: RTBEY1, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compress-algorithm: zlib, compress-level: 1 INFO: This PostgreSQL instance was initialized with data block checksums. Data block corruption will be detected INFO: Database backup start INFO: wait for pg_start_backup() WARNING: Backup RT9KA1 has status: ERROR. Cannot be a parent. INFO: Parent backup: RT7PM1 INFO: PGDATA size: 459GB INFO: Current Start LSN: 1ACC/E00000D8, TLI: 1 INFO: Parent Start LSN: 1AB3/F0000028, TLI: 1 INFO: Extracting pagemap of changed blocks WARNING: Thread [1]: Could not read WAL record at 1AB7/91000000 ERROR: Thread [1]: WAL segment "/data/probackup/wal/srvdb1/0000000100001AB700000091" is absent ERROR: Thread [2]: Interrupted during WAL reading ERROR: Pagemap extraction failed, time elasped: 36 sec WARNING: backup in progress, stop backup WARNING: Backup RTBEY1 is running, setting its status to ERROR
Обновился до 2.5.12-1
INFO: Backup start, pg_probackup version: 2.5.12, instance: srvdb1, backup ID: RTCSVG, backup mode: PAGE, wal mode: ARCHIVE, remote: true, compress-algorithm: zlib, compress-level: 1 INFO: This PostgreSQL instance was initialized with data block checksums. Data block corruption will be detected INFO: Database backup start INFO: wait for pg_start_backup() WARNING: Backup RTBEY1 has status: ERROR. Cannot be a parent. WARNING: Backup RT9KA1 has status: ERROR. Cannot be a parent. INFO: Parent backup: RT7PM1 INFO: PGDATA size: 459GB INFO: Current Start LSN: 1AD0/CD0000D8, TLI: 1 INFO: Parent Start LSN: 1AB3/F0000028, TLI: 1 INFO: Extracting pagemap of changed blocks WARNING: Thread [1]: Could not read WAL record at 1AB7/91000000 ERROR: Thread [1]: WAL segment "/data/probackup/wal/srvdb1/0000000100001AB700000091" is absent ERROR: Thread [2]: Interrupted during WAL reading ERROR: Pagemap extraction failed, time elasped: 35 sec WARNING: A backup is in progress, stopping it. WARNING: Backup RTCSVG is running, setting its status to ERROR
файлик вроде бы на месте
$ sudo find /data/probackup/wal/srvdb1/ -name '0000000100001AB700000091*'
/data/probackup/wal/srvdb1/0000000100001AB700000091.gz
Команда, которой выполняется резервная копия
$ sudo -u probackup -i pg_probackup-14 backup --instance srvdb1 -b PAGE -j 2 --compress
В журнале PG
2023-04-17 11:40:39 +04 [8407]: [1-1]: INFO: pg_probackup archive-push WAL file: 0000000100001AB700000090, threads: 1/1, batch: 1/1, compression: zlib 2023-04-17 11:40:40 +04 [8407]: [1-1]: INFO: pg_probackup archive-push completed successfully, pushed: 1, skipped: 0, ti me elapsed: 410ms 2023-04-17 11:40:40 +04 [8423]: [1-1]: INFO: pg_probackup archive-push WAL file: 0000000100001AB700000091, threads: 1/1, batch: 1/1, compression: zlib 2023-04-17 11:40:40 +04 [8423]: [1-1]: INFO: pg_probackup archive-push completed successfully, pushed: 1, skipped: 0, time elapsed: 266ms 2023-04-17 11:40:41 +04 [8433]: [1-1]: INFO: pg_probackup archive-push WAL file: 0000000100001AB700000092, threads: 1/1, batch: 1/1, compression: zlib
Куда смотреть, чего собрать?
Коллеги, подскажите, есть ли подвижки? Может мы можем как-то помочь в решении (собрать какие-то дополнительные данные)?
Наблюдааем проблему если не на всех, то на многих проектах. Да, проблема "плавающая" и не имеет 100% сценария воспроизведения. Но если вы подскажете, что можно собрать дополнительно, то постараемся это сделать при возникновении очередного "сбоя" (возникает с некоторой регулярностью). Пока "лечим" созданием дополнительной FULL-копии
Мы кажется нашли причину. Она оказалась не тривиальной. Пушнем фикс на след.неделе.
Мы кажется нашли причину. Она оказалась не тривиальной. Пушнем фикс на след.неделе.
Круто, если так!