barman
barman copied to clipboard
Backups are always FAILED when use the option "--retry-times"
Hello,
Barman version $ barman --version 2.19 Barman by EnterpriseDB (www.enterprisedb.com)
Barman error
barman@barman-1:~$ cat "/var/log/barman/postgres-db.log"
Starting backup using postgres method for server postgres-db in /var/lib/barman/postgres-db/base/20220909T220001
Backup start at LSN: D67/C1DE86A8
Starting backup copy via pg_basebackup for 20220909T220001
WARNING: Failure executing a backup using pg_basebackup (attempt 0)
WARNING: The files copied so far will be removed and the backup process will restart in 1200 seconds
Copy done (time: 2 hours, 55 minutes, 32 seconds)
Finalising the backup.
ERROR: Backup failed copying files.
DETAILS: Could not find backup_id 20220909T220001
Processing xlog segments from streaming for postgres-db
0000000300000D6900000046
Enabled options --retry-times and --retry-sleep. These options allow to Barman to restart the backup automatically if the backup fails (pg_basebackup returns non 0 code).
After investigation I have found that the issue happens when pg_basebackup in the 1st attempt returns non 0 code (after that the backup is marked with status FAILED). Then when it automatically restarts the status FAILED isn't changed. Even when the 2nd attempt ends successfully the status isn't changing from FAILED to other. Next, because of the 2nd attempt of pg_basebackup ended successfully Barman tries to purge unused WAL files and call the function _purge_unused_wal_files from the file barman/backup_executor.py. There is code line:
previous_backup = self.backup_manager.get_previous_backup(backup_info.backup_id)
Note: The function by default uses one more argument status_filter=DEFAULT_STATUS_FILTER And DEFAULT_STATUS_FILTER = BackupInfo.STATUS_COPY_DONE, and in its turn STATUS_COPY_DONE = (WAITING_FOR_WALS, DONE)
In its turn the function get_previous_backup calls some other functions like get_available_backups and find_previous_backup_in with argument status_filter as well. But because the backup still has have the status FAILED all these functions ignore this backup. And at the end the function find_previous_backup_in returns error Could not find backup_id ... from exception below:
except ValueError:
raise UnknownBackupIdException("Could not find backup_id %s" % backup_id)
-- BR, Mykhailo K.
Thanks for the detailed bug report @mvk15.
The second part of the analysis, where the _purge_unused_wal_files
function results in an UnknownBackupIdException
due to the backup being filtered out due to having status FAILED
, makes sense to me. However, what I do not understand is how the backup ended up with a FAILED
status in the first place.
I have tried reproducing this with Barman 2.19 in a local Docker environment using a pg_basebackup wrapper script to simulate a failure the second time it is called (Barman calls it as part of a pre-flight check so it needs to return successfully the first time it runs). The behaviour I get is that the backup status does not change when the pg_basebackup
command fails - it remains in the STARTED
state and the backup then completes as normal, for example:
$ barman backup main --retry-times=1 --retry-sleep=0
Starting backup using postgres method for server main in /var/lib/barman/main/base/20220925T073758
Backup start at LSN: 0/430000C8 (000000010000000000000043, 000000C8)
Starting backup copy via pg_basebackup for 20220925T073758
WARNING: Failure executing a backup using pg_basebackup (attempt 0)
WARNING: The files copied so far will be removed and the backup process will restart in 0 seconds
WARNING: pg_basebackup does not copy the PostgreSQL configuration files that reside outside PGDATA. Please manually backup the following files:
/etc/postgresql/13/main/postgresql.conf
/etc/postgresql/13/main/pg_hba.conf
/etc/postgresql/13/main/pg_ident.conf
Copy done (time: 2 minutes, 1 second)
Finalising the backup.
Backup size: 23.3 MiB
Backup end at LSN: 0/45000000 (000000010000000000000044, 00000000)
Backup completed (start time: 2022-09-25 07:37:58.757869, elapsed time: 2 minutes, 1 second)
Based on my understanding of the code (which is unfortunately not guaranteed to be correct) there is nowhere in the retry path that the backup status is changed - the status should remain STARTED
until the backup process completes or fails for other reasons. So I am missing something but I don't know what.
Would you be able to help the debugging efforts by providing the Barman log output from the failed backup?
Hello,
Here a Barman log of one of a database (there are some my additional logs line):
barman@barman-0:~$ cat /var/log/barman/postgres-db.log
DEBUG: The default backup strategy for postgres backup_method is: concurrent_backup
DEBUG: backup cache is EMPTY
DEBUG: var "self.config.basebackups_directory" is /var/lib/barman/postgres-db/base
DEBUG: var "filename" is /var/lib/barman/postgres-db/base/20220924T220003/backup.info
DEBUG: var "filename" is /var/lib/barman/postgres-db/base/20220925T220004/backup.info
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004
Starting backup using postgres method for server postgres-db in /var/lib/barman/postgres-db/base/20220926T220002
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: var "backup.status" is EMPTY
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: (find_previous_backup_in) var "ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: (find_previous_backup_in) var "current" is 2
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: var "backup.status" is EMPTY
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: (find_previous_backup_in) var "ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: (find_previous_backup_in) var "current" is 2
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: detecting data directory
DEBUG: detecting tablespaces
Backup start at LSN: BA/C06112B0
Starting backup copy via pg_basebackup for 20220926T220002
WARNING: Failure executing a backup using pg_basebackup (attempt 0)
WARNING: The files copied so far will be removed and the backup process will restart in 1200 seconds
Copy done (time: 35 minutes, 55 seconds)
DEBUG: Reading backup label: /var/lib/barman/postgres-db/base/20220926T220002/data/backup_label
Finalising the backup.
DEBUG: var "backup.status" is FAILED
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: (find_previous_backup_in) var "ids" is 20220924T220003 20220925T220004
ERROR: Backup failed copying files.
DETAILS: (find_previous_backup_in) Could not find backup_id 20220926T220002
DEBUG: var "backup.status" is FAILED
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: (find_previous_backup_in) var "ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: (find_previous_backup_in) var "current" is 2
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: var "backup.status" is FAILED
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: (find_previous_backup_in) var "ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: (find_previous_backup_in) var "current" is 2
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: Starting archive-wal for server postgres-db
Processing xlog segments from streaming for postgres-db
00000004000000BA000000C3
DEBUG: var "backup.status" is FAILED
DEBUG: backup cache is FULL
DEBUG: backup, var "_ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: (find_previous_backup_in) var "ids" is 20220924T220003 20220925T220004 20220926T220002
DEBUG: (find_previous_backup_in) var "current" is 2
-- BR, Mykhailo K.
Thanks @mvk15. There should also be some logging output written to the log_file
defined in the Barman configuration (defaults to /var/log/barman/barman.log
) - would you be able to post the relevant section of that too?
Hello,
Can't see much info inside /var/log/barman/barman.log
. There are 2 failed backups for 2 databases:
barman@barman-0:~$ grep 'Failure executing a backup using pg_basebackup' /var/log/barman/barman.log.1
2022-09-26 22:08:15,825 [11375] barman.backup_executor WARNING: Failure executing a backup using pg_basebackup (attempt 0)
2022-09-26 23:42:26,873 [11399] barman.backup_executor WARNING: Failure executing a backup using pg_basebackup (attempt 0)
barman@barman-0:~$
barman@barman-0:~$ grep 11375 /var/log/barman/barman.log.1
2022-09-26 22:08:15,825 [11375] barman.backup_executor WARNING: Failure executing a backup using pg_basebackup (attempt 0)
2022-09-26 22:08:15,839 [11375] barman.backup_executor WARNING: The files copied so far will be removed and the backup process will restart in 1200 seconds
2022-09-26 22:36:10,683 [11375] barman.backup ERROR: Backup failed copying files.
barman@barman-0:~$
barman@barman-0:~$ grep 11399 /var/log/barman/barman.log.1
2022-09-26 23:42:26,873 [11399] barman.backup_executor WARNING: Failure executing a backup using pg_basebackup (attempt 0)
2022-09-26 23:42:27,552 [11399] barman.backup_executor WARNING: The files copied so far will be removed and the backup process will restart in 1200 seconds
and here how I start the database backup:
barman@barman-0:~$ grep postgres-db .crontab
# "postgres-db"
0 22 * * * nice -n 10 ionice -c2 -n5 barman -d backup postgres-db 2>&1 | tee "/var/log/barman/postgres-db.log"
So, all info from stdout
and stderr
inside the separate log file that I have showed before.
-- BR, Mykhailo K.
Thanks. If possible, could you set log_level = DEBUG
in the Barman configuration? The barman log file should then contain a little more detail.
@mvk15 can you reproduce the issue and get a FAILED status before retry? That would help a lot to solve it.
I have setup log_level = DEBUG
in Barman configuration however today all backups have successfully finished.
Will wait what will be tomorrow.