Backups waiting to finalize on Standbys
In a specific configuration, when taking rsync concurrent backups on a standby with archive_mode = always, the Barman backups taken on this server (especially when the load on the primary server is low) will wait with a message like:
Asking PostgreSQL server to finalize the backup.
Backups would normally end when the primary server switches to a new WAL file, or when a switch is forced with pg_switch_wal().
From the documentation PG >= 10:
https://www.postgresql.org/docs/14/functions-admin.html#:~:text=There%20is%20an%20optional%20second%20parameter%20of%20type%20boolean.%20If%20false%2C%20the%20function%20will%20return%20immediately%20after%20the%20backup%20is%20completed%2C%20without%20waiting%20for%20WAL%20to%20be%20archived.%20This%20behavior%20is%20only%20useful%20with%20backup%20software%20that%20independently%20monitors%20WAL%20archiving
For this case, it seems the code that calls pg_stop_backup could make better use of the second option, wait_for_archive=false, in order to finalize the backup command. It could also make better use of the --wait and --wait-timeout settings, as we believe that was the intention.
Mentioning @martinmarques and @tureba as they have been involved in some of the testing.
In effect, it would mean changing line https://github.com/EnterpriseDB/barman/blob/master/barman/postgres.py#L1242
From: pg_stop_backup(FALSE)
To never wait for archiving: pg_stop_backup(FALSE, FALSE)
Or to give it a boolean in the second argument indicating if the user provided the --wait argument or not.
I think the first option is adequate for Barman.
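To make the two options concrete, here is a minimal sketch. It is not the code in barman/postgres.py: the helper name stop_backup_call and both of its parameters are hypothetical, and it only builds the text of the function call for PostgreSQL >= 10. The first argument (FALSE) selects a non-exclusive backup and the second is wait_for_archive.

```python
def stop_backup_call(bind_to_wait_flag=False, user_passed_wait=False):
    """Return the pg_stop_backup call text (illustrative helper, PG >= 10 only).

    bind_to_wait_flag=False -> first option: never wait for archiving.
    bind_to_wait_flag=True  -> second option: wait_for_archive follows --wait.
    """
    if bind_to_wait_flag:
        wait_for_archive = user_passed_wait
    else:
        wait_for_archive = False
    return "pg_stop_backup(FALSE, %s)" % ("TRUE" if wait_for_archive else "FALSE")
```

With the first option the result is always pg_stop_backup(FALSE, FALSE), regardless of whether --wait was given.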
FYI, this is related to a customer ticket RT75337
I've spent a bit of time thinking this one through - ultimately I think the proposed change is fine, but I'm a bit concerned because we'd be changing some undocumented behaviour (specifically that even if barman backup runs without the --wait flag, it won't return until PostgreSQL has archived all WALs required for the consistency of the backup).
The current (abridged) behaviour of barman backup without the --wait flag is:
1. Barman will call pg_stop_backup, which will use the default wait_for_archive value and wait until PostgreSQL has archived all WALs up to and including the last segment required for the consistency of the backup.
2. Barman will complete the backup and check whether all required WALs have been archived. If they have, the backup status will be set to DONE, otherwise WAITING_FOR_WALS.
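The status check at the end of the backup can be pictured roughly like this. It is a sketch under assumptions, not Barman's implementation: the function name backup_status is hypothetical, and WAL segments are represented simply as file-name strings.

```python
def backup_status(required_wals, archived_wals):
    """Return DONE if every required WAL is already archived, else WAITING_FOR_WALS.

    Both arguments are iterables of WAL segment file names (an assumption
    made for this sketch).
    """
    archived = set(archived_wals)
    if all(wal in archived for wal in required_wals):
        return "DONE"
    return "WAITING_FOR_WALS"
```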
If the --wait option is used:
1. Barman will call pg_stop_backup in the exact same way as before and so will wait until PostgreSQL has archived all WALs up to and including the last segment required for consistency of the backup.
2. Barman will call its own archive function, which moves WALs from the incoming directory on the Barman server to the archive location on the Barman server and updates the xlog.db metadata.
3. Barman will check whether the last required WAL segment is present in the archive. If not, we go back to step 2 until the --wait-timeout is reached (if set, otherwise we loop forever).
4. Once the archive check has succeeded, Barman completes the backup, saving the status as DONE.
In both cases the barman backup command will not return until all required WALs have been archived and therefore have been safely copied to the Barman server. When --wait is used the command won't return until the WALs have been moved from the incoming directory on the Barman server to their final archive destination.
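The --wait loop described above (archive, check, repeat until --wait-timeout) has roughly the following shape. This is only a sketch: wait_for_last_wal, archive_incoming, last_wal_archived and poll_interval are hypothetical names, not Barman's API, and returning False on timeout is an assumption, since what happens on timeout is not covered here.

```python
import time


def wait_for_last_wal(archive_incoming, last_wal_archived, timeout=None,
                      poll_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Repeat archive + check until the last required WAL is in the archive.

    archive_incoming:  callable that moves WALs from incoming to the archive.
    last_wal_archived: callable returning True once the last required WAL
                       segment is present in the archive.
    timeout:           seconds to wait (like --wait-timeout); None loops forever.
    Returns True on success, False if the timeout was reached (assumed behaviour).
    """
    start = clock()
    while True:
        archive_incoming()          # step 2: run the archiver
        if last_wal_archived():     # step 3: check for the last required WAL
            return True
        if timeout is not None and clock() - start >= timeout:
            return False
        sleep(poll_interval)
```

The clock and sleep parameters exist only so the loop can be exercised without real delays.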
I agree with the assumptions in this issue about the intentions of the --wait option, specifically:
1. Using the --wait flag is intended to mean "do not return until all WALs are safely in the Barman WAL archive".
2. Omitting the --wait flag is intended to mean "return straight away regardless of where the WALs are because we will rely on PostgreSQL's archive_command and barman cron to make sure the WALs end up safely in the Barman archive".
Currently Barman achieves 1 but does not achieve 2.
If we make the proposed change and set the wait_for_archive argument to FALSE when calling pg_stop_backup then we still achieve 1 because Barman will do its own waiting until the required WALs are in its archive destination. We would also achieve 2 because pg_stop_backup continues to return the last required WAL segment but no longer waits for PostgreSQL to archive it.
My problem with this proposal is that existing Barman users may be unknowingly relying on the fact that Barman currently waits until PostgreSQL has archived the last required WAL even if --wait is not set. If such users are not monitoring their backup status and are simply using the fact that barman backup returns as an indicator that the WALs have been copied off the PostgreSQL server then this would be a breaking change.
I don't think this is a blocker to making this change because it doesn't break the documented behaviour. However, I think we need to be careful about how we describe it in the documentation and the release notes so that we do not leave some users in a position where they think their WALs have been archived (from the PostgreSQL perspective) when actually they have not.
To summarize, the change here is going to be:
- Barman will use wait_for_archive=FALSE when calling pg_stop_backup.
- This means the backup will finish whether or not PostgreSQL considers the required WALs to have been archived. This will resolve the issue where backups cannot complete on standby servers unless a WAL switch happens to occur on the primary.
The following behaviour will be unchanged:
- Backups for which Barman does not yet have the required WALs will continue to be in status WAITING_FOR_WALS until a subsequent barman cron run finds that the required WALs have been archived.
- The --wait flag will cause barman backup to wait until Barman determines the required WALs have been archived before returning.
- The --wait-timeout flag determines how long barman backup will wait.
This only applies to PostgreSQL >= 10. Earlier versions will continue to wait at the pg_stop_backup stage since there is no wait_for_archive option.
After a lot of testing and discussion the workaround proposed in PR #580 is being abandoned in favour of #596.
The underlying issue here should be resolved by #596.