barman Backups waiting to finalize on Standbys

In a specific configuration, when taking rsync concurrent backups on a standby with archive_mode = always, the Barman backups taken on this server (especially when the load on the primary server is low) will wait with a message like: Asking PostgreSQL server to finalize the backup.

Backups would normally end when the primary server switches to a new WAL file, or when a switch is forced with pg_switch_wal().

From the PostgreSQL documentation (PG >= 10):

https://www.postgresql.org/docs/14/functions-admin.html#:~:text=There%20is%20an%20optional%20second%20parameter%20of%20type%20boolean.%20If%20false%2C%20the%20function%20will%20return%20immediately%20after%20the%20backup%20is%20completed%2C%20without%20waiting%20for%20WAL%20to%20be%20archived.%20This%20behavior%20is%20only%20useful%20with%20backup%20software%20that%20independently%20monitors%20WAL%20archiving

For this case, it seems the code that calls pg_stop_backup could make better use of the second parameter, wait_for_archive = false, in order to finalize the backup command. It could also make better use of the --wait and --wait-timeout settings, as we believe that was the intention.

Mentioning @martinmarques and @tureba as they have been involved in some of the testing.

Nov 08 '21 18:11 lucianobotti

In effect, it would mean changing line https://github.com/EnterpriseDB/barman/blob/master/barman/postgres.py#L1242

From: pg_stop_backup(FALSE)

To never wait for archiving: pg_stop_backup(FALSE, FALSE)

Or to give it a boolean in the second argument indicating if the user provided the --wait argument or not.

I think the first option is adequate for Barman.
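
A minimal sketch of the two options above, assuming a hypothetical helper named build_stop_backup_query. It is not Barman's actual code; it only shows the SQL each option would send. The two-argument form of pg_stop_backup is available in PostgreSQL 10 and later.

```python
from typing import Optional


def build_stop_backup_query(wait_for_archive: Optional[bool] = None) -> str:
    """Return the SQL used to finish a concurrent (non-exclusive) backup.

    Hypothetical helper for illustration only; not part of Barman.
    """
    if wait_for_archive is None:
        # Current call: the second argument is omitted, so PostgreSQL
        # applies its default and waits for WAL archiving.
        return "SELECT * FROM pg_stop_backup(FALSE)"
    # Second argument given explicitly. It could be hard-coded to False
    # (first option) or derived from whether --wait was passed (second option).
    flag = "TRUE" if wait_for_archive else "FALSE"
    return "SELECT * FROM pg_stop_backup(FALSE, %s)" % flag


print(build_stop_backup_query())       # current call
print(build_stop_backup_query(False))  # first option: never wait
```

With the second option, the caller would pass the value of the --wait flag instead of a constant.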

Nov 08 '21 19:11 tureba

FYI, this is related to a customer ticket RT75337

Nov 09 '21 01:11 martinmarques

I've spent a bit of time thinking this one through - ultimately I think the proposed change is fine, but I'm a bit concerned because we'd be changing some undocumented behaviour (specifically that even if barman backup runs without the --wait flag, it won't return until PostgreSQL has archived all WALs required for the consistency of the backup).

The current (abridged) behaviour of barman backup without --wait is:

1. Barman will call pg_stop_backup which will use the default wait_for_archive value and wait until PostgreSQL has archived all WALs up to and including the last segment required for the consistency of the backup.
2. Barman will complete the backup and check whether all required WALs have been archived - if they have, the backup status will be set to DONE, otherwise WAITING_FOR_WALS.

If the --wait option is used:

1. Barman will call pg_stop_backup in the exact same way as before and so will wait until PostgreSQL has archived all WALs up to and including the last segment required for consistency of the backup.
2. Barman will call its own archive function which moves WALs from the incoming directory on the Barman server to the archive location on the Barman server and updates the xlog.db metadata.
3. Barman will check whether the last required WAL segment is present in the archive - if not then we go back to step 2 until the --wait-timeout is reached (if set, otherwise we loop forever).
4. Once the archive check has succeeded Barman completes the backup, saving the status as DONE.
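
The archive-and-check loop in the --wait steps above can be sketched roughly as follows. This is an illustration under assumptions, not Barman's implementation: wait_for_last_wal, archive_wals, last_wal_archived and poll_interval are hypothetical names, and what Barman does once the timeout expires is not covered in this thread, so the sketch simply returns False.

```python
import time
from typing import Callable, Optional


def wait_for_last_wal(
    archive_wals: Callable[[], None],
    last_wal_archived: Callable[[], bool],
    wait_timeout: Optional[float] = None,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Repeat 'archive, then check' until the last required WAL is present.

    Returns True once the WAL is in the archive, or False if wait_timeout
    (in seconds) elapses first. With wait_timeout=None it loops forever.
    """
    deadline = None if wait_timeout is None else clock() + wait_timeout
    while True:
        # Move WALs from the incoming directory into the archive.
        archive_wals()
        # Check whether the last required segment has arrived.
        if last_wal_archived():
            return True
        if deadline is not None and clock() >= deadline:
            return False
        sleep(poll_interval)
```

The clock and sleep parameters exist only so the loop can be exercised without real delays.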

In both cases the barman backup command will not return until all required WALs have been archived and therefore have been safely copied to the Barman server. When --wait is used the command won't return until the WALs have been moved from the incoming directory on the Barman server to their final archive destination.

I agree with the assumptions in this issue about the intentions of the --wait option, specifically:

1. Using the --wait flag is intended to mean "do not return until all WALs are safely in the Barman WAL archive".
2. Omitting the --wait flag is intended to mean "return straight away regardless of where the WALs are because we will rely on PostgreSQL's archive_command and barman cron to make sure the WALs end up safely in the Barman archive".

Currently Barman achieves 1 but does not achieve 2.

If we make the proposed change and set the wait_for_archive argument to FALSE when calling pg_stop_backup then we still achieve 1 because Barman will do its own waiting until the required WALs are in its archive destination. We would also achieve 2 because pg_stop_backup continues to return the last required WAL segment but no longer waits for PostgreSQL to archive it.

My problem with this proposal is that existing Barman users may be unknowingly relying on the fact that Barman currently waits until PostgreSQL has archived the last required WAL even if --wait is not set. If such users are not monitoring their backup status and are simply using the fact that barman backup returns as an indicator that the WALs have been copied off the PostgreSQL server then this would be a breaking change.

I don't think this is a blocker to making this change because it doesn't break the documented behaviour. However, I think we need to be careful about how we describe it in the documentation and the release notes so that we do not leave some users in a position where they think their WALs have been archived (from the PostgreSQL perspective) when actually they have not.

Nov 10 '21 12:11 mikewallace1979

To summarize, the change here is going to be:

Barman will use wait_for_archive=FALSE when calling pg_stop_backup.
This means the backup will finish whether or not PostgreSQL considers the required WALs to have been archived - this will resolve the issue where backups cannot complete on standby servers unless a WAL switch happens to occur on the primary.

The following behaviour will be unchanged:

Backups for which Barman does not yet have the required WALs will continue to be in status WAITING_FOR_WALS until a subsequent barman cron run finds that the required WALs have been archived.
The --wait flag will cause barman backup to wait until barman determines the required WALs have been archived before returning.
The --wait-timeout flag determines how long barman backup will wait.

This only applies to PostgreSQL >= 10. Earlier versions will continue to wait at the pg_stop_backup stage since there is no wait_for_archive option.
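
The behaviour summarised above can be sketched as below. Both function names are hypothetical and this is not Barman's code; it assumes the server version is available as server_version_num, where PostgreSQL 10 reports 100000.

```python
def stop_backup_call(server_version_num: int) -> str:
    """Pick the pg_stop_backup call for a concurrent backup.

    Hypothetical helper; server_version_num follows PostgreSQL's integer
    version format (for example 90600 for 9.6, 100000 for 10).
    """
    if server_version_num >= 100000:
        # wait_for_archive exists: ask PostgreSQL not to wait for archiving.
        return "pg_stop_backup(FALSE, FALSE)"
    # Older servers have no wait_for_archive argument and keep waiting.
    return "pg_stop_backup(FALSE)"


def backup_status(required_wals_in_barman_archive: bool) -> str:
    """Status recorded for the backup once the copy has finished."""
    return "DONE" if required_wals_in_barman_archive else "WAITING_FOR_WALS"


print(stop_backup_call(140000), backup_status(False))
```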

May 13 '22 11:05 mikewallace1979

After a lot of testing and discussion the workaround proposed in PR #580 is being abandoned in favour of #596.

Jun 20 '22 12:06 mikewallace1979

The underlying issue here should be resolved by #596.

Sep 07 '22 10:09 mikewallace1979