2 stanzas, one postgresql cluster, 2 storage buckets, timeout [QUESTION]

Vox1984 opened this issue 1 year ago • 10 comments

Please provide the following information when submitting an issue (feature requests or general comments can skip this):

  1. pgBackRest version: 2.51

  2. PostgreSQL version: 14

  3. Operating system/version - if you have more than one server (for example, a database server, a repository host server, one or more standbys), please specify each: debian11

  4. Did you install pgBackRest from source or from a package? package

  5. Please attach the following as applicable: pgbackrest.conf

[global]
log-level-console=info
log-level-file=debug
start-fast=y
compress-type=zst
compress-level=1
process-max=4
archive-timeout=14400

# stanzas:
## disaster recovery, 28 days retention period.
[disaster_recovery]
pg1-path=/var/lib/postgresql/14/main/
repo1-bundle=y
repo1-cipher-pass=Aaaafsdfsdfdsfsdfsdfdsfdsfds
repo1-cipher-type=aes-256-cbc
repo1-gcs-bucket=gcp-disasterrecovery
repo1-path=/files
repo1-retention-full=28
repo1-retention-full-type=time
repo1-gcs-key-type=auto
repo1-type=gcs
### async settings:
archive-async=y
archive-push-queue-max=360GB
spool-path=/var/spool/pgbackrest
archive-get-queue-max=1GB

## long-term archive, retention of 1000 days, full backup weekly(no separate WAL).
[longterm_archive]
pg1-path=/var/lib/postgresql/14/main/
repo1-bundle=y
repo1-cipher-pass=gfsugfsufgudsgfudgsfgdsfgdsgfidsgfdus
repo1-cipher-type=aes-256-cbc
repo1-gcs-bucket=gcp-archive
repo1-path=/files
repo1-retention-full=1000
repo1-retention-full-type=time
repo1-gcs-key-type=auto
repo1-type=gcs

postgresql.conf archive command: archive_command = 'pgbackrest --stanza=disaster_recovery archive-push %p'

backup command: sudo -u postgres pgbackrest --stanza=longterm_archive --repo=1 --type=full --log-level-console=info backup

  7. Describe the issue:

I have just a few WAL files on this database, as it is a low-traffic one at the moment. Still, while it works 100% for the disaster_recovery stanza, it fails for the longterm_archive one.

ERROR: [082]: WAL segment 0000000100000003000000FF was not archived before the 1200000ms timeout
       HINT: check the archive_command to ensure that all options are correct (especially --stanza).
       HINT: check the PostgreSQL server log for errors.
       HINT: run the 'start' command if the stanza was previously stopped.

I suspect (correct me if I am wrong) that this is because in postgresql.conf I have set up archive_command to use only the first repository (I thought that would be enough; I don't know how to combine those 2 stanzas in one archive command if that is needed). As you can see above, I have 2 storage buckets, and I wanted to have 2 stanzas for convenience and clarity.

  • Is the only option really to combine those 2 stanzas into one stanza with 2 repos?
  • Can 2 stanzas share one WAL archive?
  • Can I have 2 independent WAL archives with the above config?

For the long-term one I want just a simple backup, but it needs to be kept for a really long time. And I would like to avoid keeping WAL files for 1000 days because of the storage size.

  • What would be the effective retention period for WAL files if the above two were configured as repo1 and repo2 rather than as two stanzas?

Vox1984 avatar Apr 22 '24 09:04 Vox1984

Hi,

I suspect (correct me if I am wrong) that this is because in postgresql.conf I have set up archive_command to use only the first repository (I thought that would be enough; I don't know how to combine those 2 stanzas in one archive command if that is needed). As you can see above, I have 2 storage buckets, and I wanted to have 2 stanzas for convenience and clarity.

Indeed. You only configured PG to push WAL segments to one of the repositories. Imo, you're not using the stanza section correctly. Usually, repositories are defined globally and you should use repo option indexing (the multi-repo feature) rather than multiple stanzas.

As stated in the docs:

A stanza is the configuration for a PostgreSQL database cluster that defines where it is located,
how it will be backed up, archiving options, etc. Most db servers will only have one PostgreSQL 
database cluster and therefore one stanza, whereas backup servers will have a stanza for every 
database cluster that needs to be backed up.

You only have 1 PG cluster, so you should only have 1 stanza configuration (containing the options to connect to that cluster).

Is the only option really to combine those 2 stanzas into one stanza with 2 repos? Can't they somehow share the WAL archive? Or can I have 2 independent WAL archives?

The WAL archives need to reside in the same location as the backups. Also note that the expire command only removes WAL archives based on the oldest backup in the repository, so you'll need to take backups separately for each repo (using e.g. the backup --repo=x option).

Your configuration should probably look like this:

[global]
log-level-console=info
log-level-file=debug
start-fast=y
compress-type=zst
process-max=4

### async settings:
archive-async=y
archive-push-queue-max=360GB
spool-path=/var/spool/pgbackrest
archive-get-queue-max=1GB

## disaster recovery, 28 days retention period
...
repo1-gcs-bucket=gcp-disasterrecovery
repo1-retention-full=28
repo1-retention-full-type=time

## long-term archive, retention of 1000 days
...
repo2-gcs-bucket=gcp-archive
repo2-retention-full=1000
repo2-retention-full-type=time

[my_stanza_name]
pg1-path=/var/lib/postgresql/14/main/

With this, the archive-push command will push the WAL segments to both repositories at the same time.
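
For example, a minimal sketch of how backups might then be scheduled (assuming the stanza is named my_stanza_name as above; note that archive_command in postgresql.conf would also need to point at that stanza name):

# one backup command per repository, scheduled as often as needed
sudo -u postgres pgbackrest --stanza=my_stanza_name --repo=1 --type=full backup
sudo -u postgres pgbackrest --stanza=my_stanza_name --repo=2 --type=full backup

# postgresql.conf
archive_command = 'pgbackrest --stanza=my_stanza_name archive-push %p'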

Regards,

pgstef avatar Apr 22 '24 09:04 pgstef

Thanks pgstef for very quick reply.

But the question remains about expiration (retention). I created 2 stanzas because I would like to avoid keeping 1000 days' worth of WAL files... I only need 28 days of PITR for the disaster_recovery stanza.

I suspect that with the config you proposed above, the WAL would be kept for 1000 days, right? Because the higher retention value would keep it from expiring?

Vox1984 avatar Apr 22 '24 09:04 Vox1984

No, you would have 1000 days of archives in the repository where the oldest backup is 1000 days old and 28 days of archives in the repository where the oldest backup is 28 days old.

As mentioned, WAL archives are expired based on the oldest backup in the repo.

There's also repo-retention-archive, which lets you reduce the number of WAL files in a repo, but it's way more complicated to use when you're using full backup time expiration. Modifying the WAL archiving retention method is imo an advanced feature that should be used very carefully.

pgstef avatar Apr 22 '24 09:04 pgstef

Ah yes, right, there would be 2 separate WAL archive subfolders in each storage bucket.

So there is no option to shrink the size of stored WALs for the long-term archive repo. 1000 days it is.

I think it would be a good use case for a new pgBackRest feature to allow a snapshot to be taken: say, just a full backup with a few WAL files bundled together (15 minutes' worth?) as a single point in time (because PostgreSQL requires that, I think). One could then take such a snapshot every 24 hours for the sake of the long-term archive.

Not sure if it's possible from a technical point of view without stopping the db.

Vox1984 avatar Apr 22 '24 10:04 Vox1984

So there is no option to shrink the size of stored WALs for the long-term archive repo. 1000 days it is.

Yes, there's a way. I wrote it in my last comment -> repo-retention-archive. If you don't need PITR capability between the backups in the repository, you can change the archive retention setting. But as also mentioned, mixing this with full backup time retention makes it very tricky.

I think it would be a good use case for a new pgBackRest feature to allow a snapshot to be taken: say, just a full backup with a few WAL files bundled together (15 minutes' worth?) as a single point in time (because PostgreSQL requires that, I think). One could then take such a snapshot every 24 hours for the sake of the long-term archive.

Those snapshots, as you call them, are basically incr/diff backups. WAL files are at the root of the Postgres write mechanism. You need them for consistency, crash recovery, physical backups, replication,... You can't just throw them away or mess with them. What you can do is take incr or diff backups so that you don't need to replay the WAL entries between the full and those incr/diff backups. From my perspective, what you're asking for is already there.

Not sure if its possible from technical point of view without stopping the db.

It is, that's the magic of WAL files. You need them for your backup/snapshot consistency, at least those generated while the db files are being copied. The others (generated after the backup) are there for PITR.

But for sure, if you want multiple repositories, you'll need to have backups in both (the backup command only works per repository), and achieving a tricky retention configuration using the time retention type will definitely give you some headaches. It is way easier to combine a number of backups with how often/when they are scheduled, as sketched below.
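
For illustration only (these values are not from the configuration above): assuming one full backup per week on repo2, a count-based retention covering roughly 1000 days could look like this, since 143 weekly fulls span about 1001 days (143 × 7):

# hypothetical count-based variant for the long-term repo
repo2-retention-full-type=count
repo2-retention-full=143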

pgstef avatar Apr 22 '24 12:04 pgstef

Yes, there's a way. I wrote it in my last comment -> repo-retention-archive. If you don't need PITR capability between the backups in the repository, you can change the archive retention setting. But as also mentioned, mixing this with full backup time retention makes it very tricky.

Ok, let's say I set up a backup retention of 1000 days for longterm_archive and repo-retention-archive of 1 backup. Backups are taken weekly. Will I be able to restore from a backup older than a week using just the long-term archive repository? Let's assume the first repo, disaster_recovery, is unavailable (data loss or something).

Vox1984 avatar Apr 22 '24 12:04 Vox1984

Will I be able to restore from a backup older than a week using just the long-term archive repository?

The WAL archives needed for backup consistency will never get expired. So yes, you will be able to restore your older backups, but you'll only be able to perform PITR within the last week. You won't be able to recover to a point between 2 backups.

Using pgbackrest restore --repo=2 --type=immediate --set=X will let you do that. And what's even better is that the archive-get command will work on both repos in priority order: it will first look for the WAL archive in repo1 and, if missing, in repo2. So if storage is faster on repo1, you'll get faster recovery too.

To understand what it means to restore a backup with only a few WAL archives (the ones needed for consistency), you can do some research on Postgres recovery_target='immediate'.
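
As a rough sketch, assuming the example stanza name my_stanza_name from earlier (the backup label is a placeholder to be taken from the info output):

# list the backups available in the long-term repo
sudo -u postgres pgbackrest --stanza=my_stanza_name --repo=2 info
# with the cluster stopped, restore that backup and end recovery as soon as it is consistent
sudo -u postgres pgbackrest --stanza=my_stanza_name --repo=2 --type=immediate --set=<backup-label> restore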

pgstef avatar Apr 22 '24 13:04 pgstef

If disk space is at a premium, then this setting, in conjunction with repo-retention-archive-type, can be used to aggressively expire WAL segments. However, doing so negates the ability to perform PITR from the backups with expired WAL and is therefore not recommended

Sounds like my scenario. I want this setting only for my second repo, the long-term one. No need for PITR; this repo is just for a backup once a week. Whether it's restorable to any point in time during that week, I don't really care (it's for legal regulation). Because of the requirement to store it for years, the WAL files would grow too much.

WAL segments required to make a backup consistent are always retained until the backup is expired regardless of how this option is configured.

So pgBackRest would automagically keep the minimum number of WAL files. Perfect!

Would such config be valid?

...
## disaster recovery, 28 days retention period
...
repo1-gcs-bucket=gcp-disasterrecovery
repo1-retention-full=28
repo1-retention-full-type=time

## long-term archive, retention of 1000 days
...
repo2-gcs-bucket=gcp-archive
repo2-retention-full=1000
repo2-retention-full-type=time
repo2-retention-archive=1

[my_stanza_name]
pg1-path=/var/lib/postgresql/14/main/

I am aiming to have WAL cleared after 1 day for repo2 (it would actually be 1 week, I suppose, because that's how often the repo2 backup is taken?) and to keep the full backups for 1000 days.

I am asking because the documentation (https://pgbackrest.org/configuration.html#section-repository/option-repo-retention-archive) says: If repo-retention-full-type is time, then this value will default to removing archives that are earlier than the oldest full backup retained after satisfying the repo-retention-full setting.

So would it really expire WAL files for repo2 older than 1 week, or only those older than 1000 days?

Vox1984 avatar Apr 22 '24 14:04 Vox1984

I am aiming to have WAL cleared after 1 day for repo2 (it would actually be 1 week, I suppose, because that's how often the repo2 backup is taken?) and to keep the full backups for 1000 days.

Expiration is based on backups, so unless you take more frequent backups, it will probably be 1 week, yes.

What you're actually describing is a use case where you'd have 1 repo only for backups, without archives (just those needed for backup consistency). That means "disabling archiving on a specific repository". We would like to have that kind of feature, but we're not there yet.

I am asking because the documentation (https://pgbackrest.org/configuration.html#section-repository/option-repo-retention-archive) says: If repo-retention-full-type is time, then this value will default to removing archives that are earlier than the oldest full backup retained after satisfying the repo-retention-full setting.

That's what I wrote at least twice: reducing the number of WAL files in a repo is way more complicated when you're using full backup time expiration.

So would it really expire WAL files for repo 2 older than 1 week or rather older than 1000 days?

Honestly, looking at the code, I believe it would work (it would only keep the WAL archives after the latest backup). But that needs to be tested to really confirm it (and testing time expiration is not that easy, tbh).

Regards

pgstef avatar Apr 22 '24 14:04 pgstef

Thanks pgstef, for detailed explanation.

I don't really care if it expires by time or by count (since I know the frequency, it's the same for me - I can recalculate it). I understand these are 2 different pieces of code.

I can test it by setting up the above and then seeing whether the WAL files were indeed expired (I think such expiration would be logged). I will post the results here.
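
One possible way to preview this without waiting for the scheduled run, assuming the installed version supports dry-run for the expire command (an untested sketch, using the example stanza name from above):

# show what expire would remove on repo2 without actually deleting anything
sudo -u postgres pgbackrest --stanza=my_stanza_name --repo=2 --dry-run --log-level-console=detail expire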

Vox1984 avatar Apr 22 '24 15:04 Vox1984

I can confirm that after a week I get this in the expire.log:

2024-05-05 03:56:39.997 P00 INFO: repo2: 14-1 remove archive, start = 0000000100001ABF, stop = 0000000100001AC000000039

Looks like it works, as the archive in the long-term archive GCS bucket is much smaller than in the disaster recovery one.

What could be better are the messages:

  • when the criteria are not met you get: INFO: repo1: time-based archive retention not met - archive logs will not be expired
  • when the criteria are met there is no such message, which is not very convenient for grepping logs. I hoped for a "will be expired" message.

I am very happy it works; the space saved for a week is 2 TB just for one DB. Thanks a lot for your help!

Vox1984 avatar May 06 '24 07:05 Vox1984

when the criteria are met there is no such message, which is not very convenient for grepping logs. I hoped for a "will be expired" message.

Hmm, but I think that's what we do, e.g.

repo1: 9.4-1 remove archive, start = 000000020000000000000006, stop = 000000020000000000000006

dwsteele avatar May 09 '24 00:05 dwsteele