
Back up PostgreSQL databases

Open pypt opened this issue 2 years ago • 9 comments

Given that the disks on which we run our services are nearly shot, or at least very old, and the disk health review + replacement is still ongoing, we need to improve PostgreSQL backups by implementing the 3-2-1 strategy.

Current state of affairs is as follows:

  1. First copy of the PostgreSQL dataset: production dataset on woodward. Runs on disks that are still alive but are super old.
  2. Second copy of the PostgreSQL dataset: in the -systems repo, we have a zfs-send.py script which used to run once a day and copy a ZFS snapshot of the PostgreSQL data to another server, sinclair (a rough sketch of what it does follows this list). However, we suspended this script during Holyoke's annual shutdown and haven't resumed it since. Plus, sinclair, the server being backed up to, has even older disks, at least one of which is already shot, so it's not much of a second copy right now.
  3. Third copy of the PostgreSQL dataset: largely nonexistent. There's a copy of the dataset from 2019 somewhere, and I have recently manually copied the full dataset to S3, but our data is not being periodically copied anywhere, so if (when) those disks decide to move into eternity (pretty soon now), we'll be able to restore only the dataset that's a few months old.
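
For reference, a minimal sketch of the kind of snapshot-and-send cycle zfs-send.py does (dataset and host names here are illustrative, not the exact ones the script uses):

# Hypothetical dataset / host names; the real ones live in zfs-send.py in the -systems repo.
SNAP="data/postgresql@daily-$(date +%F)"
zfs snapshot "$SNAP"

# Incremental send against yesterday's snapshot, received on sinclair:
zfs send -i "data/postgresql@daily-$(date -d yesterday +%F)" "$SNAP" \
    | ssh sinclair zfs receive -F backup/postgresql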

Therefore, a to-do:

  • [ ] Make zfs-send.py periodically copy PostgreSQL dataset from woodward to sinclair again. Sinclair's disks are unreliable, but at least it's something.
  • [ ] Deploy WAL-G, do a full backup, do a few incremental backups, and measure how much time + space they take (see the command sketch after this list).
    • [ ] System load
    • [ ] Time to do full backup
    • [ ] Time to do incremental backup
    • [ ] Requests needed to periodically upload WALs to S3
  • [ ] Increase WAL size so that uploads to S3 don't happen every 5 seconds
  • [ ] Try + document restoring a WAL-G backup, and measure how long it would take us to do it.
  • [ ] Clean up stuff on test S3 buckets.
  • [ ] Make WAL-G do a periodic backup every day or so.
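
As a starting point for the WAL-G items above, the configuration and commands involved look roughly like this (a sketch under assumptions: the bucket name, paths and credentials are placeholders, not our actual settings):

# Placeholder bucket / paths / credentials; real values would live in the deployment config.
export WALG_S3_PREFIX="s3://example-postgresql-wal-backups/postgresql-server"
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export PGDATA="/var/lib/postgresql/data"

# Page-level incremental (delta) backups are enabled by allowing up to N deltas per full backup:
export WALG_DELTA_MAX_STEPS=6

# Base backup (full, or a delta if WALG_DELTA_MAX_STEPS permits), run as the postgres user:
wal-g backup-push "$PGDATA"

# Continuous WAL archiving is done by PostgreSQL itself, e.g. in postgresql.conf:
#   archive_mode = on
#   archive_command = 'wal-g wal-push %p'

As for increasing the WAL segment size (the to-do item above): as far as I understand, that's an initdb-time setting (initdb --wal-segsize) since PostgreSQL 11, so it wouldn't be a plain configuration change.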

pypt avatar Oct 08 '21 13:10 pypt

Started test full backup to S3.

pypt avatar Oct 09 '21 00:10 pypt

Initial full PgBackRest backup completed in 4238 minutes (~71 hours) and used up ~14 TB of space (compressed with lz4), but unfortunately it turned out that we won't be able to effectively do daily incremental backups with PgBackRest because it only supports file-level and not page-level incremental backups (https://github.com/pgbackrest/pgbackrest/issues/959).

Our database makes ~150 GB worth of changes every day (a rough estimate: if you were to zfs snapshot the data volume and wait a day, the current version of the dataset would be "different by" about 150 GB from the day before). Apparently those changes touch the majority of the database's data files (~38,000 files, up to ~1 GB each), and if any of these files gets modified by even a single byte, PgBackRest uploads the whole file again on an incremental backup. So, for time and money reasons, we can't keep uploading 14 TB of objects to S3 every three days.
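
For reference, that daily-change estimate can be reproduced with ZFS itself; a rough sketch (dataset name is illustrative), using zfs send in dry-run mode, which only prints the estimated size of the incremental stream:

zfs snapshot data/postgresql@day1
# ...wait a day...
zfs snapshot data/postgresql@day2

# -n = dry run, -v = verbose; prints the estimated stream size, nothing actually gets sent:
zfs send -nv -i data/postgresql@day1 data/postgresql@day2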

Two tools that claim to support page-level incremental backups are:

I'll try them out.

pypt avatar Oct 13 '21 01:10 pypt

The WAL-G full backup took 10,837 minutes (~181 hours, or ~7.5 days) to complete, which is disappointing :(

The backup consists of ~28.2k files (15.9 TB), and in those ~7.5 days PostgreSQL wrote an extra ~185k WALs (1.7 TB).

One thing to try is the somewhat undocumented --turbo flag, which disables throttling, but then I don't think any kind of throttling should be enabled in the first place.
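
If it is throttling, it's probably also worth double-checking WAL-G's rate-limit settings before reaching for --turbo; a quick sanity check, assuming the standard WAL-G environment variables, would be something like:

# WAL-G honours these limits (bytes per second); if either is set in our environment,
# that alone would explain a slow upload.
env | grep -E 'WALG_(NETWORK|DISK)_RATE_LIMIT'
unset WALG_NETWORK_RATE_LIMIT WALG_DISK_RATE_LIMIT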

I'll try doing the incremental (delta) backup and see if it's faster.

pypt avatar Oct 20 '21 20:10 pypt

A few incremental backups managed to finish too.

Some stats:

  • full backup:
    • aws s3 ls --summarize --human-readable --recursive s3://mediacloud-postgresql-wal-backups/postgresql-server/basebackups_005/base_000000010004D03400000047/
    • completed in 10,838 minutes (or 180 hours, or 7.5 days)
    • ~28,200 objects
    • ~15,900 GB
    • per hour:
      • ~156 objects
      • ~88 GB
    • per day:
      • ~3,760 objects
      • ~2,120 GB
  • delta backup:
    • aws s3 ls --summarize --human-readable --recursive s3://mediacloud-postgresql-wal-backups/postgresql-server/basebackups_005/base_000000010004D3050000006C_D_000000010004D03400000047/
    • completed in 3,342 minutes (or 56 hours, or 2.4 days)
    • ~650 objects
    • ~384 GB
    • per hour:
      • ~11 objects
      • ~6.8 GB
    • per day:
      • ~279 objects
      • ~164 GB
  • WALs:
    • aws s3 ls --summarize --human-readable --recursive s3://mediacloud-postgresql-wal-backups/postgresql-server/wal_005/
    • period:
      • oldest WAL: 2021-10-13 02:43:38
      • newest WAL: 2021-10-25 14:38:40
      • ~18,000 minutes, or ~300 hours
    • ~283,342 objects
    • ~2,700 GB
    • per hour:
      • ~945 objects
      • ~9 GB
    • per day:
      • ~22,680 objects
      • ~216 GB

Assuming that we'll want to:

  • Always keep at least 2 full backups (as per recommendations)
  • Keep up to 6 delta (incremental) backups on top of each of these full backups (also as it's done in various tutorials)

I've created a little "model" (for lack of a better term) to visualize how much disk space we'll need to store these backups + WALs:

https://docs.google.com/spreadsheets/d/1Pe0Z12Y3eFzdz2bzwdeovRi2LVmmLueGhEXJYQFvXus/edit#gid=0

Notes:

  • In the "Days" sheet, I've modeled what kind of backup / cleanup we do on any given day and how much storage gets used up by the different kinds of objects (full backups, delta backups, WALs)
  • "Graph" plots cumulative storage use over time
  • In "Providers", I've done a quick and dirty comparison of how much it would cost us to store these backups with various providers; I didn't bother including the obviously expensive ones

Some things might be confusing, so feel free to ask away.

Grand conclusion: even if my calculations are off (and they very well might be here and there), B2 is the cheapest option at an average expense of $250/month and $500 for restoring the backup (if we ever need to do it).
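
For the retention policy assumed above (2 full backups, up to 6 deltas on each), the periodic cleanup would presumably be WAL-G's delete command, roughly along these lines (a sketch, not the exact invocation we'd deploy):

# Keep the 2 most recent full backups together with their delta backups, delete older ones;
# without --confirm, WAL-G only does a dry run.
wal-g delete retain FULL 2 --confirm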

@rahulbot, OK for us to start backing up to Backblaze B2?

pypt avatar Oct 26 '21 14:10 pypt

Started initial backup to B2.

pypt avatar Oct 26 '21 21:10 pypt

Initial backup to B2 exited because the PostgreSQL container went away (got killed by someone / something):

<...>
INFO: 2021/11/02 06:54:03.685946 Finished writing part 24382.
INFO: 2021/11/02 06:54:03.686024 Starting part 24383 ...
INFO: 2021/11/02 06:54:35.860215 Finished writing part 24383.
INFO: 2021/11/02 06:54:35.860271 Starting part 24384 ...
INFO: 2021/11/02 06:55:25.657891 Finished writing part 24384.
INFO: 2021/11/02 06:55:25.657943 Starting part 24385 ...
INFO: 2021/11/02 06:56:25.540096 Finished writing part 24385.
INFO: 2021/11/02 06:56:25.540150 Starting part 24386 ...
INFO: 2021/11/02 06:57:09.258935 Finished writing part 24386.
INFO: 2021/11/02 06:57:09.258979 Starting part 24387 ...
INFO: 2021/11/02 06:58:01.548340 Finished writing part 24387.
INFO: 2021/11/02 06:58:01.548397 Starting part 24388 ...
INFO: 2021/11/02 06:58:33.827535 Finished writing part 24388.
INFO: 2021/11/02 06:58:33.827583 Starting part 24389 ...
INFO: 2021/11/02 06:59:08.104628 Finished writing part 24389.
INFO: 2021/11/02 06:59:08.104677 Starting part 24390 ...
INFO: 2021/11/02 06:59:44.330090 Finished writing part 24390.
INFO: 2021/11/02 06:59:44.341975 Starting part 24391 ...
INFO: 2021/11/02 07:00:57.064279 Finished writing part 24391.
INFO: 2021/11/02 07:00:57.064348 Starting part 24392 ...
INFO: 2021/11/02 07:01:14.929094 Finished writing part 24392.
INFO: 2021/11/02 07:01:14.929133 Starting part 24393 ...
INFO: 2021/11/02 07:01:33.575336 Finished writing part 24393.
INFO: 2021/11/02 07:01:33.575543 Starting part 24394 ...
INFO: 2021/11/02 07:01:50.621186 Finished writing part 24394.
INFO: 2021/11/02 07:01:50.621236 Starting part 24395 ...

ansbl@woodward:~$ echo $?
137

$ docker ps | grep postg
ccd320664798   gcr.io/mcback/postgresql-server:release   "/opt/postgresql-ser…"   5 hours ago   Up 5 hours   5432/tcp   mediacloud_postgresql-server.1.oo2eggw7rlrsfbwgcp1czwlv9

at around:

2021-11-02 07:00:39 EDT [7-5] LOG:  received fast shutdown request
2021-11-02 07:00:39 EDT [7-6] LOG:  aborting any active transactions

Closest relevant messages to that time were Docker complaining about not being able to talk to other nodes:

time="2021-11-02T07:00:25.637511290-04:00" level=warning msg="memberlist: Failed to send ack: write tcp 10.200.200.11:7946->10.200.200.33:62250: i/o timeout from=10.200.200.33:62250"
time="2021-11-02T07:00:25.657016246-04:00" level=error msg="Error in responding to bulk sync from node 10.200.200.23: failed to send a TCP message during bulk sync: write tcp 10.200.200.11:29970->10.200.200.23:7946: write: broken pipe"
time="2021-11-02T07:00:25.666327046-04:00" level=warning msg="memberlist: Refuting a suspect message (from: 7d2b28b0e281)"
time="2021-11-02T07:00:25.792450535-04:00" level=error msg="heartbeat to manager {nxmkl3meudl3g6c9fk9gngn1u 10.200.200.23:2377} failed" error="rpc error: code = NotFound desc = node not registered" method="(*session).heartbeat" module=node/agent node.id=oo0dosf18648yylrt8u4mkgqe session.id=5grc11x741x198vvoyr63lkgd sessionID=5grc11x741x198vvoyr63lkgd

<...>

time="2021-11-02T07:00:34.921939971-04:00" level=error msg="agent: session failed" backoff=100ms error="node not registered" module=node/agent node.id=oo0dosf18648yylrt8u4mkgqe
time="2021-11-02T07:00:35.015820535-04:00" level=error msg="agent: session failed" backoff=300ms error="rpc error: code = NotFound desc = node not registered" module=node/agent node.id=oo0dosf18648yylrt8u4mkgqe
time="2021-11-02T07:00:35.016583324-04:00" level=info msg="parsed scheme: \"\"" module=grpc
time="2021-11-02T07:00:35.016781498-04:00" level=info msg="scheme \"\" not registered, fallback to default scheme" module=grpc
time="2021-11-02T07:00:35.209609463-04:00" level=info msg="ccResolverWrapper: sending update to cc: {[{10.200.200.24:2377  <nil> 0 <nil>}] <nil> <nil>}" module=grpc
time="2021-11-02T07:00:35.209682536-04:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
time="2021-11-02T07:00:35.212410242-04:00" level=info msg="manager selected by agent for new session: {90lt97uu1xpt2vjhe38k99fph 10.200.200.24:2377}" module=node/agent node.id=oo0dosf18648yylrt8u4mkgqe
time="2021-11-02T07:00:35.212503358-04:00" level=info msg="waiting 145.556634ms before registering session" module=node/agent node.id=oo0dosf18648yylrt8u4mkgqe

So I'm guessing network problems at Holyoke? Especially considering that Icinga spent the whole day complaining about lost ping packets.

Cleaned up an old backup, trying to do a new one again.

pypt avatar Nov 02 '21 18:11 pypt

Started a test restore on EC2 to find out how much time it would take us to restore the backup if we ever needed to do it. Should run for about a day.
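
The restore itself is basically WAL-G's backup-fetch; roughly what's running on the EC2 instance (paths are illustrative, and WALG_S3_PREFIX / credentials are set the same way as on the backup side):

# Fetch the latest base backup into an empty data directory:
wal-g backup-fetch /var/lib/postgresql/data LATEST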

pypt avatar Nov 02 '21 20:11 pypt

The postgresql-server service got killed a few more times while attempting a full backup.

Even though there are no signs (that I could find) that it OOMed, I've switched the compression method from LZ4 to Brotli anyway (https://github.com/wal-g/wal-g/issues/224). Trying to make a new backup again.
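
In WAL-G terms that's a one-variable change, assuming the usual configuration mechanism:

# Default is lz4; brotli compresses harder at the cost of extra CPU.
export WALG_COMPRESSION_METHOD=brotli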

pypt avatar Nov 05 '21 16:11 pypt

The initial full backup restore (without WALs) from S3 took 4,636 minutes (~3.2 days) on a c6g.2xlarge (8 vCPUs, 16 GB of RAM). The restore would have run considerably faster on an instance with more CPUs, plus EBS seemed to stall at various points for whatever reason (ZFS compression?), but then Brotli decompression will be a bit slower too. Lastly, the 16.1 TB dataset got restored to 17.9 TB, probably due to a different ashift ZFS setting on the target pool.

Trying to restore individual WALs on top of a restored full backup.
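
For the WAL part, the usual WAL-G approach is to let PostgreSQL pull the segments it needs via restore_command; a sketch (exact recovery settings depend on the PostgreSQL version):

# In postgresql.conf on the restore target (PostgreSQL 12+; older versions use recovery.conf):
#   restore_command = 'wal-g wal-fetch "%f" "%p"'
# Then create an empty recovery.signal in the data directory and start PostgreSQL:
touch /var/lib/postgresql/data/recovery.signal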

pypt avatar Nov 07 '21 21:11 pypt