wal-g icon indicating copy to clipboard operation
wal-g copied to clipboard

Fix Postgres 12+/Cloudberry concurrent backups.

Open ostinru opened this issue 1 year ago • 0 comments

PostgreSQL / Cloudberry

PostgreSQL 12+ may run exclusive backup and several non-exclusive backups at the same time. We use non-exclusive backups. At the end of backup we call pg_stop_backup and save it in fictional backup_label file that we store in on of the part_xxx.tar files. When WALG_DOWNLOAD_CONCURRENCY != 1, wal-g do the best to extract files as fast as possible. It is likely, that tiny part_xxx.tar with our backup_label will be fetched before huge main backup part_000.tar. IFF main backup has backup_label - we will end up with wrong backup_label and restore will end with requested recovery stop point is before consistent recovery point

Please provide steps to reproduce (if it's a bug)

  # imitate failed backup (label='abc', fast=true, exclusive=true)
  echo "SELECT pg_start_backup('abc', true, true);" | psql

  wal-g backup-push ${PGDATA}

Please add config and wal-g stdout/stderr logs for debug purpose

logs
2024-10-15 15:06:44.801294 UTC,,,p4390,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","starting PostgreSQL 14.4 (Cloudberry Database 1.6.0 build dev) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit",,,,,,,0,,"postmaster.c",1370,
2024-10-15 15:06:44.801672 UTC,,,p4390,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","listening on IPv4 address ""0.0.0.0"", port 7000",,,,,,,0,,"pqcomm.c",623,
2024-10-15 15:06:44.801683 UTC,,,p4390,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","listening on IPv6 address ""::"", port 7000",,,,,,,0,,"pqcomm.c",623,
2024-10-15 15:06:44.804695 UTC,,,p4390,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","listening on Unix socket ""/tmp/.s.PGSQL.7000""",,,,,,,0,,"pqcomm.c",618,
2024-10-15 15:06:44.810222 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","database system was interrupted; last known up at 2024-10-15 15:06:10 UTC",,,,,,,0,,"xlog.c",6817,
2024-10-15 15:06:44.810255 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","creating missing WAL directory ""pg_wal/archive_status""",,,,,,,0,,"xlog.c",4382,
2024-10-15 15:06:44.925788 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","starting point-in-time recovery to ""backup_20241015T150608Z""",,,,,,,0,,"xlog.c",6916,
2024-10-15 15:06:45.876270 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","restored log file ""000000010000000000000003"" from archive",,,,,,,0,,"xlogarchive.c",219,
2024-10-15 15:06:45.980452 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","redo starts at 0/C000028",,,,,,,0,,"xlog.c",7675,
2024-10-15 15:06:46.868777 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","restored log file ""000000010000000000000004"" from archive",,,,,,,0,,"xlogarchive.c",219,
2024-10-15 15:06:47.052610 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","restored log file ""000000010000000000000005"" from archive",,,,,,,0,,"xlogarchive.c",219,
2024-10-15 15:06:47.152175 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","recovery stopping at restore point ""backup_20241015T150608Z"", time 2024-10-15 15:06:28.640107+00",,,,,,,0,,"xlog.c",6053,
2024-10-15 15:06:47.162836 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"FATAL","XX000","requested recovery stop point is before consistent recovery point",,,,,,,0,,"xlog.c",7925,"Stack trace:
1    0xffff9aafd8c8 libpostgres.so errstart + 0x260
2    0xffff9a5a963c libpostgres.so StartupXLOG + 0x4f64
3    0xffff9a8d3c88 libpostgres.so StartupProcessMain + 0x108
4    0xffff9a5e87fc libpostgres.so AuxiliaryProcessMain + 0x6d4
5    0xffff9a8d33e8 libpostgres.so PostmasterMain + 0x11f0
6    0xaaaaddba1a08 postgres main + 0x4f8
7    0xffff99ed0e10 libc.so.6 __libc_start_main + 0xe8
8    0xaaaaddba1bb8 postgres <symbol not found> + 0xddba1bb8

ostinru avatar Oct 19 '24 08:10 ostinru