wal-g
wal-g copied to clipboard
Fix Postgres 12+/Cloudberry concurrent backups.
PostgreSQL / Cloudberry
PostgreSQL 12+ may run exclusive backup and several non-exclusive backups at the same time. We use non-exclusive backups. At the end of backup we call pg_stop_backup and save it in fictional backup_label file that we store in on of the part_xxx.tar files.
When WALG_DOWNLOAD_CONCURRENCY != 1, wal-g do the best to extract files as fast as possible. It is likely, that tiny part_xxx.tar with our backup_label will be fetched before huge main backup part_000.tar.
IFF main backup has backup_label - we will end up with wrong backup_label and restore will end with requested recovery stop point is before consistent recovery point
Please provide steps to reproduce (if it's a bug)
# imitate failed backup (label='abc', fast=true, exclusive=true)
echo "SELECT pg_start_backup('abc', true, true);" | psql
wal-g backup-push ${PGDATA}
Please add config and wal-g stdout/stderr logs for debug purpose
logs
2024-10-15 15:06:44.801294 UTC,,,p4390,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","starting PostgreSQL 14.4 (Cloudberry Database 1.6.0 build dev) on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit",,,,,,,0,,"postmaster.c",1370,
2024-10-15 15:06:44.801672 UTC,,,p4390,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","listening on IPv4 address ""0.0.0.0"", port 7000",,,,,,,0,,"pqcomm.c",623,
2024-10-15 15:06:44.801683 UTC,,,p4390,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","listening on IPv6 address ""::"", port 7000",,,,,,,0,,"pqcomm.c",623,
2024-10-15 15:06:44.804695 UTC,,,p4390,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","listening on Unix socket ""/tmp/.s.PGSQL.7000""",,,,,,,0,,"pqcomm.c",618,
2024-10-15 15:06:44.810222 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","database system was interrupted; last known up at 2024-10-15 15:06:10 UTC",,,,,,,0,,"xlog.c",6817,
2024-10-15 15:06:44.810255 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","creating missing WAL directory ""pg_wal/archive_status""",,,,,,,0,,"xlog.c",4382,
2024-10-15 15:06:44.925788 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","starting point-in-time recovery to ""backup_20241015T150608Z""",,,,,,,0,,"xlog.c",6916,
2024-10-15 15:06:45.876270 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","restored log file ""000000010000000000000003"" from archive",,,,,,,0,,"xlogarchive.c",219,
2024-10-15 15:06:45.980452 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","redo starts at 0/C000028",,,,,,,0,,"xlog.c",7675,
2024-10-15 15:06:46.868777 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","restored log file ""000000010000000000000004"" from archive",,,,,,,0,,"xlogarchive.c",219,
2024-10-15 15:06:47.052610 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","restored log file ""000000010000000000000005"" from archive",,,,,,,0,,"xlogarchive.c",219,
2024-10-15 15:06:47.152175 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"LOG","00000","recovery stopping at restore point ""backup_20241015T150608Z"", time 2024-10-15 15:06:28.640107+00",,,,,,,0,,"xlog.c",6053,
2024-10-15 15:06:47.162836 UTC,,,p4392,th-1770865664,,,,0,,,seg-1,,,,,"FATAL","XX000","requested recovery stop point is before consistent recovery point",,,,,,,0,,"xlog.c",7925,"Stack trace:
1 0xffff9aafd8c8 libpostgres.so errstart + 0x260
2 0xffff9a5a963c libpostgres.so StartupXLOG + 0x4f64
3 0xffff9a8d3c88 libpostgres.so StartupProcessMain + 0x108
4 0xffff9a5e87fc libpostgres.so AuxiliaryProcessMain + 0x6d4
5 0xffff9a8d33e8 libpostgres.so PostmasterMain + 0x11f0
6 0xaaaaddba1a08 postgres main + 0x4f8
7 0xffff99ed0e10 libc.so.6 __libc_start_main + 0xe8
8 0xaaaaddba1bb8 postgres <symbol not found> + 0xddba1bb8