pg_probackup icon indicating copy to clipboard operation
pg_probackup copied to clipboard

ERROR: WAL segment could not be streamed in 300 seconds

Open deric opened this issue 3 years ago • 2 comments

A full backup with streaming

pg_probackup-13 backup -B /var/lib/postgresql/pgbackup --instance db01 -b FULL --stream --remote-host=db01.example.net --remote-user=postgres -U backup -d backup -j 4 --temp-slot --compress

is failing due to

ERROR: WAL segment 0000000100021E510000002A could not be streamed in 300 seconds

The backup process was running 14 hours:

db01  13       QYUN4Y  ----           FULL  STREAM    1/0  14h:36m  3246GB    0    2.45  21AB4/D07BA908  0/0       ERROR   

According to logs the basebackup was completed, then it was transferring WALs:

2021-09-03 22:51:16 UTC [4511]: VERBOSE: Writing headers for file "base/16403/4117928982.1" offset: 14202299925, len: 690551, crc: 3981487476
2021-09-03 22:51:16 UTC [4511]: VERBOSE: File "/var/lib/postgresql/13/main/base/16403/4117928982.1". Copied 8781824 bytes
2021-09-03 22:51:16 UTC [4511]: LOG: SSH process 4529 is terminated with status 0
2021-09-03 22:51:17 UTC [4511]: VERBOSE: Writing headers for file "base/16403/4117928982.10" offset: 14202990476, len: 690550, crc: 3216108614
2021-09-03 22:51:17 UTC [4511]: VERBOSE: File "/var/lib/postgresql/13/main/base/16403/4117928982.10". Copied 8781824 bytes
2021-09-03 22:51:17 UTC [4511]: LOG: SSH process 4531 is terminated with status 0
2021-09-03 22:51:17 UTC [4511]: INFO: Data files are transferred, time elapsed: 14h:31m
2021-09-03 22:51:17 UTC [4511]: VERBOSE: (query) SET client_min_messages = warning;
2021-09-03 22:51:17 UTC [4511]: VERBOSE: (query) SET datestyle = 'ISO, DMY';
2021-09-03 22:51:17 UTC [4511]: VERBOSE: (query) SELECT pg_catalog.pg_create_restore_point($1)
2021-09-03 22:51:17 UTC [4511]: VERBOSE:        (param:0) = pg_probackup, backup_id QYUN4Y
2021-09-03 22:51:17 UTC [4511]: VERBOSE: (query) SELECT pg_catalog.txid_snapshot_xmax(pg_catalog.txid_current_snapshot()), current_timestamp(0)::timestamptz, lsn, labelfile, spcmapfile FROM pg_catalog.pg_stop_backup(false, false)
2021-09-03 22:51:17 UTC [4511]: VERBOSE: finished segment at 21C4B/B8000000 (timeline 1)
2021-09-03 22:51:17 UTC [4511]: VERBOSE: finished segment at 21C4B/BC000000 (timeline 1)
2021-09-03 22:51:18 UTC [4511]: INFO: wait for pg_stop_backup()
2021-09-03 22:51:18 UTC [4511]: INFO: pg_stop backup() successfully executed
2021-09-03 22:51:18 UTC [4511]: LOG: stop_lsn: 21E51/A8489070
2021-09-03 22:51:18 UTC [4511]: LOG: Looking for LSN 21E51/A8489070 in segment: 0000000100021E510000002A
2021-09-03 22:51:18 UTC [4511]: VERBOSE: finished segment at 21C4B/C0000000 (timeline 1)
2021-09-03 22:51:19 UTC [4511]: VERBOSE: finished segment at 21C4B/C4000000 (timeline 1)
2021-09-03 22:51:19 UTC [4511]: INFO: Wait for LSN 21E51/A8489070 in streamed WAL segment /var/lib/postgresql/pgbackup/backups/psql06c.de/QYUN4Y/database/pg_wal/0000000100021E510000002A
2021-09-03 22:51:19 UTC [4511]: VERBOSE: finished segment at 21C4B/C8000000 (timeline 1)
2021-09-03 22:51:20 UTC [4511]: VERBOSE: finished segment at 21C4B/CC000000 (timeline 1)
2021-09-03 22:51:20 UTC [4511]: VERBOSE: finished segment at 21C4B/D0000000 (timeline 1)
2021-09-03 22:51:21 UTC [4511]: VERBOSE: finished segment at 21C4B/D4000000 (timeline 1)
...
2021-09-03 22:56:16 UTC [4511]: VERBOSE: finished segment at 21C52/BC000000 (timeline 1)
2021-09-03 22:56:17 UTC [4511]: VERBOSE: finished segment at 21C52/C0000000 (timeline 1)
2021-09-03 22:56:17 UTC [4511]: VERBOSE: finished segment at 21C52/C4000000 (timeline 1)
2021-09-03 22:56:18 UTC [4511]: VERBOSE: finished segment at 21C52/C8000000 (timeline 1)
2021-09-03 22:56:19 UTC [4511]: ERROR: WAL segment 0000000100021E510000002A could not be streamed in 300 seconds
2021-09-03 22:56:19 UTC [4511]: WARNING: Backup QYUN4Y is running, setting its status to ERROR

The --archive-timeout option also apply to a STREAM WAL replication?

--archive-timeout=timeout    wait timeout for WAL segment archiving (default: 5min)

Or do I have to increase some timeout on PostgreSQL primary server?

Probably related to #275, but I don't fully understand all comments.

deric avatar Sep 04 '21 08:09 deric

Hello! Looks like pg_probackup fails to stream all required WAL segments within default 5 minutes timeout. Yes, you can use --archive-timeout options to increase the timeout. I would recommend to increase it up to 1h.

gsmolk avatar Sep 04 '21 12:09 gsmolk

Arrgh, right, now it make sense :disappointed:

Maybe the description could be improved a bit. The timeout is specified in minutes or seconds?

I thought it was a timeout to transfer single WAL, but it seems like it's timeout for transferring all remaining WAL files that weren't streamed during the pg_basebackup phase. In my case I have only 1 Gbps network, thus the transfer speed is limited to 100 MB/s... so the WAL streaming might be lagging a bit.

It is possible to disable the timeout completely? Maybe add an alias --wal-timeout or move it from the Archive options section ... it's a bit confusing when it is applied.

deric avatar Sep 04 '21 19:09 deric