barman icon indicating copy to clipboard operation
barman copied to clipboard

receive-wal fails to resume after PITR

Open emz00 opened this issue 5 years ago • 1 comments

barman 2.10, PostgreSQL 12.2, Ubuntu 18.04

I got barman running successfully with config like this

backup_method = rsync
streaming_archiver = on
slot_name = barman
create_slot = auto
archiver = true

After a point-in-time recovery (via barman recover --remote-ssh-command "ssh postgres@PGHOST" --target-time 2020-05-04T15:33 MYSERVER last /pgdata) receive-wal started failing with this:

2020-05-04 15:26:04,129 [27028] barman.server ERROR: Check 'replication slot' failed for server 'MYSERVER'
2020-05-04 15:26:04,312 [27035] barman.wal_archiver INFO: Found 1 xlog segments from streaming for MYSERVER. Archive all segments in one run.
2020-05-04 15:26:04,312 [27035] barman.wal_archiver INFO: Archiving segment 1 of 1 from streaming: MYSERVER/00000002.history
2020-05-04 15:26:04,313 [27035] barman.wal_archiver INFO: No xlog segments found from file archival for MYSERVER.
2020-05-04 15:26:04,317 [27036] barman.server INFO: Starting receive-wal for server MYSERVER
2020-05-04 15:26:04,482 [27036] barman.wal_archiver INFO: Activating WAL archiving through streaming protocol
2020-05-04 15:26:04,556 [27036] barman.command_wrappers INFO: MYSERVER: pg_receivewal: starting log streaming at 0/1C000000 (timeline 2)
2020-05-04 15:26:04,561 [27036] barman.command_wrappers INFO: MYSERVER: pg_receivewal: error: unexpected termination of replication stream: ERROR:  requested WAL segment 00000002000000000000001C has already been removed
2020-05-04 15:26:04,563 [27036] barman.command_wrappers INFO: MYSERVER: pg_receivewal: error: disconnected
2020-05-04 15:26:04,565 [27036] barman.server ERROR: ArchiverFailure:pg_receivexlog terminated with error code: 1
2020-05-04 15:26:04,611 [27028] barman.server ERROR: Check 'receive-wal running' failed for server 'MYSERVER'

I tried barman receive-wal --reset MYSERVER, but this also failed (see #282).

I then renamed barman's data directory for that server and started over, repeating the same test (backup, point-in-time restore) and receive-wal again failed, but with a different error:

2020-05-04 15:36:03,438 [31428] barman.wal_archiver INFO: No xlog segments found from streaming for MYSERVER.
2020-05-04 15:36:03,443 [31428] barman.wal_archiver INFO: Found 1 xlog segments from file archival for MYSERVER. Archive all segments in one run.
2020-05-04 15:36:03,443 [31428] barman.wal_archiver INFO: Archiving segment 1 of 1 from file archival: MYSERVER/00000003.history
2020-05-04 15:36:03,444 [31429] barman.server INFO: Starting receive-wal for server MYSERVER
2020-05-04 15:36:03,744 [31429] barman.wal_archiver INFO: Creating replication slot 'barman'
2020-05-04 15:36:03,747 [31429] barman.server INFO: Creating physical replication slot 'barman' on server 'MYSERVER'
2020-05-04 15:36:03,757 [31429] barman.server INFO: Replication slot 'barman' created
2020-05-04 15:36:03,760 [31429] barman.wal_archiver INFO: Activating WAL archiving through streaming protocol
2020-05-04 15:36:03,913 [31429] barman.command_wrappers INFO: MYSERVER: pg_receivewal: starting log streaming at 0/28000000 (timeline 2)
2020-05-04 15:36:03,919 [31429] barman.command_wrappers INFO: MYSERVER: pg_receivewal: error: could not send replication command "START_REPLICATION": 
2020-05-04 15:36:03,922 [31429] barman.command_wrappers INFO: MYSERVER: pg_receivewal: error: disconnected
2020-05-04 15:36:03,924 [31429] barman.server ERROR: ArchiverFailure:pg_receivexlog terminated with error code: 1

The Postgres server log does not show any errors:

2020-05-04 15:37:03.044 UTC [17240] [unknown]@[unknown] LOG:  connection received: host=BARMAN_IP port=36078
2020-05-04 15:37:03.058 UTC [17240] barman_dev@postgres LOG:  connection authorized: user=barman_dev database=postgres SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
2020-05-04 15:37:03.103 UTC [17240] barman_dev@postgres LOG:  disconnection: session time: 0:00:00.059 user=barman_dev database=postgres host=BARMAN_IP port=36078
2020-05-04 15:37:03.103 UTC [17239] barman_streaming_dev@[unknown] LOG:  disconnection: session time: 0:00:00.247 user=barman_streaming_dev database= host=BARMAN_IP port=36076
2020-05-04 15:37:03.194 UTC [17241] [unknown]@[unknown] LOG:  connection received: host=BARMAN_IP port=36080
2020-05-04 15:37:03.217 UTC [17241] barman_streaming_dev@[unknown] LOG:  replication connection authorized: user=barman_streaming_dev application_name=barman_receive_wal SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
2020-05-04 15:37:03.231 UTC [17241] barman_streaming_dev@[unknown] LOG:  disconnection: session time: 0:00:00.037 user=barman_streaming_dev database= host=BARMAN_IP port=36080

emz00 avatar May 04 '20 15:05 emz00

With the fix for #282 applied, receive-wal --reset seemed to fix the issue the first time, but then when I tried to restore and reset receive-wal again it failed:

$ barman receive-wal --reset MYSERVER
ERROR: The receive-wal position is ahead of PostgreSQL current WAL lsn (000000040000000000000029.partial > 000000020000000000000029)

This continued happening until I renamed the streaming sub-directory in the barman data directory. I presume this loses some data, but I don't know. The recovery process in barman is looking rather unreliable right now.

I've had it happen 3 times now. It seems to happen after I do a PITR, then another recovery (whether with or without --target-time). Doesn't happen every time, i.e. sometimes I can do 2-3 restores in a row, but often enough.

emz00 avatar May 05 '20 11:05 emz00