neon in one case replication wasn't able to read WAL (hard to reproduce)

Jan 16 '24 16:01 vadim2404

@arssher , @kelvich mentioned that you, probably, have already fixed this issue. Is it true?

Feb 08 '24 12:02 vadim2404

Not yet. If I match correctly this issue is about some delay during compute startup when WAL is being downloaded from safekeepers; if subscriber appears before it finishes it might get error. It would be mitigated by using recently implemented on-demand WAL download from safekeepers inside logical walsender.

Feb 08 '24 12:02 arssher

@save-buffer to figure out next steps with this item cc @arssher

Feb 20 '24 16:02 vadim2404

Speaking with Arseny, seems that we essentially need to make an XLogReader-compatible API for neon_walreader. Changes we need to make are:

Move walproposer state into shmem (specifically, information regarding if walproposer is alive, and its last committed LSN).
Add custom hooks into src/backend/replication/walsender.c that allow you to have custom wal reader. Some potential issues:
- walsender uses xlogreader to read WAL, which performs some sort of buffering internally
- neon_walreader API is async, need some sort of wrappers there. Blocking until WAL is read is problematic because WalSndWaitForWal performs a bunch of checks while it's waiting that we'd rather not duplicate. To start we'll ignore this and pretend that neon_walreader is sync and won't take too long.
Use neon_walreader in these hooks to provide custom WAL reader

So basically, I propose we add Walsender_CustomXLogReader_Init callback which we'll use to setup our own XLogReaderRoutine.

Feb 20 '24 20:02 save-buffer

Sorry, I have not participated in this conversation so I didn't know arguments why you decided that we need walreader to communicate with walproposer.

I think that @ars has correctly diagnosed the problem. Yes, we now calculate min(slot.restart_lsn) and walproposer at start download requested WAL. The problem is that it is not somehow synchrnonized with other Postgres backends, i.e. walsender which can request some WAL before it is actually uploaded. And yes, there will be no such problem if we have on-demand WAL download. But we do not have it yet. IMHO that instead of "making an XLogReader-compatible API for neon_walreader" it is better to implement on-demand WAL download (sorry it you actually mean the same).

But may be the most straightforward approach to the problem is to implement barrier holding walsenders until all WAL is loaded. For example, something like this happen with normal Postgres recovery at startup: until recovery is completed Postgres doesn't accept any connection requests. In our case it seems to be too much - it is enough to hold only walsenders.

Feb 21 '24 06:02 knizhnik

Ok as far as I can tell, this must've been a different bug, because I"m failing to reproduce the conditions that @ars described. I wrote a test here: https://github.com/neondatabase/neon/pull/6872

The test pauses safekeepers from sending WAL to compute nodes, and it checks to make sure that the subscriber just doesn't receive the updates. So it seems like such a barrier already exists? Or my test is broken, which is a possibility, but I've double-checked everything I can think of

Feb 23 '24 19:02 save-buffer

OK so update, test does seem to reproduce the issue, in that the logs print

PG:2024-02-27 20:56:03.023 GMT [1810266] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '2', publication_names '"pub"')
PG:2024-02-27 20:56:03.023 GMT [1810266] ERROR:  invalid magic number 0000 in log segment 000000010000000000000001, offset 7069696
PG:2024-02-27 20:56:03.023 GMT [1810266] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '2', publication_names '"pub"')

And then subscriber receives this error and repeatedly retries:

2024-02-27 21:07:04.336 UTC [1811447] LOG:  logical replication apply worker for subscription "sub1" has started
2024-02-27 21:07:04.339 UTC [1811447] ERROR:  could not receive data from WAL stream: ERROR:  invalid magic number 0000 in log segment 000000010000000000000001, offset 7069696
2024-02-27 21:07:04.339 UTC [1810248] LOG:  background worker "logical replication worker" (PID 1811447) exited with exit code 1

So I will proceed with implementing a fix so that it doesn't return this error

Feb 29 '24 23:02 save-buffer

Discussion: we are closing this one.

Mar 05 '24 15:03 andreasscherbaum

Reopening because to be clear: the test does reproduce the issue

Mar 05 '24 18:03 save-buffer

No update this week

Mar 13 '24 11:03 andreasscherbaum

Debugging phase: running tests and fixing issues. After that it needs some cleanup.

Mar 19 '24 16:03 andreasscherbaum

On to the last bugs, Neon walreader is reading garbage.

Mar 26 '24 16:03 andreasscherbaum

Waiting for review.

Apr 03 '24 15:04 andreasscherbaum

Need to address comments on PR

Apr 30 '24 15:04 lubennikovaav

Closing as https://github.com/neondatabase/neon/pull/6872 is merged!

May 06 '24 17:05 save-buffer