in one case replication wasn't able to read WAL (hard to reproduce)
@arssher, @kelvich mentioned that you have probably already fixed this issue. Is that true?
Not yet. If I'm matching this correctly, this issue is about a delay during compute startup while WAL is being downloaded from safekeepers; if a subscriber connects before the download finishes, it may get an error. It would be mitigated by using the recently implemented on-demand WAL download from safekeepers inside the logical walsender.
@save-buffer to figure out next steps with this item cc @arssher
After speaking with Arseny, it seems that we essentially need to make an XLogReader-compatible API for neon_walreader. The changes we need to make are:
- Move walproposer state into shmem (specifically, whether the walproposer is alive and its last committed LSN).
- Add custom hooks into src/backend/replication/walsender.c that allow you to have a custom WAL reader. Some potential issues:
  - walsender uses xlogreader to read WAL, which performs some buffering internally
  - the neon_walreader API is async, so we need some sort of wrapper there. Blocking until WAL is read is problematic because WalSndWaitForWal performs a bunch of checks while it's waiting that we'd rather not duplicate. To start, we'll ignore this and pretend that neon_walreader is synchronous and won't take too long.
- Use neon_walreader in these hooks to provide the custom WAL reader.
So basically, I propose we add a Walsender_CustomXLogReader_Init callback which we'll use to set up our own XLogReaderRoutine.
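A rough sketch of how that could look. The hook name follows the proposal above; the extension-side functions and anything touching neon_walreader are hypothetical placeholders, not the actual Neon API:

```c
/*
 * Sketch only: a walsender hook for plugging in a custom XLogReaderRoutine.
 * Everything on the "neon extension side" (function names, the way WAL is
 * actually fetched) is illustrative.
 */

/* --- walsender.c side --- */
#include "access/xlogreader.h"

typedef void (*Walsender_CustomXLogReader_Init_hook_type) (XLogReaderRoutine *routine);
Walsender_CustomXLogReader_Init_hook_type Walsender_CustomXLogReader_Init_hook = NULL;

/* Called where walsender currently builds its XLogReaderRoutine. */
static void
WalSndInitReaderRoutine(XLogReaderRoutine *routine)
{
	/* defaults: read WAL from local pg_wal, as today */
	routine->page_read = logical_read_xlog_page;
	routine->segment_open = WalSndSegmentOpen;
	routine->segment_close = wal_segment_close;

	/* let an extension override the callbacks */
	if (Walsender_CustomXLogReader_Init_hook)
		Walsender_CustomXLogReader_Init_hook(routine);
}

/* --- neon extension side (hypothetical) --- */
static int
neon_wal_page_read(XLogReaderState *state, XLogRecPtr targetPagePtr,
				   int reqLen, XLogRecPtr targetRecPtr, char *readBuf)
{
	/*
	 * First cut: pretend neon_walreader is synchronous.  Fetch the page
	 * from safekeepers (or local pg_wal) into readBuf and return the
	 * number of bytes read, or -1 on failure.
	 */
	return -1;					/* placeholder */
}

static void
neon_wal_segment_open(XLogReaderState *state, XLogSegNo nextSegNo,
					  TimeLineID *tli_p)
{
	/* nothing to open locally when WAL comes from safekeepers */
}

static void
neon_wal_segment_close(XLogReaderState *state)
{
}

static void
neon_walsender_reader_init(XLogReaderRoutine *routine)
{
	routine->page_read = neon_wal_page_read;
	routine->segment_open = neon_wal_segment_open;
	routine->segment_close = neon_wal_segment_close;
}

/* in _PG_init(): Walsender_CustomXLogReader_Init_hook = neon_walsender_reader_init; */
```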
Sorry, I have not participated in this conversation, so I don't know the arguments for why you decided that we need the walreader to communicate with the walproposer.
I think that @ars has correctly diagnosed the problem. Yes, we now calculate min(slot.restart_lsn) and the walproposer downloads the requested WAL at startup. The problem is that this is not synchronized with the other Postgres backends, i.e. a walsender can request some WAL before it has actually been downloaded. And yes, there will be no such problem once we have on-demand WAL download, but we do not have it yet. IMHO, instead of "making an XLogReader-compatible API for neon_walreader", it is better to implement on-demand WAL download (sorry if you actually mean the same thing).
But maybe the most straightforward approach to the problem is to implement a barrier holding walsenders until all WAL is loaded. Something like this happens with normal Postgres recovery at startup: until recovery is completed, Postgres doesn't accept any connection requests. In our case that seems to be too much; it is enough to hold only the walsenders.
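A minimal sketch of that barrier idea, assuming the walproposer can publish a "WAL download finished" flag in shared memory. All names here (WalproposerBarrier, walprop_barrier, wal_download_done, WaitForWalDownload) are made up for illustration, not existing Neon structures:

```c
/*
 * Sketch: a starting walsender waits until the walproposer reports that
 * WAL download from safekeepers is finished.
 */
#include "postgres.h"

#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"

typedef struct WalproposerBarrier
{
	/* set to true by walproposer once the needed WAL is available locally */
	volatile bool wal_download_done;
} WalproposerBarrier;

extern WalproposerBarrier *walprop_barrier;	/* assumed to live in shmem */

/* Call early in walsender startup, before serving START_REPLICATION. */
static void
WaitForWalDownload(void)
{
	while (!walprop_barrier->wal_download_done)
	{
		CHECK_FOR_INTERRUPTS();

		/* Poll with a short timeout; a condition variable would be nicer. */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 100 /* ms */ ,
						 WAIT_EVENT_WAL_SENDER_MAIN);
		ResetLatch(MyLatch);
	}
}
```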
Ok, as far as I can tell this must've been a different bug, because I'm failing to reproduce the conditions that @ars described. I wrote a test here: https://github.com/neondatabase/neon/pull/6872
The test pauses the safekeepers from sending WAL to compute nodes, and it checks that the subscriber simply doesn't receive the updates. So it seems like such a barrier already exists? Or my test is broken, which is a possibility, but I've double-checked everything I can think of.
OK, update: the test does seem to reproduce the issue, in that the logs print
PG:2024-02-27 20:56:03.023 GMT [1810266] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '2', publication_names '"pub"')
PG:2024-02-27 20:56:03.023 GMT [1810266] ERROR: invalid magic number 0000 in log segment 000000010000000000000001, offset 7069696
PG:2024-02-27 20:56:03.023 GMT [1810266] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '2', publication_names '"pub"')
And then the subscriber receives this error and repeatedly retries:
2024-02-27 21:07:04.336 UTC [1811447] LOG: logical replication apply worker for subscription "sub1" has started
2024-02-27 21:07:04.339 UTC [1811447] ERROR: could not receive data from WAL stream: ERROR: invalid magic number 0000 in log segment 000000010000000000000001, offset 7069696
2024-02-27 21:07:04.339 UTC [1810248] LOG: background worker "logical replication worker" (PID 1811447) exited with exit code 1
So I will proceed with implementing a fix so that it doesn't return this error.
Discussion: we are closing this one.
Reopening because, to be clear, the test does reproduce the issue.
No update this week
Debugging phase: running tests and fixing issues. After that it needs some cleanup.
On to the last bugs: the Neon walreader is reading garbage.
Waiting for review.
Need to address comments on PR
Closing as https://github.com/neondatabase/neon/pull/6872 is merged!