neon Remove XLogWaitForReplayOf at replica to avoid deadlock

Problem

See https://github.com/neondatabase/neon/issues/9955

Waiting for replay can cause deadlock because waiting thread is holding lock on the buffer and doesn't allow redo to proceed.

Summary of changes

Remove call of XLogWaitForReplayOf

Apr 09 '25 12:04 knizhnik

If this PR added a GUC in the Postgres fork or neon extension, please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in making sure this step is done.

Apr 09 '25 12:04 github-actions[bot]

I don't think this is valid, as this could cause issues when changes to some pages now become visible before others, where previously they would become visible at exactly the same time.

E.g. page X and Y are both modified in a record.

With this patch, you can see page X modified but not Y, or Y modified but not X, regardless of the order in which you access them. I think that may cause issues for some data structures that depend on transactional page modifications during replay, such as heap, btree, GIST.

Apr 09 '25 12:04 MMeent

8569 tests run: 7956 passed, 2 failed, 611 skipped (full report)

Failures on Postgres 16

test_backward_compatibility: release-x86-64-with-lfc
test_forward_compatibility: release-x86-64-with-lfc

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_backward_compatibility[release-pg16] or test_forward_compatibility[release-pg16]"

Flaky tests (4)

Postgres 17

test_forward_compatibility: debug-x86-64-without-lfc
test_timeline_retain_lsn[offload-corrupt-True]: debug-x86-64-without-lfc
test_timeline_archival_chaos: release-arm64-with-lfc

Postgres 16

test_ancestor_detach_branched_from[False-True-head]: release-arm64-with-lfc

Test coverage report is not available

_{The comment gets automatically updated with the latest test results
d05cefaae1a8abe1ac28c8c911ccc4786ecae91a at 2025-06-30T08:01:49.172Z :recycle:}

Apr 09 '25 13:04 github-actions[bot]

I don't think this is valid, as this could cause issues when changes to some pages now become visible before others, where previously they would become visible at exactly the same time.

E.g. page X and Y are both modified in a record.

With this patch, you can see page X modified but not Y, or Y modified but not X, regardless of the order in which you access them. I think that may cause issues for some data structures that depend on transactional page modifications during replay, such as heap, btree, GIST.

I wonder what is the difference with Vanilla Postgres here? The only difference of our redo with Vanilla is that it can skip updates of some pages which are not present in shard buffers and LFC. But it updates it LwLSN.

Lemmas:

Compute is accessing pages with request_lsn=LwLSN
LwLSN is updated when dirty page os thrown away from shared buffer or in neon_redo_read_buffer_filter when redo is skipped.
neon_redo_read_buffer_filter is called during redo when WAL record is already decoded and EndRecPtr points to the end of replayed record.

So lets consider that WAL records updating two pages: X and Y. If both X and Y are present in share buffers, then there is no difference in vanilla. Atomicity of page modification is enforce by Postgres using locking.

Let's say that X is present in shared buffer and so updated by redo while Y - not.

If some backend accesses X before it is updated, then it will see old version. If accesses X after it was updated, then it see new version. If it then access Y, then once again it depends on whether Y was already accessed by redo or not. If it is already accessed, then its LwLSN is updated and we get new imagoes the page from PS. If it is not updated, then we got old image. But exactly the same will happen in Vanilla.

So I still do not understand why do we need to wait record redo completion here. It can caused deadlock (which is reproduced) but it is not clear to me which inconsistency it can prevent (comparing with Vanilla).

Apr 09 '25 14:04 knizhnik

The issue is that it is possible for a backend to see modifications on page X that depend on modifications in page Y, but won't see those changes in page Y. Ex:

Given

WAL Record with Lsn(Z) modifies pages X and Y
Replay does modification in X -> Y order, with exclusive locking on (X) -> (X+Y).
Workload reads in X -> Y order

Then it is possible that page X, if not in shared buffers, may be read by a concurrent process at version Lsn(Z), but reads page Y at a version < Lsn(Z); which is inconsistent with vanilla's behaviour.

Timeline:

Redo: ReadPage(X): Missing, set LWLsn(X) = Lsn(Z), release locks
Workload: ReadPage(X): Missing, LWLsn=Lsn(Z), read to buffers
Workload: ReadPage(Y): Present, page lsn < Lsn(Z)
Redo: ReadPage(Y): Present

I find it weird that it's possible for Redo to try to read pages from shmem after it already checked for its presence and found the page missing (hence why it set the LwLSN). Maybe we should try to fix that instead?

Jun 10 '25 10:06 MMeent

The issue is that it is possible for a backend to see modifications on page X that depend on modifications in page Y, but won't see those changes in page Y. Ex:

Given

WAL Record with Lsn(Z) modifies pages X and Y

Replay does modification in X -> Y order, with exclusive locking on (X) -> (X+Y).

Workload reads in X -> Y order

Then it is possible that page X, if not in shared buffers, may be read by a concurrent process at version Lsn(Z), but reads page Y at a version < Lsn(Z); which is inconsistent with vanilla's behaviour.

Timeline:

Redo: ReadPage(X): Missing, set LWLsn(X) = Lsn(Z), release locks

Workload: ReadPage(X): Missing, LWLsn=Lsn(Z), read to buffers

Workload: ReadPage(Y): Present, page lsn < Lsn(Z)

Redo: ReadPage(Y): Present

I find it weird that it's possible for Redo to try to read pages from shmem after it already checked for its presence and found the page missing (hence why it set the LwLSN). Maybe we should try to fix that instead?

So the problem is that Vanilla is holding lock on the accessed pages and so prevents workload from reading page X until Y is also updated, while Neon duo not reconstruct X and do not hold lock on it when it is not present I shared buffers?

It seems to me that the only possible solution is to get buffer and hold lock even if we are not going to reconstruct the page.

Jun 10 '25 11:06 knizhnik

It seems to me that the only possible solution is to get buffer and hold lock even if we are not going to reconstruct the page.

No, I think we should EITHER update LWLsn to Lsn(Z), OR read the page at Lsn(<Z). That way we won't have to wait for the Startup/WAL replay process (unless the WAL redo process already marked the page as "not going to access it", and we'll never hit this deadlock.

Jun 10 '25 12:06 MMeent

I think we should EITHER update LWLsn to Lsn(Z),

But it is what we are doing now, isn't it? See above in your timeline: set LWLsn(X) = Lsn(Z)

OR read the page at Lsn(<Z).

I do not think that it is correct. Assume opposite scenario: X is present in shared buffer, Y - not.

Redo: ReadPage(X): Present, reconstruct page Workload: ReadPage(X): Present, read most recent version of the page Workload: ReadPage(Y): Missed, read page at lsn < Lsn(Z) Redo: ReadPage(Y): Present

May be locking help to prevent such scenario, but I am not sure.

Jun 10 '25 12:06 knizhnik

I find it weird that it's possible for Redo to try to read pages from shmem after it already checked for its presence and found the page missing (hence why it set the LwLSN). Maybe we should try to fix that instead?

I do not think that it is what happen in this case. In the stack trace of redo process, you can see that XLogReadBufferForRedo is called first time for heap_xlog_update.

I think the main problem is that XLogWaitForReplayOf is called for request_lsn. It is not correct because request_lsn==replay_lsn:

			XLogRecPtr	last_written_lsn = last_written_lsns[i];
			if (last_written_lsn > replay_lsn)
			{
				/* GetCurrentReplayRecPtr was introduced in v15 */
#if PG_VERSION_NUM >= 150000
				Assert(last_written_lsn == GetCurrentReplayRecPtr(NULL));
#endif

				/*
				 * Cases 2 and 4. If this is a backend (case 4), the
				 * neon_read_at_lsn() call later will wait for the WAL record to be
				 * fully replayed.
				 */
				result->request_lsn = last_written_lsn;
			}
			else
			{
				/* cases 1 and 3 */
				result->request_lsn = replay_lsn;
			}
			result->not_modified_since = last_written_lsn;

We should rather wait for not_modified_since. But I am not sure that it is enough. Consider the following scenario:

redo process calls heap_xlog_update
It updates visibility map ( XLH_UPDATE_OLD_ALL_VISIBLE_CLEARED flag is set)
VM page is swapped from shared buffer
VM page is swapped from LwLSN cache
workload access page X - it is not present in shared buffers and I LwLSM cache. So it calls neon_read with lwlsn=redo_lsn, and neon_read is blocked in XLogWaitForReplayOf
redo process calls XLogReadBufferForRedo(X) which is blocked waiting for buffer hold by workload backend

Possibility of 3) and especially 4) is very very small. So I am almost sure that if we replace XLogWaitForReplayOf(reqlsns->request_lsn) with XLogWaitForReplayOf(reqlsns->not_modified_since) then nobody will be able to reproduce this deadlock.

But still IMO it can happen with non-zero probability.

Jun 10 '25 14:06 knizhnik

Some correction, scenario described above can not happen, because updating visibility map is not changing page LSN and so can not cause LwLSN=replay_lsn But it still can happen if wal record accesses two or more records, i.e. X and Y. Assumed that CX is not present in shared buffer, so we skip it but assign LwLSN(X)=replay_lsn. Then if page is swapped out from both shared buffers and LwLSN cache, then we can read Y with replay_lsn. It can not happened for heap_xlog_update because it release all buffers at the end of heap_xlog_update. But I am not sure that it can not happen for some other wal record, ie. FPI.

Jun 10 '25 14:06 knizhnik

I think you misunderstand: Right now, we always update LwLsn, and then, optionally, read the page if the page was in shared buffers or LFC.

Instead, the "and then, optionally, " should be replaced with EITHER/OR. If the LwLSN of a page is set to the LSN of a record currently in replay, then we should never try to read the page of that record in the Startup/WAL replay process - we've already determined it wasn't in buffers, so we shouldn't try to read it from buffers again.

This means the LwLSN of a page can be treated as locked-by-replay while the replay process is still working on the record indicated by that LSN, fixing the ordering issue.

(This does require LwLSN cache to never evict in-redo pages, which seems like a fair requirement. At worst we'll have to adapt the LwLSN cache to be read-only in concurrent backends, so that the newest modifications are never evicted.)

Jun 10 '25 14:06 MMeent

Right now, we always update LwLsn, and then, optionally, read the page if the page was in shared buffers or LFC.

It is not true:

	 * we don't have the buffer in memory, update lwLsn past this record, also
	 * evict page from file cache
	 */
	if (no_redo_needed)
	{
		neon_set_lwlsn_block(end_recptr, rinfo, forknum, blkno);
		/*
		 * Redo changes if page exists in LFC.
		 * We should perform this check after assigning LwLSN to prevent
		 * prefetching of some older version of the page by some other backend.
		 */
		no_redo_needed = !lfc_cache_contains(rinfo, forknum, blkno);
	}

So we do not update LwLSN if page is present in shared buffers.

Jun 10 '25 14:06 knizhnik

If the LwLSN of a page is set to the LSN of a record currently in replay, then we should never try to read the page of that record in the Startup/WAL replay process - we've already determined it wasn't in buffers, so we shouldn't try to read it from buffers again.

We actually never do it.

Jun 10 '25 14:06 knizhnik

Right now, we always update LwLsn, and then, optionally, read the page if the page was in shared buffers or LFC.

It is not true: [...] So we do not update LwLSN if page is present in shared buffers.

Then how come we're hitting deadlocks with the active backend waiting for the LSN of the replay process? The redo process doesn't request the same page twice, does it?

Jun 10 '25 14:06 MMeent

So IMO the problem can be fixed by:

Replacing XLogWaitForReplayOf(reqlsns->request_lsn) with XLogWaitForReplayOf(reqlsns->not_modified_since). It is correct in any case.
Add my long ago proposed optimisation than if we receive FPI, then we should always apply its even if page is not present in shard buffer. Looks like it is not only optimisation but protection from LwLSN cache overflow. May be I missed something by applying FPI record is the only place when we pin-unpin buffers in the loop, so it is possible that page X with assigned LwLSN=replay_lsn is evicted from LwLSN cache and cause reading some other page Y with LwLSN=replay_lsn which can cause deadlock.

Jun 10 '25 14:06 knizhnik

Then how come we're hitting deadlocks with the active backend waiting for the LSN of the replay process? T

I explained it above - it is not correct to call XLogWaitForReplayOf(reqlsns->request_lsn) because request_lsn=replay_lsn.

Jun 10 '25 14:06 knizhnik

So whats happen now:

Workload loads page X. Because of XLogWaitForReplayOf(reqlsns->request_lsn) it waits until applying current record is completed.
Redo process wants to update X. It is present in shared buffer (because 1 is loading it). But it has to wait until 1) completes loading. But it in turn waits 2). Deadlock.

Jun 10 '25 14:06 knizhnik

it is not correct to call XLogWaitForReplayOf(reqlsns->request_lsn) because request_lsn=replay_lsn.

But that's correct, as replay_lsn is the LSN of the last completely replayed record (done with), not the one we're currently working on (not: work in progress). So if the request_lsn is replay_lsn, then that should never have to wait for WAL redo - that record has already been replayed.

Jun 10 '25 14:06 MMeent

But that's correct, as replay_lsn is the LSN of the last completely replayed record (done with), not the one we're currently working on (not: work in progress). So if the request_lsn is replay_lsn, then that should never have to wait for WAL redo - that record has already been replayed.

Well, you are right. I missed GetXLogReplayRecPtr with GetCurrentReplayRecPtr.

Jun 10 '25 15:06 knizhnik

But still I do not see any sense in XLogWaitForReplayOf(reqlsns->request_lsn) If request_lsn is GetXLogReplayRecPtr, then this record is already applied and XLogWaitForReplayOf is not needed. If request_lsn is LwLSN, then it is the same as calling XLogWaitForReplayOf(reqlsns->not_modofoed_since)

Jun 10 '25 15:06 knizhnik

The question is how LwLSN of page is managed to be equal to XLogRecoveryCtl->replayEndRecPtr?

I see two possible explanations:

Overflow of LwLSN cache. neon_redo_read_buffer_filter is assigning record->EndRecPtr to relation. Then this entry can be evicted and cause LwLsnCache->maxLastWrittenLsn == replayEndRecPtr
Presence of LFC. If page is not present in shared buffer, but present in LFC, then we updates it LwLSN to EndRecPtr and then tries to load this page. If page at this moment is already thrown away fro LFC, then we will try to load it from PS with request_lsn=replayEndRecPtr

Jun 10 '25 15:06 knizhnik

But still I do not see any sense in XLogWaitForReplayOf(reqlsns->request_lsn)

We have to wait for the "replay lock" on the LwLSN of the page to expire. The LwLSN, which was set to the current WAL record's end LSN by REDO of the page. That saves us from ordering issues. If WAL redo has already gone past that LSN, then this is a no-op.

I suspect the main issue (https://github.com/neondatabase/neon/issues/9955) is that REDO sets the LwLSN for a page X, and that setting that LwLSN updates the global max, which can then be returned for another page's IO when it isn't present in the LwLSN cache (i.e., page Y is not present in LwLSN cache, but requested by another backend, and then the redo process).

Preferably LwLSN cache does not return LSNs of non-replayed records unless the current replay record touches that page.

Ergo, we'll have to figure something out about that.

Jun 10 '25 15:06 MMeent

I suspect the main issue (https://github.com/neondatabase/neon/issues/9955) is that REDO sets the LwLSN for a page X, and that setting that LwLSN updates the global max, which can then be returned for another page's IO when it isn't present in the LwLSN cache (i.e., page Y is not present in LwLSN cache, but requested by another backend, and then the redo process).

Yes, exactly.

Preferably LwLSN cache does not return LSNs of non-replayed records unless the current replay record touches that page.

How it it possible? We specially assign LwLSN=EndRecPtr in neon_redo_read_buffer_filter. And if this entry is evicted than global max will be beyond replayed boundary.

Jun 10 '25 15:06 knizhnik

I suggest to return back to the beginning of the discussion: why do we need XLogWaitForReplayOf? You demonstrated scenario when lack of lock of page X (which is not present in shared buffers and so not updated by redo), can cause inconsistent behaviour with vanilla.

neon_redo_read_buffer_filter is given pointer to XLogReaderState, so we know number of blocks in the applied WAL record. I wonder if we can skip redo pf pages only if all blacks are not present in shared buffers. Otherwise always redo all pages.

Simplified version of this rule: always redo all pages for wal records with multiple blocks.

Jun 10 '25 16:06 knizhnik

Simplified version of this rule: always redo all pages for wal records with multiple blocks.

That would cause IO in redo, and we usually prefer to not do any IO in REDO if we can help it.

I think we'd better make sure that we don't have to wait for evicted LwLSNs, by making LwLSN contain only Redo's changes (i.e. disabling getpage@lsn and neon_write -based LwLSN updates), thus removing any chance of evicting the LwLSN entry of a record during replay of that record.

Jun 10 '25 16:06 MMeent

That would cause IO in redo, and we usually prefer to not do any IO in REDO if we can help it.

It seems to me that there quite few WAL records updating multiple pages.

I think we'd better make sure that we don't have to wait for evicted LwLSNs, by making LwLSN contain only Redo's changes (i.e. disabling getpage@lsn and neon_write -based LwLSN updates), thus removing any chance of evicting the LwLSN entry of a record during replay of that record.

Do I correctly understand that you suggest not to update LwLSN for skipped page and always use GetXLogReplayRecPtr() both for request_lsn and not_modified_since. I afraid that it may cause unnecessary waits of get page because PS can be beyond replica.

Jun 10 '25 16:06 knizhnik

Still do not understand your proposal. Assume WAL records updates two pages X and Y. We need to provide consistent state of this two pages to the backend. In Vanilla it is enforced using locks. If we do not want to redo X (and so load and lock) some page, then doesn't matter with which LSN we request this page from PS. It still may be inconsistent with Y (either older, either younger). Do you agree.

So the only solution I see is to redo both pages or none of them. Or at least pin and lock buffers for both pages. But second one seems to be more complicated than just not skipping any of them.

Jun 10 '25 17:06 knizhnik

@MMeent - I have added to this PR implementation of my proposal: now partial apply of wal records is prohibited. Either all records are applied, either all are skipped.

I do not think that it will cause additional IO in redo. First of all percent of records updating more than one page is relatively small. And even for such wal records probability that we need only one of the affected pages is very small. If working set of primary is the same as working set of standby, then we actually need to reconstruct all records. And if working set are different, then unlikely than working set includes only one of the affected pages.

In any case - as I wrote above, I do not see some other solution.

Jun 11 '25 15:06 knizhnik