
Fix recovery IO deadlock

Open MMeent opened this issue 10 months ago • 16 comments

Previously, it was possible for backends to request a page with the LSN of the record currently being replayed. This could cause a deadlock when the redo process wanted to read that same page at the same time.

This LSN could only appear when the page was not present in the LwLSN cache, and the highest evicted LSN also was the LSN of the currently-replayed WAL record.

Problem

https://github.com/neondatabase/neon/issues/9955

Summary of changes

The issue is fixed by splitting maxLastWrittenLsn into two: one for data pages, and one for metadata. This allows us to keep track of metadata changes separately, removing the implicit dependency of page IO on metadata LSNs where appropriate.

Additionally, we stop evicting LwLSN entries for pages whose LSN has yet to be replayed. As a result, the global data-page LwLSN never returns the LSN of a record that has yet to be replayed, unless the startup process has already determined that it won't access that page again; page IO and replay waits by other backends using that LSN are therefore safe for those pages.

(Note: With this PR, an LSN > ReplayRecPtr might still be found in maxLastWrittenLsnMetadata, but never in maxLastWrittenLsnData)

Counter-offer to #11503

PG PRs:

  • PG14: https://github.com/neondatabase/postgres/pull/654
  • PG15: https://github.com/neondatabase/postgres/pull/653
  • PG16: https://github.com/neondatabase/postgres/pull/655
  • PG17: https://github.com/neondatabase/postgres/pull/652

MMeent avatar Jun 11 '25 20:06 MMeent

If this PR added a GUC in the Postgres fork or neon extension, please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in making sure this step is done.

github-actions[bot] avatar Jun 11 '25 20:06 github-actions[bot]

8481 tests run: 7819 passed, 79 failed, 583 skipped (full report)


Failures on Postgres 17

Failures on Postgres 16

Failures on Postgres 15

Failures on Postgres 14

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_compute_catalog[release-pg14] or test_compute_catalog[release-pg14] or test_drop_role_with_table_privileges_from_neon_superuser[release-pg14] or test_drop_role_with_table_privileges_from_neon_superuser[release-pg14] or test_compute_create_drop_dbs_and_roles[release-pg14] or test_compute_create_drop_dbs_and_roles[release-pg14] or test_drop_role_with_table_privileges_from_non_neon_superuser[release-pg14] or test_drop_role_with_table_privileges_from_non_neon_superuser[release-pg14] or test_db_with_custom_settings[release-pg14] or test_db_with_custom_settings[release-pg14] or test_dropdb_with_subscription[release-pg14] or test_dropdb_with_subscription[release-pg14] or test_compute_monitor[release-pg14] or test_compute_monitor[release-pg14] or test_extensions[release-pg14] or test_extensions[release-pg14] or test_lfc_prewarm[release-pg14-compute-ctl] or test_lfc_prewarm[release-pg14-compute-ctl] or test_lfc_prewarm[release-pg14-postgres] or test_lfc_prewarm[release-pg14-postgres] or test_lfc_prewarm_under_workload[release-pg14-compute-ctl] or test_lfc_prewarm_under_workload[release-pg14-compute-ctl] or test_lfc_prewarm_under_workload[release-pg14-postgres] or test_lfc_prewarm_under_workload[release-pg14-postgres] or test_replication_shutdown[release-pg14] or test_replication_shutdown[release-pg14] or test_neon_superuser[release-pg14] or test_neon_superuser[release-pg14] or test_pageserver_lost_and_transaction_committed[release-pg14] or test_pageserver_lost_and_transaction_committed[release-pg14] or test_pageserver_lost_and_transaction_aborted[release-pg14] or test_pageserver_lost_and_transaction_aborted[release-pg14] or test_isolation[release-pg14-v1-None] or test_isolation[release-pg14-v1-None] or test_sql_regress[release-pg14-v2-4] or test_sql_regress[release-pg14-v2-4] or test_isolation[release-pg14-v2-None] or test_isolation[release-pg14-v2-None] or test_sql_regress[release-pg14-v2-None] or 
test_sql_regress[release-pg14-v2-None] or test_isolation[release-pg14-v1-4] or test_isolation[release-pg14-v1-4] or test_pg_regress[release-pg14-v1-None] or test_pg_regress[release-pg14-v1-None] or test_pg_regress[release-pg14-v2-4] or test_pg_regress[release-pg14-v2-4] or test_pg_regress[release-pg14-v2-None] or test_pg_regress[release-pg14-v2-None] or test_isolation[release-pg14-v2-4] or test_isolation[release-pg14-v2-4] or test_sql_regress[release-pg14-v1-None] or test_sql_regress[release-pg14-v1-None] or test_pg_regress[release-pg14-v1-4] or test_pg_regress[release-pg14-v1-4] or test_sql_regress[release-pg14-v1-4] or test_sql_regress[release-pg14-v1-4] or test_postgres_version[release-pg14] or test_postgres_version[release-pg14] or test_role_grants[release-pg14] or test_role_grants[release-pg14] or test_multiple_subscription_branching[release-pg14] or test_multiple_subscription_branching[release-pg14] or test_subscriber_branching[release-pg14] or test_subscriber_branching[release-pg14] or test_timeline_size_createdropdb[release-pg14] or test_timeline_size_createdropdb[release-pg14] or test_check_visibility_map[release-pg14] or test_check_visibility_map[release-pg14] or test_vm_bit_clear_on_heap_lock_whitebox[release-pg14] or test_vm_bit_clear_on_heap_lock_whitebox[release-pg14] or test_postgres_version[release-pg15] or test_postgres_version[release-pg15] or test_postgres_version[release-pg16] or test_postgres_version[release-pg16] or test_postgres_version[release-pg17] or test_postgres_version[release-pg17] or test_postgres_version[debug-pg17] or test_postgres_version[release-pg17] or test_postgres_version[release-pg17]"
Flaky tests (1)

Postgres 14

Test coverage report is not available

The comment gets automatically updated with the latest test results
cb8782a51a5fca52363c946a2484b2c6319e8d63 at 2025-06-11T21:19:12.506Z :recycle:

github-actions[bot] avatar Jun 11 '25 21:06 github-actions[bot]

Sorry, but I still do not understand how these changes will help to prevent this scenario: a WAL record updates page X. It is not present in shared buffers but is present in the LFC. So our code updates its LwLSN and returns false (redo should be applied to the page). Then the page is evicted from the LFC, and some other backend tries to access it. Its LwLSN is larger than the currently applied LSN, so neon_write will block in XLogWaitForReplayOf. Then the redo handler tries to read the page for redo, but the buffer is already locked by the backend: deadlock.

knizhnik avatar Jun 12 '25 06:06 knizhnik

So our file updates it's LwLSN and returns false (redo should be applied to the page). Then page is swapped from LFC.

How is this possible? We only update LwLSN IFF we know we're not going to access the page, right?

MMeent avatar Jun 12 '25 08:06 MMeent

How is this possible? We only update LwLSN IFF we know we're not going to access the page, right?

	/*
	 * we don't have the buffer in memory, update lwLsn past this record, also
	 * evict page from file cache
	 */
	if (no_redo_needed)
	{
>>>		neon_set_lwlsn_block(end_recptr, rinfo, forknum, blkno);
		/*
		 * Redo changes if page exists in LFC.
		 * We should perform this check after assigning LwLSN to prevent
		 * prefetching of some older version of the page by some other backend.
		 */
		no_redo_needed = !lfc_cache_contains(rinfo, forknum, blkno);
	}

knizhnik avatar Jun 12 '25 09:06 knizhnik

One more question: if we prohibit eviction of LwLSN cache entries with LSN > applied_lsn, then why do we need this distinction between data and metadata? Metadata (relation) is now treated just like data (a page) with a special block number. So if we prohibit its eviction, why do we need to split the LwLSN cache and make all these changes in the Postgres submodules, doubling the number of callbacks?

knizhnik avatar Jun 12 '25 13:06 knizhnik

doubling number of callbacks?

What do you mean by that?

MMeent avatar Jun 12 '25 13:06 MMeent

doubling number of callbacks?

What do you mean by that?

Sorry, never mind; a moment of confusion on my part. It seems that you have added separate callbacks for updating data and metadata LwLSNs.

But I still do not understand why we need to split data and metadata in the LwLSN cache.

knizhnik avatar Jun 12 '25 13:06 knizhnik

But I still do not understand why we need to split data and metadata in the LwLSN cache.

Because the Startup process won't ever have to wait for other backends that are doing metadata requests, but it may have to wait for processes that are doing data requests. So the metadata global LSN getting ahead of the most recent WAL replay position is not much of an issue, but the data global LSN getting ahead is, because we still need the granularity there.

MMeent avatar Jun 12 '25 13:06 MMeent

But I still do not understand why we need to split data and metadata in the LwLSN cache.

Because the Startup process won't ever have to wait for other backends that are doing metadata requests, but it may have to wait for processes that are doing data requests. So the metadata global LSN getting ahead of the most recent WAL replay position is not much of an issue, but the data global LSN getting ahead is, because we still need the granularity there.

I completely agree with that. But if we prohibit eviction of LwLSN cache entries with LSN > replayed_lsn, then there will be no such problem, right? Yes, it is not needed for metadata, but if we do it for metadata too, what would be wrong? The amount of metadata (= the number of relations) is much smaller than the amount of data, and the number of "pinned" entries (whether metadata or data) is very small. So I just wonder: do we need to complicate the LwLSN code like this if there are no obvious benefits?

knizhnik avatar Jun 12 '25 14:06 knizhnik

But if we prohibit eviction of LwLSN cache entries with LSN > replayed_lsn, then there will be no such problem, right?

We don't store per-page entries for metadata; instead we immediately update the global LwLSN for those. That's why we need a separate element for it, so that we don't pollute the data pages' LSN with that of metadata.

MMeent avatar Jun 12 '25 15:06 MMeent

And my main concern is that you propose an alternative way of enforcing data consistency: not based on buffer locking, but on waiting until the WAL record is applied. I failed to find a scenario which demonstrates the problem, but we need to prove that the two are equivalent, don't we?

Also I am not sure that we still can avoid deadlocks. Consider the following scenario:

  1. Redo process applies WAL record R1 which refers pages P1 and P2.
  2. Redo process checks that P1 is not present in shared buffer, updates LwLSN(P1)=R1 and skips redo.
  3. Some other backend B1 reads page P2
  4. Backend B1 tries to read page P1, waiting until applied_lsn>=R1.
  5. Redo process checks that P2 is present in shared buffers and so wants to read it to perform redo. But it is locked by backend B1 -> deadlock.

knizhnik avatar Jun 12 '25 15:06 knizhnik

I think redo locking P1 -> P2 while backends lock P2 -> P1 will always be sensitive to deadlocks, regardless of this change; we don't have lightweight deadlock detection, only heavyweight deadlock detection.

So if any code like that exists, it would be an issue regardless of these code changes: 'wait for LSN to be replayed' is equivalent to waiting for an exclusive lock on the pages with that LSN, and in that case there is no new locking hazard.

MMeent avatar Jun 12 '25 15:06 MMeent

But if we prohibit eviction of LwLSN cache entries with LSN > replayed_lsn, then there will be no such problem, right?

We don't store per-page entries for metadata; instead we immediately update the global LwLSN for those. That's why we need a separate element for it, so that we don't pollute the data pages' LSN with that of metadata.

So it is not actually just "metadata" (relation size is also metadata), but database-level metadata. OK, maintaining a separate max value for it certainly makes sense.

knizhnik avatar Jun 12 '25 18:06 knizhnik

I think redo locking P1 -> P2 while backends lock P2 -> P1 will always be sensitive to deadlocks, regardless of this change; we don't have lightweight deadlock detection, only heavyweight deadlock detection.

So if any code like that exists, it would be an issue regardless of these code changes: 'wait for LSN to be replayed' is equivalent to waiting for an exclusive lock on the pages with that LSN, and in that case there is no new locking hazard.

It sounds reasonable. But I am afraid that by waiting for record redo we introduce dependencies which do not exist in vanilla. Take, for example, an FPI record. It can contain up to 32 blocks. These blocks are restored independently, i.e. Postgres does not hold locks on all N buffers at once. So even if redo restores pages in order X, Y while some backend accesses them in order Y, X (holding the lock on Y while loading X), no deadlock is possible in vanilla. But in our case it can happen.

knizhnik avatar Jun 13 '25 05:06 knizhnik

So I have several concerns about your approach:

  1. It is not proven that waiting for the end of replay in neon_read is equivalent to the page locking done by vanilla.
  2. It introduces extra dependencies which can lead to deadlocks.
  3. It requires an extra spinlock in GetXLogReplayRecPtr (though it is unlikely this can noticeably affect performance).
  4. It can cause pinning of up to 32 pages (the max number of blocks in a WAL record). That is 256kB: not so large, but it can be noticeable for shared_buffers=1MB.

knizhnik avatar Jun 13 '25 05:06 knizhnik

Removing myself from review, this change is inside pgxn and doesn't require storage review afaict.

problame avatar Jun 26 '25 17:06 problame