Add cache for relation persistence
Problem
See https://github.com/neondatabase/neon/issues/12073 and https://neondb.slack.com/archives/C04DGM6SMTM/p1748620049660389
There is a race condition in the current unlogged build scheme: in neon_write with smgr_relpersistence == 0, we first check whether the local file exists and, if so, assume an unlogged build is in progress and call mdwrite to perform the local write. But many things can happen between mdexists and mdwritev. For example, another backend can complete the unlogged build and unlink these files.
Summary of changes
Add a cache for relation kind, which avoids mdexists calls and eliminates the race condition at the end of an unlogged build.
If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:
make NEON_WORKDIR=path/to/neon/checkout \
-C goapp/internal/shareddomain/postgres generate
If you're an external contributor, a Neon employee will assist in making sure this step is done.
9130 tests run: 8475 passed, 0 failed, 655 skipped (full report)
Code coverage* (full report)
functions: 34.7% (8842 of 25482 functions)
lines: 45.7% (71646 of 156719 lines)
* collected from Rust tests only
fbd668ee3d0ffa315800367d1ac244173ce847c2 at 2025-07-31T16:31:46.640Z :recycle:
@knizhnik , please pay attention to the new test failures produced for 6bcd46b2: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-12166/15589546198/index.html#/testresult/55c01525c6a0084d
regress/regression.diffs
diff -U3 /__w/neon/neon/vendor/postgres-v15/src/test/regress/expected/gist.out /tmp/test_output/test_pg_regress[release-pg15-v1-4]-1/regress/results/gist.out
--- /__w/neon/neon/vendor/postgres-v15/src/test/regress/expected/gist.out 2025-06-11 15:57:37.168439715 +0000
+++ /tmp/test_output/test_pg_regress[release-pg15-v1-4]-1/regress/results/gist.out 2025-06-11 16:05:34.917725056 +0000
@@ -376,15 +376,8 @@
-- This case isn't supported, but it should at least EXPLAIN correctly.
explain (verbose, costs off)
select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
- QUERY PLAN
-------------------------------------------------------------------------------------
- Limit
- Output: p, ((circle(p, '1'::double precision) <-> '(0,0)'::point))
- -> Index Only Scan using gist_tbl_multi_index on public.gist_tbl
- Output: p, (circle(p, '1'::double precision) <-> '(0,0)'::point)
- Order By: ((circle(gist_tbl.p, '1'::double precision)) <-> '(0,0)'::point)
-(5 rows)
-
+ERROR: lock neon_relkind is not held
+CONTEXT: writing block 0 of relation base/16384/26093
select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
ERROR: lossy distance functions are not supported in index-only scans
-- Clean up
Thank you for reporting the problem. I hope https://github.com/neondatabase/neon/pull/12166/commits/b81baef2d1d7c55326c472deec3e8bfcfe8e02cf will fix it.
Removing myself from review, speaking as a storage person, I don't feel qualified to review this. If there's anything in the design that you want reviewed / cross-checked with storage team, please write a mini-rfc explaining what is being cached here, how invalidation works, etc.
I think the locking gets more straightforward, if you move the LWLockAcquire/Release calls out of relkind_cache.c, into pagestore_smgr.c:
- When starting an unlogged build, look up the entry with get_cached_relkind(), and set the relkind to RELKIND_UNLOGGED_BUILD.
- When ending an unlogged build, acquire the lock, update the relkind in the entry to RELKIND_PERMANENT, release the lock, then release the pin on the entry.
- In neon_read, call get_cached_relkind to look up and pin the entry. If the relkind is RELKIND_UNLOGGED_BUILD, acquire the LWLock too.
Let's rename relkind_lock to something like finish_unlogged_build_lock. That's really the only thing it protects: the moment when a relation goes from RELKIND_UNLOGGED_BUILD to RELKIND_PERMANENT.
~~During an unlogged build, I assume most of the writes happen in the process that's building the index. All those writes will also call get_cached_relkind() and acquire/release the lwlock. Could we easily bypass that for the process building the index? Or is it a premature optimization?~~
Never mind, we only call get_cached_relkind() if the SMgrRelation entry doesn't have the relpersistence. In the process building the index, that should be up-to-date.
I think the locking gets more straightforward, if you move the LWLockAcquire/Release calls out of relkind_cache.c, into pagestore_smgr.c:
I want to think more about it tomorrow. But right now I just want to share my concerns.
- I am not sure it is correct to have a gap between releasing the spin lock protecting relkind_hash and obtaining the shared lock. I think it can cause a race, but I will think more about it tomorrow.
- I think that setting the relkind to RELKIND_UNLOGGED_BUILD should be done under the spinlock. Otherwise it can, once again, cause a race.
neon_read doesn't need to care about unlogged builds. I think you mixed it up with neon_write.
I am still not sure that moving the lock to pagestore_smgr.c really makes the logic more straightforward. My intention was to hide all implementation details in relkind_cache. Certainly the API is not that simple, but it still seems better than delegating some locking responsibilities to pagestore_smgr.
I think the locking gets more straightforward, if you move the LWLockAcquire/Release calls out of relkind_cache.c, into pagestore_smgr.c:
It is not correct to allow a gap between releasing the spin lock and acquiring the shared lock in neon_write, because during this gap the backend performing the unlogged build can complete the build, update the status under the exclusive lock, and then remove the relation files. In this case the write to the file will fail.
I addressed all your comments except two. The main one is your suggestion to move the lwlock to pagestore_smgr.c. Why doesn't it work?
- Backend 1 evicts a page from shared buffers and calls neon_write, which in turn calls get_cached_relkind to get the relkind. Assume it returns UNLOGGED_BUILD because an unlogged build is being performed by backend 2. No locks are held at this moment.
- Backend 2 completes the unlogged build. It obtains the exclusive lock and changes the relkind to PERMANENT. Then it releases the lock and removes the relation file on local disk.
- Backend 1 obtains the shared lock and starts writing the file. It may fail because the file was removed by backend 2.
How is this problem handled now?
get_relkind_cache rechecks the relkind after obtaining the shared lwlock. If it is still UNLOGGED_BUILD, then backend 1 may write the file, and the lwlock will protect the files from being removed by backend 2.
Can it be reimplemented in a different way? Certainly it can. But I think it is good to keep all locking logic in one place (relkind_cache) rather than split it between relkind_cache and pagestore_smgr. There are two main aspects which IMHO should be addressed:
- A write to the file should not block concurrent backends, so no exclusive lock is acceptable here.
- As there are two locks (a spin lock to protect the hash table and an lwlock to protect writes to the file), we should be careful to avoid deadlock. Deadlock is currently not possible because we may request the lwlock while holding the spin lock, but not the opposite.
In principle, obtaining the lwlock under the spinlock is not necessary, because we can never try to evict a page of an unlogged-build index before the unlogged build has started. So no race is possible at the beginning of an unlogged build. But I do not want to rely on that.
So I have renamed the lock to finish_unlogged_build_lock as you proposed, but I do not want to change the current locking scheme. I do not see how some alternative implementation could simplify it or improve performance.
Concerning eviction from relkind_cache and relying on HASH_ENTER_NULL: that is less critical. But once again, I do not want to rely on dynahash behaviour here: when does it report hash overflow, and can that affect other hashes? (Looks like not, for a shared hash.) And if we someday decide to switch to a partitioned hash, the behaviour of HASH_ENTER_NULL becomes even more obscure. The current implementation may be a little paranoid and redundant, but it seems more predictable and reliable. And it is the same as in relsize_cache. If we want to change it, it makes sense to create a separate PR changing it in both caches.
I addressed all your comments except two. The main one is your suggestion to move the lwlock to pagestore_smgr.c. Why doesn't it work?
Backend 1 evicts a page from shared buffers and calls neon_write, which in turn calls get_cached_relkind to get the relkind. Assume it returns UNLOGGED_BUILD because an unlogged build is being performed by backend 2. No locks are held at this moment. Backend 2 completes the unlogged build. It obtains the exclusive lock and changes the relkind to PERMANENT. Then it releases the lock. Then it removes the relation file on local disk.
Backend 1 obtains the shared lock and starts writing the file. It may fail because the file was removed by backend 2.
Ok.
How is this problem handled now?
get_relkind_cache rechecks the relkind after obtaining the shared lwlock. If it is still UNLOGGED_BUILD, then backend 1 may write the file, and the lwlock will protect the files from being removed by backend 2.
Yes, that's a fine solution. You can do it that way also if you move the lwlock to pagestore_smgr.c: in neon_write(), in backend 1, after you have acquired the lwlock, recheck that it's still UNLOGGED_BUILD.
But I think that it is good to keep all locking logic in one place (relkind_cache) rather than split it between relkind_cache and pagestore_smgr.
It's really confusing in relkind_cache.c at the moment. I really think it would be more clear to move the lwlock to pagestore_smgr.c.
If you still want to keep it in relkind_cache.c, I'd still suggest refactoring it in a similar fashion. Instead of having one function that pins and sometimes acquires the lwlock, split those into two separate functions: one function to look up the hash table entry and pin it, and another function to acquire the lwlock.
It's really confusing in relkind_cache.c at the moment. I really think it would be more clear to move the lwlock to pagestore_smgr.c.
Ok, moved the lock to pagestore_smgr. I still initialise the lock in relkind_cache.c because pagestore_smgr doesn't have a function where the lock can be allocated (a shmem startup hook). But the lock is controlled by pagestore_smgr using three functions: fs_shared_lock, fs_exclusive_lock and fs_unlock. Maybe not the best names; maybe you can propose something better, but finish_unlogged_build_exclusive_lock seems to be overkill :)
The new approach requires an extra hash lookup but really simplifies the code and does not require handling errors in mdexists (the entry is not pinned). So I agree with you: the original version was hard to understand; the new one seems much more straightforward.
It's really confusing in relkind_cache.c at the moment. I really think it would be more clear to move the lwlock to pagestore_smgr.c.
Ok, moved the lock to pagestore_smgr. I still initialise the lock in relkind_cache.c because pagestore_smgr doesn't have a function where the lock can be allocated (a shmem startup hook).
Hmm, yeah, the init/startup routines are a bit of a mess. I think some cleanup refactoring would be in order, even without this PR.
Current situation:
_PG_init():
    pg_init_libpagestore():
        pagestore_prepare_shmem()
        register shmem_startup_hook = pagestore_shmem_startup_hook
        <define GUCs>
        relsize_hash_init():
            <define GUCs>
            register shmem_startup_hook = neon_smgr_shmem_startup
    lfc_init():
        <define GUCs>
        register shmem_startup_hook = lfc_shmem_startup
    pg_init_walproposer():
        <define GUCs>
        register shmem_startup_hook = nwp_shmem_startup_hook
    init_lwlsncache():
        <define GUCs>
        register shmem_startup_hook = shmeminit
There is some kind of a pattern here. Most subsystems are initialized by calling pg_something_init() function, which defines GUCs and registers a shmem_startup_hook. But the call to relsize_hash_init() is in a weird place. To fit the general pattern, it should be called directly from _PG_init(). And the function naming is a little inconsistent.
I'd suggest cleaning that up in a separate PR first, and then adding a call to pg_relkind_hash_init() in _PG_init() too.
While we're at it, it might be more clear if we didn't have all the different subsystems register their own shmem_startup_hook. Instead, register just one hook function in _PG_init(), and have that function call the shmem-init functions of all the subsystems. The hook registration is a little messy, with the PG_VERSION_NUM check and all. It'd be nice to have that messiness only once.
But the lock is controlled by pagestore_smgr using three functions: fs_shared_lock, fs_exclusive_lock and fs_unlock. Maybe not the best names; maybe you can propose something better, but finish_unlogged_build_exclusive_lock seems to be overkill :)
:-). fs_* is too short though. It makes me think 'filesystem lock'. I'd suggest removing the fs_* wrapper functions and just calling LWLockAcquire(finish_unlogged_build_lock, LW_EXCLUSIVE) directly.
The new approach requires an extra hash lookup but really simplifies the code and does not require handling errors in mdexists (the entry is not pinned). So I agree with you: the original version was hard to understand; the new one seems much more straightforward.
Nice!
I am +100 for having a single shmem_startup_hook and calling each component's startup function from it.
Also I agree that relsize_hash_init() should be called from _PG_init().
Should we do it as separate PR?
Concerning replacing the fs_* wrappers with LWLockAcquire: I also agree that it makes sense. Will do.
I added a commit here, to improve the test:
- Switched to using a GIN index over a bigint array column. That's faster to build than an SP-GiST index, making the test faster.
- Added assertions for some assumptions made in the test
- Reworded comments
Please take a look.