Add cache for relation persistence
Problem
See https://github.com/neondatabase/neon/issues/12073 and https://neondb.slack.com/archives/C04DGM6SMTM/p1748620049660389
There is a race condition in the current unlogged build scheme: in neon_write with smgr_relpersistence == 0, we first check whether the local file exists and, if so, assume an unlogged build is in progress and call mdwrite to perform the local write. But many things can happen between mdexists and mdwritev. For example, another backend can complete the unlogged build and unlink these files.
Summary of changes
Add a cache for relation kind, which avoids mdexists calls and eliminates the race condition at the end of an unlogged build.
If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:
make NEON_WORKDIR=path/to/neon/checkout \
-C goapp/internal/shareddomain/postgres generate
If you're an external contributor, a Neon employee will assist in making sure this step is done.
9130 tests run: 8475 passed, 0 failed, 655 skipped (full report)
Code coverage* (full report)
functions: 34.7% (8842 of 25482 functions)
lines: 45.7% (71646 of 156719 lines)
* collected from Rust tests only
fbd668ee3d0ffa315800367d1ac244173ce847c2 at 2025-07-31T16:31:46.640Z :recycle:
@knizhnik , please pay attention to the new test failures produced for 6bcd46b2: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-12166/15589546198/index.html#/testresult/55c01525c6a0084d
regress/regression.diffs
diff -U3 /__w/neon/neon/vendor/postgres-v15/src/test/regress/expected/gist.out /tmp/test_output/test_pg_regress[release-pg15-v1-4]-1/regress/results/gist.out
--- /__w/neon/neon/vendor/postgres-v15/src/test/regress/expected/gist.out 2025-06-11 15:57:37.168439715 +0000
+++ /tmp/test_output/test_pg_regress[release-pg15-v1-4]-1/regress/results/gist.out 2025-06-11 16:05:34.917725056 +0000
@@ -376,15 +376,8 @@
-- This case isn't supported, but it should at least EXPLAIN correctly.
explain (verbose, costs off)
select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
- QUERY PLAN
-------------------------------------------------------------------------------------
- Limit
- Output: p, ((circle(p, '1'::double precision) <-> '(0,0)'::point))
- -> Index Only Scan using gist_tbl_multi_index on public.gist_tbl
- Output: p, (circle(p, '1'::double precision) <-> '(0,0)'::point)
- Order By: ((circle(gist_tbl.p, '1'::double precision)) <-> '(0,0)'::point)
-(5 rows)
-
+ERROR: lock neon_relkind is not held
+CONTEXT: writing block 0 of relation base/16384/26093
select p from gist_tbl order by circle(p,1) <-> point(0,0) limit 1;
ERROR: lossy distance functions are not supported in index-only scans
-- Clean up
Thank you for reporting the problem. I hope https://github.com/neondatabase/neon/pull/12166/commits/b81baef2d1d7c55326c472deec3e8bfcfe8e02cf will fix it.
Removing myself from review, speaking as a storage person, I don't feel qualified to review this. If there's anything in the design that you want reviewed / cross-checked with storage team, please write a mini-rfc explaining what is being cached here, how invalidation works, etc.
I think the locking gets more straightforward, if you move the LWLockAcquire/Release calls out of relkind_cache.c, into pagestore_smgr.c:
- When starting an unlogged build, look up the entry with get_cached_relkind(), and set the relkind to RELKIND_UNLOGGED_BUILD.
- When ending an unlogged build, acquire the lock, update the relkind in the entry to RELKIND_PERMANENT, release the lock, then release the pin on the entry.
- In neon_read, call get_cached_relkind to look up and pin the entry. If the relkind is RELKIND_UNLOGGED_BUILD, acquire the LWLock too.
Let's rename relkind_lock to something like finish_unlogged_build_lock. That's really the only thing it protects: the moment when a relation goes from RELKIND_UNLOGGED_BUILD to RELKIND_PERMANENT.
~~During an unlogged build, I assume most of the writes happen in the process that's building the index. All those writes will also call get_cached_relkind() and acquire/release the lwlock. Could we easily bypass that for the process building the index? Or is it a premature optimization?~~
Never mind, we only call get_cached_relkind() if the SMgrRelation entry doesn't have the relpersistence. In the process building the index, that should be up-to-date.
I think the locking gets more straightforward, if you move the LWLockAcquire/Release calls out of relkind_cache.c, into pagestore_smgr.c:
I want to think more about it tomorrow. But right now I just want to share my concerns.
- I am not sure it is correct to have a gap between releasing the spin lock protecting relkind_hash and obtaining the shared lock. I think it can cause a race, but I will think more about it tomorrow.
- I think that setting the relkind to RELKIND_UNLOGGED_BUILD should be done under the spinlock. Otherwise it can, once again, cause a race.
neon_read doesn't need to care about unlogged builds. I think you mixed it up with neon_write.
I am still not sure that moving the lock to pagestore_smgr.c really makes the logic more straightforward. My intention was to hide all implementation details in relkind_cache. Certainly the API is not that simple, but it still seems better than delegating some locking responsibilities to pagestore_smgr.
I think the locking gets more straightforward, if you move the LWLockAcquire/Release calls out of relkind_cache.c, into pagestore_smgr.c:
It is not correct to allow a gap between releasing the spin lock and acquiring the shared lock in neon_write, because during this gap the backend performing the unlogged build can complete the build, update the status under the exclusive lock, and then remove the relation files. In this case the write to the file will fail.
I addressed all your comments except two. The main one is your suggestion to move the lwlock to pagestore_smgr.c. Why doesn't it work?
- Backend 1 evicts a page from shared buffers and calls neon_write, which in turn calls get_cached_relkind to get the relkind. Assume it returns UNLOGGED_BUILD because an unlogged build is being performed by backend 2. No locks are held at this moment.
- Backend 2 completes the unlogged build. It obtains the exclusive lock and changes the relkind to PERMANENT. Then it releases the lock and removes the relation file on local disk.
- Backend 1 obtains the shared lock and starts writing the file. It may fail because the file was removed by backend 2.
How is this problem handled now?
get_relkind_cache rechecks the relkind after obtaining the shared lwlock. If it is still UNLOGGED_BUILD, then backend 1 may write the file, and the lwlock will protect the files from being removed by backend 2.
Can it be reimplemented in a different way? Certainly it can. But I think it is good to keep all locking logic in one place (relkind_cache) rather than split it between relkind_cache and pagestore_smgr. There are two main aspects which IMHO should be addressed:
- A write to the file should not block concurrent backends, so no exclusive lock is acceptable here.
- As there are two locks (a spin lock to protect the hash table and an lwlock to protect writes to the file), we should be careful to avoid deadlock. Deadlock is currently not possible because we may request the lwlock while holding the spin lock, but not the opposite.
In principle, obtaining the lwlock under the spinlock is not necessary, because we can never try to evict a page of an unlogged-build index before the unlogged build has started. So no race is possible at the beginning of an unlogged build. But I do not want to rely on that.
So I have renamed the lock to finish_unlogged_build_lock as you proposed, but I do not want to change the current locking scheme. I do not see how some alternative implementation could simplify it or improve performance.
Concerning eviction from relkind_cache and relying on HASH_ENTER_NULL: that is less critical. But once again, I do not want to rely on dynahash behaviour here: when does it report hash overflow, and can that affect other hashes? (Looks like not, for a shared hash.) And if we someday decide to switch to a partitioned hash, the behaviour of HASH_ENTER_NULL becomes even more obscure. The current implementation may be a little paranoid and redundant, but it seems more predictable and reliable. And it is the same as in relsize_cache. If we want to change it, it makes sense to create a separate PR changing it in both caches.
I addressed all your comments except two. The main one is your suggestion to move the lwlock to pagestore_smgr.c. Why doesn't it work?
Backend 1 evicts a page from shared buffers and calls neon_write, which in turn calls get_cached_relkind to get the relkind. Assume it returns UNLOGGED_BUILD because an unlogged build is being performed by backend 2. No locks are held at this moment. Backend 2 completes the unlogged build. It obtains the exclusive lock and changes the relkind to PERMANENT. Then it releases the lock. Then it removes the relation file on local disk.
Backend 1 obtains the shared lock and starts writing the file. It may fail because the file was removed by backend 2.
Ok.
How is this problem handled now?
get_relkind_cache rechecks the relkind after obtaining the shared lwlock. If it is still UNLOGGED_BUILD, then backend 1 may write the file, and the lwlock will protect the files from being removed by backend 2.
Yes, that's a fine solution. You can do it that way also if you move the lwlock to pagestore_smgr.c: in neon_write(), in backend 1, after you have acquired the lwlock, recheck that it's still UNLOGGED_BUILD.
But I think that it is good to keep all locking logic in one place (relkind_cache) rather than split it between relkind_cache and pagestore_smgr.
It's really confusing in relkind_cache.c at the moment. I really think it would be more clear to move the lwlock to pagestore_smgr.c.
If you still want to keep it in relkind_cache.c, I'd still suggest refactoring it in a similar fashion. Instead of having one function that pins and sometimes acquires the lwlock, split those into two separate functions: one function to look up the hash table entry and pin it, and another function to acquire the lwlock.
It's really confusing in relkind_cache.c at the moment. I really think it would be more clear to move the lwlock to pagestore_smgr.c.
Ok, moved the lock to pagestore_smgr. I still initialise the lock in relkind_cache.c because pagestore_smgr doesn't have a function where the lock can be allocated (a shmem startup hook). But the lock is controlled by pagestore_smgr using three functions: fs_shared_lock, fs_exclusive_lock and fs_unlock. Maybe not the best names; maybe you can propose something better, but finish_unlogged_build_exclusive_lock seems to be overkill :)
The new approach requires an extra hash lookup but really simplifies the code and does not require handling errors in mdexists (the entry is not pinned). So I agree with you: the original version was hard to understand; the new one seems much more straightforward.
It's really confusing in relkind_cache.c at the moment. I really think it would be more clear to move the lwlock to pagestore_smgr.c.
Ok, moved the lock to pagestore_smgr. I still initialise the lock in relkind_cache.c because pagestore_smgr doesn't have a function where the lock can be allocated (a shmem startup hook).
Hmm, yeah, the init/startup routines are a bit of a mess. I think some cleanup refactoring would be in order, even without this PR.
Current situation:
_PG_init():
    pg_init_libpagestore():
        pagestore_prepare_shmem()
        register shmem_startup_hook = pagestore_shmem_startup_hook
        <define GUCs>
        relsize_hash_init():
            <define GUCs>
            register shmem_startup_hook = neon_smgr_shmem_startup
    lfc_init():
        <define GUCs>
        register shmem_startup_hook = lfc_shmem_startup
    pg_init_walproposer():
        <define GUCs>
        register shmem_startup_hook = nwp_shmem_startup_hook
    init_lwlsncache():
        <define GUCs>
        register shmem_startup_hook = shmeminit
There is some kind of a pattern here. Most subsystems are initialized by calling pg_something_init() function, which defines GUCs and registers a shmem_startup_hook. But the call to relsize_hash_init() is in a weird place. To fit the general pattern, it should be called directly from _PG_init(). And the function naming is a little inconsistent.
I'd suggest cleaning that up in a separate PR first, and then adding a call to pg_relkind_hash_init() in _PG_init() too.
While we're at it, it might be more clear if we didn't have all the different subsystems register their own shmem_startup_hook. Instead, register just one hook function in _PG_init(), and have that function call the shmem-init functions of all the subsystems. The hook registration is a little messy, with the PG_VERSION_NUM check and all. It'd be nice to have that messiness only once.
But the lock is controlled by pagestore_smgr using three functions: fs_shared_lock, fs_exclusive_lock and fs_unlock. Maybe not the best names; maybe you can propose something better, but finish_unlogged_build_exclusive_lock seems to be overkill :)
:-). fs_* is too short though. It makes me think 'filesystem lock'. I'd suggest removing the fs_* wrapper functions and just calling LWLockAcquire(finish_unlogged_build_lock, LW_EXCLUSIVE) directly.
The new approach requires an extra hash lookup but really simplifies the code and does not require handling errors in mdexists (the entry is not pinned). So I agree with you: the original version was hard to understand; the new one seems much more straightforward.
Nice!
I am +100 for having a single shmem_startup_hook and calling each component's startup function from it.
Also I agree that relsize_hash_init() should be called from _PG_init().
Should we do it as separate PR?
Concerning replacing the fs_* wrappers with LWLockAcquire: I also agree that it makes sense. Will do.
I added a commit here, to improve the test:
- Switched to using a GIN index over a bigint array column. That's faster to build than an SP-GiST index, making the test faster.
- Added assertions for some assumptions made in the test
- Reworded comments
Please take a look.