resolve additional deadlock during drop database :: pglogical_apply.c :: CHECK_FOR_INTERRUPTS()
this was tested on top of: PostgreSQL-15.3 + pglogical-2.4.3
resolve additional deadlock during drop database:
fix #418
fix #399
solution for the deadlock in pglogical_apply_main()
is identical to the solution for the deadlock in pglogical_supervisor_main():
https://github.com/2ndQuadrant/pglogical/pull/403
note: there are still few more "deadlock candidates" in the code, for example, search for:
while (!got_SIGTERM)
and review (loops are not using CHECK_FOR_INTERRUPTS()):
pglogical_sync.c
pglogical_sync_worker_cleanup(PGLogicalSubscription *sub)
wait_for_sync_status_change(Oid subid, const char *nspname, const char *relname,
@eulerto please review
I've reviewed this. Note that the linked issues say this began with PostgreSQL 15 commit postgres/postgres@4eb2176318d, but it was Windows-only until postgres/postgres@e2f65f4255. I do recommend merging this PR as-is, but it would be even better with these improvements:
-
Changing
pglogical_sync_worker_cleanupandwait_for_sync_status_changethe same way, even though they're not expected to loop indefinitely likeapply_work -
Normal order of operations is: WaitLatch ResetLatch check WL_POSTMASTER_DEATH CHECK_FOR_INTERRUPTS() normal loop work
See https://github.com/postgres/postgres/blob/b1a8dc846da4d96d903dcb5733f68a1e02d82a23/src/include/storage/latch.h#L41 for some background and https://github.com/postgres/postgres/blob/d392e9bdea957964e1fa6a5481e5adb5904d759a/src/test/modules/worker_spi/worker_spi.c#L238 for an example. This PR is instead putting CHECK_FOR_INTERRUPTS() just before WaitLatch. This is largely harmless, and order does vary within pglogical code. Still, I recommend moving the call to just after the WL_POSTMASTER_DEATH check.
-
Add test coverage, perhaps just:
--- a/sql/parallel.sql +++ b/sql/parallel.sql @@ -34,6 +34,10 @@ SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status IN ('y', ' SELECT * FROM pglogical.show_subscription_status(); +-- apply worker shall not block DROP of a different database +CREATE DATABASE to_be_dropped; +DROP DATABASE to_be_dropped; + -- Make sure we see the slot and active connection \c :provider_dsn SELECT plugin, slot_type, database, active FROM pg_replication_slots; -
Change leading whitespace to match surrounding lines (spaces vs. tabs).
This achieves a lot for a one-line change. @eulerto (since no other person has merged a pglogical commit in the last eight months), would you merge this? If not, what would you like to see first?
@petere I see you are the other of the two people who have committed to pglogical in the last 2y. This pull request achieves a lot for a one-line change. Would you merge this? If not, what would you like to see first?
@eulerto @petere @mwanner This is just adding a CHECK_FOR_INTERRUPTS() so DROP DATABASE does not hang. For 11 months, the pull request has been waiting attention from someone with repository write access. Would you grant me write access so I can take responsibility for merging this? If not, how should we get this process-hang bug resolved?
@nmisch I will check all pending PRs (including this one) in the next few days.
@eulerto This is still just adding one CHECK_FOR_INTERRUPTS(), and it's been waiting 14 months for someone with repository write access. Adding a CHECK_FOR_INTERRUPTS() is one of the safest changes doable in PostgreSQL code. If merging this creates the need for a followup fix, you can count on me to write that followup fix. What is one obstacle to merging it over the next month?