pglogical icon indicating copy to clipboard operation
pglogical copied to clipboard

resolve additional deadlock during drop database :: pglogical_apply.c :: CHECK_FOR_INTERRUPTS()

Open Andrei-Pozolotin opened this issue 2 years ago • 8 comments

this was tested on top of: PostgreSQL-15.3 + pglogical-2.4.3

resolve additional deadlock during drop database: fix #418 fix #399

solution for the deadlock in pglogical_apply_main() is identical to the solution for the deadlock in pglogical_supervisor_main(): https://github.com/2ndQuadrant/pglogical/pull/403

Andrei-Pozolotin avatar Jun 02 '23 20:06 Andrei-Pozolotin

note: there are still few more "deadlock candidates" in the code, for example, search for:

while (!got_SIGTERM)

and review (loops are not using CHECK_FOR_INTERRUPTS()):

pglogical_sync.c

pglogical_sync_worker_cleanup(PGLogicalSubscription *sub)
wait_for_sync_status_change(Oid subid, const char *nspname, const char *relname,

Andrei-Pozolotin avatar Jun 02 '23 20:06 Andrei-Pozolotin

@eulerto please review

Andrei-Pozolotin avatar Jun 02 '23 20:06 Andrei-Pozolotin

I've reviewed this. Note that the linked issues say this began with PostgreSQL 15 commit postgres/postgres@4eb2176318d, but it was Windows-only until postgres/postgres@e2f65f4255. I do recommend merging this PR as-is, but it would be even better with these improvements:

  1. Changing pglogical_sync_worker_cleanup and wait_for_sync_status_change the same way, even though they're not expected to loop indefinitely like apply_work

  2. Normal order of operations is: WaitLatch ResetLatch check WL_POSTMASTER_DEATH CHECK_FOR_INTERRUPTS() normal loop work

    See https://github.com/postgres/postgres/blob/b1a8dc846da4d96d903dcb5733f68a1e02d82a23/src/include/storage/latch.h#L41 for some background and https://github.com/postgres/postgres/blob/d392e9bdea957964e1fa6a5481e5adb5904d759a/src/test/modules/worker_spi/worker_spi.c#L238 for an example. This PR is instead putting CHECK_FOR_INTERRUPTS() just before WaitLatch. This is largely harmless, and order does vary within pglogical code. Still, I recommend moving the call to just after the WL_POSTMASTER_DEATH check.

  3. Add test coverage, perhaps just:

    --- a/sql/parallel.sql
    +++ b/sql/parallel.sql
    @@ -34,6 +34,10 @@ SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status IN ('y', '
    
     SELECT * FROM pglogical.show_subscription_status();
    
    +-- apply worker shall not block DROP of a different database
    +CREATE DATABASE to_be_dropped;
    +DROP DATABASE to_be_dropped;
    +
     -- Make sure we see the slot and active connection
     \c :provider_dsn
     SELECT plugin, slot_type, database, active FROM pg_replication_slots;
    
  4. Change leading whitespace to match surrounding lines (spaces vs. tabs).

nmisch avatar Oct 02 '23 16:10 nmisch

This achieves a lot for a one-line change. @eulerto (since no other person has merged a pglogical commit in the last eight months), would you merge this? If not, what would you like to see first?

nmisch avatar Jan 24 '24 23:01 nmisch

@petere I see you are the other of the two people who have committed to pglogical in the last 2y. This pull request achieves a lot for a one-line change. Would you merge this? If not, what would you like to see first?

nmisch avatar Mar 03 '24 18:03 nmisch

@eulerto @petere @mwanner This is just adding a CHECK_FOR_INTERRUPTS() so DROP DATABASE does not hang. For 11 months, the pull request has been waiting attention from someone with repository write access. Would you grant me write access so I can take responsibility for merging this? If not, how should we get this process-hang bug resolved?

nmisch avatar May 11 '24 15:05 nmisch

@nmisch I will check all pending PRs (including this one) in the next few days.

eulerto avatar May 22 '24 13:05 eulerto

@eulerto This is still just adding one CHECK_FOR_INTERRUPTS(), and it's been waiting 14 months for someone with repository write access. Adding a CHECK_FOR_INTERRUPTS() is one of the safest changes doable in PostgreSQL code. If merging this creates the need for a followup fix, you can count on me to write that followup fix. What is one obstacle to merging it over the next month?

nmisch avatar Aug 23 '24 17:08 nmisch