
Fix sync worker hanging on Extension wait event after table resynchronization

Open kmsarabu opened this issue 3 months ago • 4 comments

Problem

After performing pglogical.alter_subscription_resynchronize_table operations, sync workers get stuck waiting on "Extension" wait events indefinitely. The table status shows as catchup(u) but the sync worker remains idle until a dummy transaction is performed on the source database.

Symptoms

  • Sync worker shows "Extension" wait event in pg_stat_activity
  • Table status remains in catchup state despite successful data synchronization
  • Worker only progresses when new WAL activity occurs on source database (e.g., dummy/heartbeat transactions)
  • No error messages, just indefinite waiting

Cause

The issue appears due to a race condition in the sync completion logic:

  1. When a sync worker completes its initial data copy and moves to SYNC_STATUS_SYNCWAIT ('w'), the apply worker in process_syncing_tables() sets the status to SYNC_STATUS_CATCHUP ('u') and updates the replay_stop_lsn().
  2. The sync worker then enters apply_work() to catch up to the target LSN. However, the transition from SYNC_STATUS_CATCHUP ('u') to SYNC_STATUS_SYNCDONE ('y') only happens in the apply worker's handle_commit() function when:
    • The sync worker reaches its replay_stop_lsn
    • AND a commit message is processed
  3. When the publisher database is idle: if the sync worker has already caught up to the target LSN but no new transactions (and therefore no commit messages) arrive from the source database, the status remains 'u' and never transitions to 'y'.

The problem occurs because the sync worker relies on receiving WAL messages to trigger completion checks, but if the source database has no activity, no messages are sent, leaving the worker stuck.

Changes Made

  1. Added completion check in apply_work(): the sync worker now checks whether its current LSN has reached or exceeded the target replay_stop_lsn.

  2. Clean exit path: when completion is detected, the worker updates the table status to SYNC_STATUS_SYNCDONE and exits cleanly.

kmsarabu avatar Sep 18 '25 06:09 kmsarabu

Thanks @nmisch

Why 5 seconds as the check interval?

  • The 5-second interval balances responsiveness with performance overhead.
  • Issue reproduced in regression test sql/add_table.sql; this fix also resolves bug #497.
  • Modified regression script sql/add_table.sql to include sync_status info.

Regression test sql/add_table.sql without this fix:

not ok 11    - add_table                              109352 ms
# 1 of 29 tests failed.
# The differences that caused some tests to fail can be viewed in the file "/home/postgres/work/pglogical/regression_output/regression.diffs".

$ cat /home/postgres/work/pglogical/regression_output/regression.diffs
diff -U3 /home/postgres/work/pglogical/expected/add_table.out /home/postgres/work/pglogical/regression_output/results/add_table.out
--- /home/postgres/work/pglogical/expected/add_table.out        2025-09-26 08:22:51.343316055 +0000
+++ /home/postgres/work/pglogical/regression_output/results/add_table.out       2025-09-26 08:26:02.544659300 +0000
@@ -257,17 +257,13 @@
 BEGIN;
 SET statement_timeout = '60s';
 SELECT pglogical.wait_for_table_sync_complete('test_subscription', 'test_publicschema');
- wait_for_table_sync_complete
-------------------------------
-
-(1 row)
-
+ERROR:  canceling statement due to statement timeout
 COMMIT;
 SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status, sync_status IN ('y', 'r') FROM pglogical.local_sync_status ORDER BY 2,3,4;
  sync_kind | sync_subid |   sync_nspname    |    sync_relname    | sync_status | ?column?
 -----------+------------+-------------------+--------------------+-------------+----------
  d         | 3848008564 | public            | test_nosync        | r           | t
- d         | 3848008564 | public            | test_publicschema  | r           | t
+ d         | 3848008564 | public            | test_publicschema  | u           | f
  d         | 3848008564 | strange.schema-IS | test_strangeschema | r           | t
  f         | 3848008564 |                   |                    | r           | t
 (4 rows)
@@ -286,8 +282,8 @@
 -[ RECORD 1 ]---------------
 nspname  | public
 relname  | test_publicschema
-status   | replicating
-?column? | t
+status   | catchup
+?column? | f

 \x
 BEGIN;

Regression test sql/add_table.sql with this fix:

# using temp instance on port 61705 with PID 1132873
ok 1         - preseed                                    46 ms
ok 2         - infofuncs                                  43 ms
ok 3         - init_fail                                  62 ms
ok 4         - init                                      255 ms
ok 5         - preseed_check                              15 ms
ok 6         - basic                                    8074 ms
ok 7         - extended                                44429 ms
ok 8         - conflict_secondary_unique                  21 ms
ok 9         - toasted                                  3030 ms
ok 10        - replication_set                          1028 ms
ok 11        - add_table                               37830 ms
ok 12        - matview                                  4040 ms
ok 13        - bidirectional                            4256 ms
ok 14        - primary_key                             24270 ms
ok 15        - interfaces                               1224 ms
ok 16        - foreign_key                              2022 ms
ok 17        - functions                               11211 ms
ok 18        - copy                                     1021 ms
ok 19        - sequence                                   14 ms
ok 20        - triggers                                 5056 ms
ok 21        - parallel                                 3251 ms
ok 22        - row_filter                              14152 ms
ok 23        - row_filter_sampling                      2021 ms
ok 24        - att_list                                 8093 ms
ok 25        - column_filter                            3040 ms
ok 26        - apply_delay                              7253 ms
ok 27        - multiple_upstreams                       3266 ms
ok 28        - node_origin_cascade                      3273 ms
ok 29        - drop                                     1040 ms
1..29
# All 29 tests passed.

kmsarabu avatar Sep 26 '25 09:09 kmsarabu

At 9b36db7 (first version of pull request) add_table still timed out for me, but at bb85291 and 382d9cb it passes. It's great to have a fix for that.

I see you've made some edits today, removing the 5s interval. I was okay having the interval, and I may be okay having no delay interval. (My initial reaction was that 5s was on the high side, but I haven't formed an opinion on no-delay as an alternative.) How are you thinking about that decision now?

The patch looks good conceptually, so I'll just need to complete a detailed review. Is it ready for that, or do you have more changes planned? If you need to edit it again, please remove the trailing whitespace. git diff --check REL2_x_STABLE can confirm successful removal. Another cosmetic point: a comment still says "Periodic completion check", but the 5s period is gone. If you don't otherwise need to edit the pull request again, I can handle those two cosmetic points.

Your test changes have the test require sync_status='r' in places where it used to accept 'y' or 'r'. I wondered if that would make tests flaky, so I tried to break that by adding sleeps just before the code sets sync_status='r'. That didn't stop the tests from getting 'r', so I'm not going to worry about that further. If the tests prove flaky, we can fix that later. Here's how I checked with sleeps:

diff --git a/pglogical_apply.c b/pglogical_apply.c
index 81a4de4..1588520 100644
--- a/pglogical_apply.c
+++ b/pglogical_apply.c
@@ -1829,2 +1829,3 @@ process_syncing_tables(XLogRecPtr end_lsn)
 			{
+				nmisch_sleep(10, "table SYNC_STATUS_READY normal");
 				sync->status = SYNC_STATUS_READY;
diff --git a/pglogical_sync.c b/pglogical_sync.c
index c7ee4d4..520cd70 100644
--- a/pglogical_sync.c
+++ b/pglogical_sync.c
@@ -42,2 +42,4 @@
 
+#include "libpq/pqsignal.h"
+
 #include "utils/memutils.h"
@@ -96,2 +98,16 @@ static int exec_cmd_win32(const char *cmd, char *cmdargv[]);
 
+void
+nmisch_sleep(int secs, const char *msg)
+{
+	elog(LOG, "nmisch sleeping before %s", msg);
+#ifdef PG_SETMASK
+	PG_SETMASK(&BlockSig);
+	pg_usleep(secs * 1000000);
+	PG_SETMASK(&UnBlockSig);
+#else
+	sigprocmask(SIG_SETMASK, &BlockSig, NULL);
+	pg_usleep(secs * 1000000);
+	sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+#endif
+}
 
@@ -991,2 +1007,4 @@ pglogical_sync_subscription(PGLogicalSubscription *sub)
 						{
+							nmisch_sleep(10,
+										 "table SYNC_STATUS_READY oldsync");
 							set_table_sync_status(sub->id, remoterel->nspname,
@@ -1000,2 +1018,4 @@ pglogical_sync_subscription(PGLogicalSubscription *sub)
 
+							nmisch_sleep(10, "SYNC_KIND_FULL SYNC_STATUS_READY create");
+
 							newsync.kind = SYNC_KIND_FULL;
@@ -1044,2 +1064,3 @@ pglogical_sync_subscription(PGLogicalSubscription *sub)
 		/* Nothing to do here yet. */
+		nmisch_sleep(10, "SYNC_KIND_FULL SYNC_STATUS_READY update");
 		status = SYNC_STATUS_READY;
diff --git a/pglogical_sync.h b/pglogical_sync.h
index 51e9cd9..aa7668e 100644
--- a/pglogical_sync.h
+++ b/pglogical_sync.h
@@ -52,2 +52,4 @@ typedef struct PGLogicalSyncStatus
 
+extern void nmisch_sleep(int secs, const char *msg);
+
 extern void pglogical_sync_worker_finish(void);

nmisch avatar Sep 27 '25 21:09 nmisch

Thanks @nmisch for reviewing this PR.

  1. The interval check was originally added as a safety measure, but in practice it proved unnecessary. The sync worker only entered the stuck state after resynchronization had already completed, so an additional periodic check was redundant. Removing it simplifies the loop without changing behavior or safety.

  2. Thanks again, and I appreciate the review! The patch is ready for the detailed review. I’ve updated the PR to remove the trailing whitespace and fix the stale comment.

kmsarabu avatar Oct 19 '25 21:10 kmsarabu

Thanks. This is still in my queue. My schedule is unusually packed right now, so I likely won't be able to do a proper review for another few months, unfortunately. I won't forget it.

nmisch avatar Nov 18 '25 18:11 nmisch