Fix sync worker hanging on Extension wait event after table resynchronization
## Problem

After performing `pglogical.alter_subscription_resynchronize_table` operations, sync workers get stuck waiting on the "Extension" wait event indefinitely. The table status shows as `catchup` ('u'), but the sync worker remains idle until a dummy transaction is performed on the source database.
## Symptoms

- Sync worker shows the "Extension" wait event in `pg_stat_activity`
- Table status remains in `catchup` state despite successful data synchronization
- Worker only progresses when new WAL activity occurs on the source database (e.g., dummy/heartbeat transactions)
- No error messages, just indefinite waiting
## Cause

The issue is caused by a race condition in the sync completion logic:

- When a sync worker completes its initial data copy and moves to `SYNC_STATUS_SYNCWAIT` ('w'), the apply worker in `process_syncing_tables()` sets the status to `SYNC_STATUS_CATCHUP` ('u') and updates `replay_stop_lsn`.
- The sync worker then enters `apply_work()` to catch up to the target LSN. However, the transition from `SYNC_STATUS_CATCHUP` ('u') to `SYNC_STATUS_SYNCDONE` ('y') happens only in the apply worker's `handle_commit()` function, and only when the sync worker has reached its `replay_stop_lsn` AND a commit message is processed.
- When the publisher database is idle: if the sync worker has already caught up to the target LSN but no new transactions (commits) arrive from the source database, the status remains 'u' and never transitions to 'y'.

The problem occurs because the sync worker relies on receiving WAL messages to trigger completion checks; if the source database has no activity, no messages are sent, leaving the worker stuck.
## Changes Made

- Added completion check in `apply_work()`: sync workers check whether their current LSN has reached or exceeded the target `replay_stop_lsn`
- Clean exit path: when completion is detected, the worker updates the table status to `SYNC_STATUS_SYNCDONE` and exits cleanly
Thanks @nmisch
## Why 5 seconds as the check interval?

- The 5-second interval balances responsiveness with performance overhead.
- Issue reproduced in regression test `sql/add_table.sql`; this fix also resolves bug #497.
- Modified regression script `sql/add_table.sql` to include `sync_status` info.
Regression test `sql/add_table.sql` without this fix:

```
not ok 11 - add_table 109352 ms
# 1 of 29 tests failed.
# The differences that caused some tests to fail can be viewed in the file "/home/postgres/work/pglogical/regression_output/regression.diffs".
$ cat /home/postgres/work/pglogical/regression_output/regression.diffs
diff -U3 /home/postgres/work/pglogical/expected/add_table.out /home/postgres/work/pglogical/regression_output/results/add_table.out
--- /home/postgres/work/pglogical/expected/add_table.out 2025-09-26 08:22:51.343316055 +0000
+++ /home/postgres/work/pglogical/regression_output/results/add_table.out 2025-09-26 08:26:02.544659300 +0000
@@ -257,17 +257,13 @@
 BEGIN;
 SET statement_timeout = '60s';
 SELECT pglogical.wait_for_table_sync_complete('test_subscription', 'test_publicschema');
- wait_for_table_sync_complete
-------------------------------
-
-(1 row)
-
+ERROR: canceling statement due to statement timeout
 COMMIT;
 SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status, sync_status IN ('y', 'r') FROM pglogical.local_sync_status ORDER BY 2,3,4;
  sync_kind | sync_subid | sync_nspname | sync_relname | sync_status | ?column?
 -----------+------------+-------------------+--------------------+-------------+----------
  d | 3848008564 | public | test_nosync | r | t
- d | 3848008564 | public | test_publicschema | r | t
+ d | 3848008564 | public | test_publicschema | u | f
  d | 3848008564 | strange.schema-IS | test_strangeschema | r | t
  f | 3848008564 | | | r | t
 (4 rows)
@@ -286,8 +282,8 @@
 -[ RECORD 1 ]---------------
 nspname | public
 relname | test_publicschema
-status | replicating
-?column? | t
+status | catchup
+?column? | f
 \x
 BEGIN;
```
Regression test `sql/add_table.sql` with this fix:

```
# using temp instance on port 61705 with PID 1132873
ok 1 - preseed 46 ms
ok 2 - infofuncs 43 ms
ok 3 - init_fail 62 ms
ok 4 - init 255 ms
ok 5 - preseed_check 15 ms
ok 6 - basic 8074 ms
ok 7 - extended 44429 ms
ok 8 - conflict_secondary_unique 21 ms
ok 9 - toasted 3030 ms
ok 10 - replication_set 1028 ms
ok 11 - add_table 37830 ms
ok 12 - matview 4040 ms
ok 13 - bidirectional 4256 ms
ok 14 - primary_key 24270 ms
ok 15 - interfaces 1224 ms
ok 16 - foreign_key 2022 ms
ok 17 - functions 11211 ms
ok 18 - copy 1021 ms
ok 19 - sequence 14 ms
ok 20 - triggers 5056 ms
ok 21 - parallel 3251 ms
ok 22 - row_filter 14152 ms
ok 23 - row_filter_sampling 2021 ms
ok 24 - att_list 8093 ms
ok 25 - column_filter 3040 ms
ok 26 - apply_delay 7253 ms
ok 27 - multiple_upstreams 3266 ms
ok 28 - node_origin_cascade 3273 ms
ok 29 - drop 1040 ms
1..29
# All 29 tests passed.
```
At `9b36db7` (first version of the pull request) `add_table` still timed out for me, but at `bb85291` and `382d9cb` it passes. It's great to have a fix for that.
I see you've made some edits today, removing the 5s interval. I was okay having the interval, and I may be okay having no delay interval. (My initial reaction was that 5s was on the high side, but I haven't formed an opinion on no-delay as an alternative.) How are you thinking about that decision now?
The patch looks good conceptually, so I'll just need to complete a detailed review. Is it ready for that, or do you have more changes planned? If you need to edit it again, please remove the trailing whitespace. `git diff --check REL2_x_STABLE` can confirm successful removal. Another cosmetic point: a comment still says "Periodic completion check", but the 5s period is gone. If you don't otherwise need to edit the pull request again, I can handle those two cosmetic points.
Your test changes have the test require `sync_status='r'` in places where it used to accept `'y'` or `'r'`. I wondered if that would make tests flaky, so I tried to break that by adding sleeps just before the code sets `sync_status='r'`. That didn't stop the tests from getting `'r'`, so I'm not going to worry about that further. If the tests prove flaky, we can fix that later. Here's how I checked with sleeps:
```diff
diff --git a/pglogical_apply.c b/pglogical_apply.c
index 81a4de4..1588520 100644
--- a/pglogical_apply.c
+++ b/pglogical_apply.c
@@ -1829,2 +1829,3 @@ process_syncing_tables(XLogRecPtr end_lsn)
 {
+ nmisch_sleep(10, "table SYNC_STATUS_READY normal");
 sync->status = SYNC_STATUS_READY;
diff --git a/pglogical_sync.c b/pglogical_sync.c
index c7ee4d4..520cd70 100644
--- a/pglogical_sync.c
+++ b/pglogical_sync.c
@@ -42,2 +42,4 @@
+#include "libpq/pqsignal.h"
+
 #include "utils/memutils.h"
@@ -96,2 +98,16 @@ static int exec_cmd_win32(const char *cmd, char *cmdargv[]);
+void
+nmisch_sleep(int secs, const char *msg)
+{
+ elog(LOG, "nmisch sleeping before %s", msg);
+#ifdef PG_SETMASK
+ PG_SETMASK(&BlockSig);
+ pg_usleep(secs * 1000000);
+ PG_SETMASK(&UnBlockSig);
+#else
+ sigprocmask(SIG_SETMASK, &BlockSig, NULL);
+ pg_usleep(secs * 1000000);
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+#endif
+}
@@ -991,2 +1007,4 @@ pglogical_sync_subscription(PGLogicalSubscription *sub)
 {
+ nmisch_sleep(10,
+     "table SYNC_STATUS_READY oldsync");
 set_table_sync_status(sub->id, remoterel->nspname,
@@ -1000,2 +1018,4 @@ pglogical_sync_subscription(PGLogicalSubscription *sub)
+ nmisch_sleep(10, "SYNC_KIND_FULL SYNC_STATUS_READY create");
+
 newsync.kind = SYNC_KIND_FULL;
@@ -1044,2 +1064,3 @@ pglogical_sync_subscription(PGLogicalSubscription *sub)
 /* Nothing to do here yet. */
+ nmisch_sleep(10, "SYNC_KIND_FULL SYNC_STATUS_READY update");
 status = SYNC_STATUS_READY;
diff --git a/pglogical_sync.h b/pglogical_sync.h
index 51e9cd9..aa7668e 100644
--- a/pglogical_sync.h
+++ b/pglogical_sync.h
@@ -52,2 +52,4 @@ typedef struct PGLogicalSyncStatus
+extern void nmisch_sleep(int secs, const char *msg);
+
 extern void pglogical_sync_worker_finish(void);
```
Thanks @nmisch for reviewing this PR.

- The interval check was originally added as a safety measure, but in practice it proved unnecessary. The sync worker only entered the stuck state after resynchronization had already completed, so an additional periodic check was redundant. Removing it simplifies the loop without changing behavior or safety.
- Thanks again, and I appreciate the review! The patch is ready for the detailed review. I've fixed the PR, removed the trailing whitespace, and updated the comments.
Thanks. This is still in my queue. My schedule is unusually packed right now, so I likely won't be able to do a proper review for another few months, unfortunately. I won't forget it.