Make citus_create_restore_point MX-safe by blocking 2PC commit decisions
DESCRIPTION: Make citus_create_restore_point MX-safe by blocking 2PC commit decisions
Problem:
In coordinator-only mode, citus_create_restore_point() creates consistent restore points by blocking distributed writes at the coordinator level, which is safe because all distributed transactions are coordinated through the coordinator.
However, in MX mode (multi-writer), any worker with metadata can initiate distributed transactions. The existing implementation only blocks writes at the coordinator, so metadata workers can keep making 2PC commit decisions. This can leave the cluster inconsistent: restore points taken on different nodes capture different sets of committed transactions.
Solution:
Block distributed transaction commit decisions cluster-wide by acquiring ExclusiveLock on pg_dist_transaction on all metadata nodes (coordinator and MX workers). Additionally, on the coordinator only, lock pg_dist_node and pg_dist_partition to prevent topology and schema changes.
This selective locking strategy is based on the MX mode architecture:
- DDL operations (topology changes, table creation) can ONLY be executed through the coordinator node, even in MX mode
- MX workers can only initiate distributed DML transactions (INSERT/UPDATE/DELETE) that use 2PC
- Therefore, locking pg_dist_transaction on remote metadata nodes is sufficient to block all distributed writes they can perform, while the coordinator's additional locks on pg_dist_node and pg_dist_partition provide cluster-wide protection against DDL changes
The implementation:
- Opens connections to all nodes (metadata and non-metadata workers)
- Begins coordinated transactions on all remote connections
- Acquires ExclusiveLock on pg_dist_node, pg_dist_partition, and pg_dist_transaction locally on the coordinator via LockRelationOid()
- Acquires ExclusiveLock on pg_dist_transaction on all remote metadata nodes via a SQL LOCK TABLE command (executed in parallel)
- Creates restore points on all nodes in parallel (both metadata and non-metadata nodes need WAL restore points)
- Closes remote connections, which releases locks via implicit ROLLBACK
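Condensed into code, the locking step looks roughly like the sketch below. LockRelationOid() and the Dist*RelationId() catalog accessors are the PostgreSQL/Citus calls named above; the function name and the ExecuteRemoteCommandInConnectionList() helper are illustrative assumptions, and the connection/transaction management from the earlier steps is elided.

```c
#include "postgres.h"
#include "storage/lmgr.h"
#include "distributed/metadata_cache.h"
#include "distributed/remote_commands.h"

/* Hypothetical condensation of the locking step described above. */
static void
BlockDistributedCommitDecisions(List *metadataConnectionList)
{
	/* coordinator-only: freeze topology and schema as well as 2PC records */
	LockRelationOid(DistNodeRelationId(), ExclusiveLock);
	LockRelationOid(DistPartitionRelationId(), ExclusiveLock);
	LockRelationOid(DistTransactionRelationId(), ExclusiveLock);

	/*
	 * Remote metadata nodes only need pg_dist_transaction locked, since DDL
	 * cannot originate there. The command runs inside the coordinated remote
	 * transactions opened earlier, so closing those connections releases the
	 * locks via the implicit ROLLBACK.
	 */
	ExecuteRemoteCommandInConnectionList(metadataConnectionList,
										 "LOCK TABLE pg_catalog.pg_dist_transaction "
										 "IN EXCLUSIVE MODE");
}
```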
Key Insight - Why No Transaction Drainage Is Needed:
The commit decision in Citus 2PC occurs when LogTransactionRecord() writes to pg_dist_transaction (using RowExclusiveLock for the insert), which happens BEFORE the writer's local commit (in the PRE_COMMIT callback).
By holding ExclusiveLock on pg_dist_transaction:
- Transactions that have already recorded their commit decision (already inserted their row) will complete normally
- Transactions that haven't recorded their commit decision yet will block on the ExclusiveLock (which conflicts with the RowExclusiveLock needed for inserts), preventing them from proceeding
This creates a clean cut point for consistency without requiring us to drain in-flight transactions. The restore point captures the exact state of committed transactions across the cluster.
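To make the lock conflict concrete, the commit-record insert has roughly the shape below. This is a sketch, not Citus's actual LogTransactionRecord(), and the two-column layout of pg_dist_transaction is simplified here:

```c
#include "postgres.h"
#include "access/htup_details.h"
#include "access/table.h"
#include "catalog/indexing.h"
#include "utils/builtins.h"
#include "utils/rel.h"
#include "distributed/metadata_cache.h"

/* Sketch: how a writer records its 2PC commit decision. */
static void
RecordCommitDecision(int32 groupId, const char *gid)
{
	/*
	 * RowExclusiveLock conflicts with ExclusiveLock, so this table_open()
	 * blocks for as long as citus_create_restore_point() holds
	 * pg_dist_transaction, which is exactly the cut point we want.
	 */
	Relation pgDistTransaction = table_open(DistTransactionRelationId(),
											RowExclusiveLock);

	Datum values[2] = { Int32GetDatum(groupId), CStringGetTextDatum(gid) };
	bool isNulls[2] = { false, false };
	HeapTuple heapTuple = heap_form_tuple(RelationGetDescr(pgDistTransaction),
										  values, isNulls);
	CatalogTupleInsert(pgDistTransaction, heapTuple);

	table_close(pgDistTransaction, NoLock);
}
```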
Recovery Correctness:
The maintenance daemon's recovery logic relies on the presence of pg_dist_transaction records to determine whether to COMMIT PREPARED or ROLLBACK PREPARED. Our blocking ensures that:
- Prepared transactions WITH commit records will be committed on recovery
- Prepared transactions WITHOUT commit records will be rolled back on recovery
Since we create restore points while holding these locks, all nodes capture the same set of commit decisions, ensuring cluster-wide consistency.
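The resulting recovery rule is small enough to state in code. A sketch with a hypothetical helper name; the real logic lives in Citus's transaction recovery path:

```c
#include "postgres.h"
#include "utils/psprintf.h"

/*
 * Sketch: a prepared transaction is committed iff its commit decision
 * reached pg_dist_transaction. Because the restore point is created while
 * the ExclusiveLock is held, every node replays to the same recorded set.
 */
static const char *
RecoveryCommandForPreparedTransaction(const char *gid, bool commitRecordExists)
{
	return commitRecordExists
		   ? psprintf("COMMIT PREPARED '%s'", gid)
		   : psprintf("ROLLBACK PREPARED '%s'", gid);
}
```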
Backward Compatibility:
- Return type unchanged: still returns coordinator LSN (pg_lsn)
- Coordinator-only mode: unchanged behavior
- MX mode: the stronger locking is applied automatically when metadata workers are present (transparent to callers)
- No SQL function signature changes required
Regression Test Cases:
TODO
Codecov Report
Patch coverage is 69.23077% with 12 lines in your changes missing coverage. Please review.
Project coverage is 88.94%. Comparing base (662b724) to head (0d32050).
Your patch check has failed because the patch coverage (69.23%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@             Coverage Diff              @@
##             main    #8352      +/-   ##
==========================================
- Coverage   88.95%   88.94%   -0.02%
==========================================
  Files         287      287
  Lines       63151    63186      +35
  Branches     7942     7944       +2
==========================================
+ Hits        56176    56199      +23
- Misses       4667     4672       +5
- Partials     2308     2315       +7