Ensure a given Electric instance holds the lock to operate on publication manager
Versions
- Electric: a057f9c7dc07743c8be687ba5e45ce17fb9172db
Bug description
At any point our lock connection might get lost, due to network partitions or anything else, and another Electric B that is waiting (e.g. in a rolling deploy) can immediately grab the lock and start operating on the slot and publication.
After Electric A loses its lock connection, and until everything is shut down and restarted, it might still be operating on the slot and publication without holding the lock, causing inconsistencies across the two Electrics.
Expected behavior
No modifying operations should be made by an Electric if it's not holding the lock.
Suggested solution
- Replication connection modifications
  - Solved by moving the lock inside the replication connection: https://github.com/electric-sql/stratovolt/issues/811
- Replication slot dropping on cleanup
  - Since we do this in the connection manager, we can ensure the lock is alive before dropping the slot; otherwise, apply the same solution as for the publication manager below.
- Publication manager modifications
  - Before any modification to the publication, check that 1) the lock is active, and 2) it is owned by the stack itself, either by matching the backend pid or, if the advisory lock can be given some extra metadata, by matching on that (this needs investigating).
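For reference, a minimal sketch of what the "matching the backend pid" check could look like as a query against `pg_locks`. It assumes the lock was taken with a single `bigint` key, which `pg_locks` exposes as the high half in `classid`, the low half in `objid`, and `objsubid = 1`. The helper names, the `%s` placeholder style and the idea of remembering the holder's backend pid are illustrative assumptions, not existing Electric code.

```python
from __future__ import annotations


def split_advisory_key(key: int) -> tuple[int, int]:
    """Map a signed 64-bit advisory lock key to (classid, objid) as shown in pg_locks."""
    if not -(2**63) <= key < 2**63:
        raise ValueError("advisory lock key must fit in a signed 64-bit integer")
    unsigned = key & 0xFFFFFFFFFFFFFFFF  # two's-complement view of the key
    return unsigned >> 32, unsigned & 0xFFFFFFFF


# Returns one row only if `pid` currently holds the lock in this database.
# objsubid = 1 marks a lock taken with a single bigint key.
LOCK_OWNER_SQL = """
SELECT 1
FROM pg_locks
WHERE locktype = 'advisory'
  AND database = (SELECT oid FROM pg_database WHERE datname = current_database())
  AND classid::bigint = %s
  AND objid::bigint = %s
  AND objsubid = 1
  AND pid = %s
  AND granted
"""


def lock_owner_query(key: int, holder_pid: int) -> tuple[str, tuple[int, int, int]]:
    """Build the query and parameters for the ownership check (hypothetical helper)."""
    classid, objid = split_advisory_key(key)
    return LOCK_OWNER_SQL, (classid, objid, holder_pid)
```

Here `holder_pid` would be the result of `pg_backend_pid()` recorded on the connection that took the lock. An empty result means either nobody holds the lock or a different backend does.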
This doesn't prevent the case where the lock is dropped after the check and acquired by a different instance of Electric.
A fool-proof way to achieve mutual exclusion would be to make any publication modifications in a transaction and acquire the lock once again for the duration of the transaction. From PostgreSQL docs:
A lock can be acquired multiple times by its owning process; for each completed lock request there must be a corresponding unlock request before the lock is actually released. Transaction-level lock requests, on the other hand, behave more like regular lock requests: they are automatically released at the end of the transaction, and there is no explicit unlock operation. This behavior is often more convenient than the session-level behavior for short-term usage of an advisory lock.
If a session already holds a given advisory lock, additional requests by it will always succeed, even if other sessions are awaiting the lock; this statement is true regardless of whether the existing lock hold and new request are at session level or transaction level.
The difficulty in our case is that we have a replication connection holding the lock but we want to update the publication in a pooled connection, i.e. in a different session from PG's point of view. We need to be really careful about the locking strategy here but I would argue it has to rely on obtaining some sort of lock for publication changes instead of just checking a lock status before performing the modifications.
One solution is to acquire a second, transaction-level lock in the same transaction where publication changes are being performed. This at least ensures that another Electric won't try to modify the publication concurrently, at the cost of it failing to perform its update if the old instance hasn't yet committed the transaction.
To be completely sure that there's no other instance modifying the publication, the new instance can acquire the publication lock right after it obtains the exclusive lock, and then release it. This will make the new instance wait until the old one is done with the publication before proceeding to initialize itself.
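To make the ordering argument concrete, here is a toy in-memory model in Python. It is not PostgreSQL and not Electric code: `AdvisoryLocks`, the session names and the lock keys are all made up for illustration, lock counts are not modelled, and `try_lock` stands in for what would be a blocking `pg_advisory_lock` call in practice.

```python
from __future__ import annotations


class AdvisoryLocks:
    """Toy model: each key has at most one owning session (exclusive locks only)."""

    def __init__(self) -> None:
        self._owner: dict[str, str] = {}

    def try_lock(self, session: str, key: str) -> bool:
        # Re-acquiring a lock the session already holds always succeeds.
        return self._owner.setdefault(key, session) == session

    def release(self, session: str, key: str) -> None:
        if self._owner.get(key) == session:
            del self._owner[key]

    def release_all(self, session: str) -> None:
        # Models both a dropped connection and the end of a transaction.
        for key in [k for k, s in self._owner.items() if s == session]:
            del self._owner[key]


def barrier_scenario() -> tuple[bool, bool]:
    locks = AdvisoryLocks()
    # Old instance: the replication connection holds the main lock while a
    # pooled connection is in the middle of a publication update.
    assert locks.try_lock("old-replication", "main")
    assert locks.try_lock("old-pool-txn", "publication")

    # The old lock connection is lost, so the new instance gets the main lock.
    locks.release_all("old-replication")
    assert locks.try_lock("new-replication", "main")

    # Barrier: the new instance cannot pass while the old update is in flight.
    before = locks.try_lock("new-replication", "publication")

    # The old transaction commits (or its connection dies).
    locks.release_all("old-pool-txn")
    after = locks.try_lock("new-replication", "publication")
    locks.release("new-replication", "publication")
    return before, after
```

In this model `barrier_scenario()` returns `(False, True)`: the new instance only gets past the barrier once the old publication transaction has ended.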
@alco just checking whether what I have in mind matches what you're proposing: we could do a mixed approach where, on every publication update, we grab a publication-specific transaction-level lock and then check if we hold the global advisory lock. If we do, we proceed with the update, and if we don't, we abort it.
This way, even if the advisory lock is lost after the check, whoever owns the advisory lock will have to wait for the transaction-level lock to be released before modifying the publication, and after it is released the previous electric would not be able to grab it again because it does not have the advisory lock.
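A rough sketch of that flow, written in Python against a generic cursor object that only needs an `execute` method. The function name, the `MainLockLost` exception and the `holds_main_lock` callback are hypothetical; the callback stands for whatever ownership check the stack ends up using. The sketch also assumes the connection is in autocommit mode, so the explicit `BEGIN`/`COMMIT` statements delimit the transaction.

```python
from __future__ import annotations

from typing import Any, Callable


class MainLockLost(Exception):
    """Raised when the stack no longer holds the global advisory lock."""


def update_publication(
    cur: Any,
    publication_key: int,
    holds_main_lock: Callable[[Any], bool],
    alter_sql: str,
) -> None:
    cur.execute("BEGIN")
    try:
        # Transaction-level lock: released automatically at COMMIT or ROLLBACK.
        # This call blocks; pg_try_advisory_xact_lock is the non-blocking variant.
        cur.execute("SELECT pg_advisory_xact_lock(%s)", (publication_key,))
        # Check ownership only after the publication lock is held, so a new
        # owner of the global lock has to wait for this transaction to end.
        if not holds_main_lock(cur):
            raise MainLockLost("global advisory lock is not held by this stack")
        cur.execute(alter_sql)
    except BaseException:
        cur.execute("ROLLBACK")
        raise
    cur.execute("COMMIT")
```

If the check fails, the transaction is rolled back, which also releases the publication lock, and the caller sees `MainLockLost` instead of a silently applied change.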
@msfstef yes exactly.
Regarding this last part:
after it is released the previous electric would not be able to grab it again because it does not have the advisory lock.
When the lock connection dies, it also brings down the connection pool. So the contention interval between the old and the new Electric is as long as it takes for the exit signals to propagate between Elixir processes.