patroni
fast leader demotion
Hi,
When the Patroni leader node fails to update the leader lock on the DCS, it demotes itself immediately. In our case this leads to a long PG startup and could lead to complete cluster downtime (if it occurs twice on a 2-node Patroni cluster within the timeframe when the first demoted node is still not up and ready). Would it be possible to add a configuration parameter (ALWAYS_DEMOTE_FAST) or a tag (fast) that would always use fast mode?
What environment do you run Patroni in? And what DCS do you use?
The major difference between 'immediate' and 'fast' shutdown is that during a fast shutdown postgres will do its best to make sure that the WAL stream is transmitted to all replicas streaming from the master and that all remaining WAL files are archived, till the very end.
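To make that distinction concrete, here is a minimal illustration of the two shutdown modes in terms of pg_ctl. This is not how Patroni itself issues the shutdown internally, and the data directory path is just a placeholder:

```python
# Minimal illustration of the shutdown modes discussed above; Patroni drives
# the shutdown through its own internals, this only shows the pg_ctl semantics.
import subprocess

DATADIR = "/var/lib/postgresql/data"  # placeholder path

def stop_postgres(mode: str) -> int:
    """'fast' performs a shutdown checkpoint and lets walsenders/archiver drain;
    'immediate' aborts at once, so the next start has to run crash recovery."""
    assert mode in ("smart", "fast", "immediate")
    return subprocess.run(["pg_ctl", "stop", "-D", DATADIR, "-m", mode]).returncode

# stop_postgres("fast")       # slower to stop, but no crash recovery on restart
# stop_postgres("immediate")  # fastest to stop, a long crash recovery may follow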
There are the following places where the demote method is called with 'immediate' or 'immediate-nolock':
1. ha.py#L272-L273 -- the master has crashed and master_start_timeout=0. Crash recovery is needed anyway and, more importantly, we want to fail over as soon as possible, therefore there is no other option except immediate.
2. ha.py#L345 -- postgres is running as the master, but without the leader lock. Probably we could do the fast shutdown here, but we really want to avoid transferring and archiving the remaining WALs, because potentially there might be other members streaming from this node.
3. ha.py#L476 -- postgres is running as the master and with the leader lock, but the watchdog could not be activated. I think it should be safe to do a fast shutdown here.
4. ha.py#L862 -- postgres is running as the master, with the leader lock, but we failed to update the leader lock. Using something other than immediate would be less safe here for the same reasons as in 2.
5. ha.py#L862 -- reinitialize the node. Since we will remove the data directory here anyway, immediate is absolutely fine.
6. ha.py#L1025 -- the only possibility to get to this place is losing the leader key during the restart of postgres running as the master. Immediate is the safest choice here.
7. ha.py#L1055 -- failed to bootstrap the new cluster. More or less the same reasons as 5.
8. ha.py#L1101 -- failed to update the leader lock while postgres was starting up. In case we are starting with a recovery.conf in PGDATA, we could use the fast shutdown here; doing the fast shutdown if postgres is starting up as the master would not be safe.
9. ha.py#L1110 -- reached master_start_timeout. More or less the same situation as 8.
10. ha.py#L1177 -- something has happened to the PGDATA (it disappeared or became empty). Immediate is the only safe choice here.
Basically, in all of the above-mentioned cases, with a few minor exceptions, using the fast shutdown would be either unsafe or unnecessary.
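As a rough mental model of what the demote mode selects in the list above (the flag names below are an approximation for illustration, not a quote of the actual ha.py code):

```python
# Illustrative approximation of what the demote mode controls; the real logic
# lives in patroni/ha.py and the flag names here are assumptions.
DEMOTE_MODES = {
    # fast shutdown (drain WAL to replicas/archiver), then release the leader key
    "graceful":         {"pg_stop_mode": "fast",      "release_leader_key": True},
    # abort postgres at once, release the leader key so another member can take over
    "immediate":        {"pg_stop_mode": "immediate", "release_leader_key": True},
    # abort postgres at once and leave the leader key alone (e.g. the DCS is unreachable)
    "immediate-nolock": {"pg_stop_mode": "immediate", "release_leader_key": False},
}

def demote(mode: str) -> None:
    opts = DEMOTE_MODES[mode]
    # stand-ins for the real actions:
    print(f"stopping postgres with pg_ctl -m {opts['pg_stop_mode']}")
    if opts["release_leader_key"]:
        print("deleting the leader key in the DCS")
```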
First of all, you need to figure out why it failed to update the leader lock. Could it be that you run etcd on the same nodes, and especially on the same disks, as postgres? Etcd is very sensitive to disk latency. If postgres is heavily utilizing the disk, it could cause some trouble with etcd.
Thanks for explaining every situation where immediate demotion is used; I find this very useful and it would probably be great to add it to the documentation.
We run Patroni on a test cluster, since we are evaluating it to replace our current repmgr setup. We use etcd as the DCS. We simulated a situation where the etcd cluster is not healthy to test what would happen, so it was demoted in case 4 (ha.py#L862).
During testing we saw that if the etcd cluster is unhealthy, Patroni can't update the leader lock and demotes the leader (as expected), but this can lead to complete cluster downtime if crash recovery takes a long time. If Patroni stopped PG gracefully, it would be much faster.
Regarding the reasons for using immediate:
- avoiding transferring and archiving WAL files -> since Patroni runs in a loop (every 10 seconds by default), it is also possible that for those 10 seconds we transfer and archive WAL files while Patroni can't reach etcd (but we don't know that, since the connection failed just after the last loop finished), so the additional 1-2 seconds of WAL streaming/archiving (when fast is used) wouldn't add that much.
Another thing: would it be possible for Patroni not to demote the master at all when the etcd cluster is unhealthy? When the etcd cluster is unhealthy, read requests should still work and Patroni can see all members in the cluster info. Would it be OK for Patroni to check whether the other Patroni nodes are up and are replicas, and if they are, not demote? In this way Patroni would be more resilient to etcd cluster node failures.
> would it be possible for Patroni not to demote the master at all when the etcd cluster is unhealthy?
No, it is absolutely not safe.
> When the etcd cluster is unhealthy, read requests should still work and Patroni can see all members in the cluster info.
How do you know that it is unhealthy? What if the etcd node is just partitioned? The reason read requests still work is that we don't use quorum reads, for performance reasons.
> Would it be OK for Patroni to check whether the other Patroni nodes are up and are replicas, and if they are, not demote?
In the distributed world things get more complicated. If you really want to go that way (use direct communication between all cluster members), you'd have to reimplement the Raft/ZAB/Paxos protocol in Patroni, which absolutely doesn't make sense.
> In this way Patroni would be more resilient to etcd cluster node failures.
I don't really understand why you think etcd is so unreliable. In our experience of running etcd clusters for more than 3 years, we have never experienced real problems caused by etcd itself. If you want to improve etcd resilience, just run 5 nodes. It is very, very unlikely that 3 nodes out of 5 will go down.
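To make the "just run 5 nodes" point concrete: etcd (Raft) stays writable as long as a majority of members is reachable, so the number of tolerated failures grows with cluster size. A quick back-of-the-envelope check:

```python
# Raft/etcd availability arithmetic: a cluster of n members keeps quorum
# as long as floor(n/2) + 1 of them are reachable.
def quorum(n: int) -> int:
    return n // 2 + 1

for n in (1, 3, 5, 7):
    print(f"{n} members: quorum={quorum(n)}, tolerates {n - quorum(n)} failed members")
# 3 members tolerate 1 failure; 5 members tolerate 2, i.e. losing quorum
# with 5 nodes requires 3 members to fail at the same time.
```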
Thanks for the explanations! This is helpful. I don't think that etcd is unreliable; I am only anticipating network issues, which is the reason for raising this.
I will try the changes you suggested that could be safe: https://github.com/zalando/patroni/compare/master...tkosgrabar:fast?expand=1
On most points I agree with Alexander's assessment of each of the immediate demotion cases. But the one case I would disagree on is where updating the leader lock fails while running as master. There are 2 main probable causes for hitting this:
1. The system running Patroni is so slow, or was suspended, that the leader lock expired before it could be renewed.
2. The DCS lost quorum.
While neither of them should be normal, it seems to me that 2 is much more likely than 1.
When the watchdog is active we don't actually have to worry about the leader lock expiring; the watchdog contract is to make the system die before that can happen. So a simple change would be to use a fast shutdown when the watchdog is active.
When we don't have a watchdog, things get more complicated, as we have no guarantees. But we could make some reasonable assumptions, for example that the clock is still working properly. Maybe it would make sense to start a background thread that waits for ttl - safety_margin - time_since_last_lock and then issues an immediate stop, while the main thread does a fast shutdown with an on_safepoint action to cancel the immediate shutdown. This would definitely have a slightly higher chance of split brain, but the risk may be worth it for people for whom an immediate shutdown takes the node out of commission for an extended period.
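A minimal sketch of that idea, purely illustrative (the function and parameter names are assumptions, not Patroni's actual API): try the fast shutdown first, but arm a timer that escalates to an immediate stop before the leader key can expire.

```python
# Sketch of "fast shutdown with an immediate-stop deadline"; all names here
# are illustrative assumptions, not Patroni's real internals.
import threading
import time

def demote_with_deadline(stop_fast, stop_immediate, ttl, safety_margin, last_lock_update):
    """Attempt a fast shutdown, but escalate to an immediate one before the
    leader key (last renewed at last_lock_update) can expire."""
    deadline = ttl - safety_margin - (time.time() - last_lock_update)
    escalated = threading.Event()

    def escalate():
        escalated.set()
        stop_immediate()              # out of time: abort postgres right away

    timer = threading.Timer(max(deadline, 0), escalate)
    timer.start()
    try:
        stop_fast()                   # drains WAL to replicas/archiver if it finishes in time
    finally:
        if not escalated.is_set():
            timer.cancel()            # the fast shutdown reached its "safepoint" first
```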
I thought of one more possible cause for hitting the failed-update case: someone manually reassigned the lock while the HA loop was running. But I think an admin who does that is allowed to keep both pieces.
> would it be possible for Patroni not to demote the master at all when the etcd cluster is unhealthy?
>
> No, it is absolutely not safe.
It's not obvious at first why doing nothing is unsafe, since some other witness/arbiter HA tools take this approach.
From what I am able to gather, the reason is this: without the DCS, Patroni has no other way of determining whether the standby databases are still streaming from the leader.
IMO this one could be closed, because DCS failsafe mode solves the original issue (fails to update leader lock on DCS).
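For completeness, this is roughly what DCS failsafe mode does when the DCS becomes unreachable: the primary keeps its role only if every other known member acknowledges it over the Patroni REST API, otherwise it demotes. A simplified sketch (endpoint path, payload, and the helper names are simplified assumptions here, not the exact wire protocol):

```python
# Rough illustration of the failsafe-mode check: keep the leader role only if
# *all* other members answer; any member we cannot reach means demote.
import urllib.request

def can_keep_leader(member_api_urls, timeout=2.0) -> bool:
    for url in member_api_urls:
        try:
            req = urllib.request.Request(url, data=b"{}", method="POST")
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                if resp.status != 200:
                    return False
        except OSError:
            return False
    return True

# Hypothetical usage on the primary while the DCS is down:
# members = ["http://node2:8008/failsafe", "http://node3:8008/failsafe"]
# if not can_keep_leader(members):
#     demote("immediate-nolock")
```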