DCS failsafe mode
If enabled, it allows Patroni to cope with DCS outages. In case of a DCS outage the leader tries to call all remaining cluster members via the REST API, and if all of them respond successfully the leader will not be demoted.
The failsafe_mode can be enabled by running
`patronictl edit-config -s failsafe_mode=true`
or by calling the `/config` REST API endpoint.
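The REST API route can look roughly like this; a minimal sketch, assuming a locally reachable REST API on port 8008 (add authentication options if your setup requires them):

```bash
# Enable failsafe_mode via the /config endpoint; PATCH accepts a JSON document
# containing only the keys you want to change in the dynamic configuration.
curl -s -XPATCH -d '{"failsafe_mode": true}' http://localhost:8008/config
```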
Pull Request Test Coverage Report for Build 3910374889
- 133 of 135 (98.52%) changed or added relevant lines in 5 files are covered.
- No unchanged relevant lines lost coverage.
- Overall coverage increased (+0.002%) to 99.844%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| patroni/ha.py | 109 | 111 | 98.2% |
| Total: | 133 | 135 | 98.52% |

| Totals | |
|---|---|
| Change from base Build 3900371470: | 0.002% |
| Covered Lines: | 11506 |
| Relevant Lines: | 11524 |
💛 - Coveralls
Nice! Happy to see this. Should obviate the need for https://github.com/zalando/patroni/pull/2318 as well.
@thedodd yes and no. This implementation tries to distinguish between the reasons why the update of the leader lock failed: the failsafe mechanism triggers only when we get a network communication exception or some sort of InternalError exception from the DCS. Right now it is hard to guarantee that all possible corner cases are covered.
#2318 simply considers any failed attempt to update the leader lock as a failsafe trigger and should not bring many surprises. Also, it is a very common request to have Patroni just handle configuration management (yeah, I know, it defeats the purpose of Patroni :) ).
@CyberDem0n just some quick feedback. I've created a Docker image to test these changes, and my setup is as follows:
- All code is present as of this branch's latest commit 772fe19.
- I've configured a 2-member cluster, with failsafe_mode=true, running in a K8s cluster.
- When I query their /failsafe endpoint, they return `"member0": "http://10.22.27.41:8008/patroni", "member1": "http://10.22.10.123:8008/patroni"` (a curl sketch follows this list).
- The cluster is stable and has a master & a replica.
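For reference, that /failsafe check can be reproduced with a plain GET against any member; the address below is just the one from this setup:

```bash
# Returns the map of members the leader would try to contact during a DCS outage.
curl -s http://10.22.27.41:8008/failsafe
```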
If I then delete the K8s RoleBinding which grants Patroni access to the K8s API (for updating endpoints/services/labels &c), it looks like the primary is still demoting.
Technically, the K8s API will have returned a 403 (which I see in the Patroni logs). Does this circumvent the failsafe logic? Upon reviewing the code, it is not immediately obvious to me.
Hi,
Not claiming a review, just a note after playing with it locally (`python3 ./patroni.py postgres0.yaml` etc.), and it works like a charm!
One note: since the primary does `POST /failsafe` to all known members, relying on authentication, we probably want to verify that the primary can POST to all other nodes before enabling failsafe. Otherwise there will be a nasty surprise when the actual failsafe is triggered, as seen when adding `postgres2.yaml` to the cluster, since it is the only one defining authentication credentials (a rough pre-flight probe is sketched after the log below):
2022-08-31 14:48:16,287 INFO: Got response from postgresql2 http://127.0.0.1:8010/patroni: no auth header received
2022-08-31 14:48:16,287 INFO: Got response from postgresql1 http://127.0.0.1:8009/patroni: Accepted
2022-08-31 14:48:16,288 INFO: demoting self because DCS is not accessible and I was a leader
2022-08-31 14:48:16,288 INFO: Demoting self (offline)
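A rough pre-flight probe for this could look like the sketch below. URLs and credentials are placeholders, and the exact response to a hand-crafted empty POST isn't pinned down here; the point is only to tell an auth rejection apart from an accepted request:

```bash
# Hypothetical check: send an authenticated POST /failsafe to every member the
# primary would contact. A 401 / "no auth header received" style reply is the
# failure mode shown in the log above.
for member in http://127.0.0.1:8009 http://127.0.0.1:8010; do
  printf '%s -> ' "$member"
  curl -s -o /dev/null -w '%{http_code}\n' -XPOST --user admin:admin "$member/failsafe"
done
```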
> If I then delete the K8s RoleBinding which grants Patroni access to the K8s API (for updating endpoints/services/labels &c), it looks like the primary is still demoting. Technically, the K8s API will have returned a 403 (which I see in the Patroni logs). Does this circumvent the failsafe logic? Upon reviewing the code, it is not immediately obvious to me.
Hi @thedodd. The failsafe logic is triggered when the DCS doesn't respond in time or when it asks to repeat the request after some timeout, but retrying may hit the same issue.
In both cases Patroni gives up after `retry_timeout` seconds and the failsafe mode is triggered. The 403 status code only indicates that the DCS rejects requests from the current node while in fact it is healthy. I can't immediately tell whether it would be safe to use 403 to trigger the failsafe mechanism.
@alexeyklyukin yeah, nodes must be able to communicate with each other using credentials and/or client certificates.
Sure, we can call `POST /failsafe` in order to make sure that authentication succeeds, but there are still many questions:
- Should we completely ignore the failsafe mode in this case or just not add the node to the list?
- On the next iteration the check will be executed again; it will not only spam the logs but also cause increased resource usage.
- What if the check succeeds initially but starts failing later? For example, someone changed credentials on the other node (which is not easy to detect), or some certificates expired.
@CyberDem0n ok, quick update: by way of manually crippling the K8s API server pods of the testing cluster, I was able to induce the needed downtime scenario. Failsafe mode appears to have held the Patroni cluster steady (1 master, 1 replica). I could connect, reconnect, and continue to use the master as expected. No cluster topology changes during the control plane downtime.
Hi, I tried this PR and got an error communicating with DCS, but I can't see that this feature is working:
```
2022-09-01 01:03:28,018 INFO: Reconnection allowed, looking for another server.
2022-09-01 01:03:28,018 ERROR: Error communicating with DCS
2022-09-01 01:03:28,019 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
2022-09-01 01:03:28,020 INFO: demoting self because DCS is not accessible and I was a leader
2022-09-01 01:03:28,020 INFO: Demoting self (offline)
```
What is my mistake? I turned it on in patroni.yaml:
dcs:
  ttl: 30
  loop_wait: 10
  failsafe_mode: true
  retry_timeout: 10
  maximum_lag_on_failover: 1048576
> What is my mistake?
@anyafit for existing clusters one should use `patronictl edit-config` to enable it: https://github.com/zalando/patroni/blob/e54f736651743a2521bdf393f067ea191ae08e19/docs/dcs_failsafe_mode.rst
You can also check whether the feature is enabled using `patronictl show-config`.
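Put together, a sketch of that workflow for an existing cluster (the configuration path is just an example):

```bash
# Enable the feature in the dynamic (DCS) configuration...
patronictl -c /etc/patroni/patroni.yaml edit-config -s failsafe_mode=true
# ...then confirm it shows up in the cluster-wide configuration.
patronictl -c /etc/patroni/patroni.yaml show-config | grep failsafe_mode
```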
> by way of manually crippling the K8s API server pods of the testing cluster, I was able to induce the needed downtime scenario. Failsafe mode appears to have held the Patroni cluster steady
@thedodd for simple cases I am quite confident that it works. We have behave tests for that, which send SIGSTOP to the localkube process in order to simulate an outage.
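A manual variant of that kind of test is to pause the DCS process and resume it later; a sketch, assuming a local etcd (process name and timing are placeholders, and the pause should exceed ttl):

```bash
# Freeze the DCS long enough for the leader to hit the failsafe path...
pkill -STOP etcd
sleep 40   # longer than ttl (30s in the config shown earlier)
# ...then resume it and check whether the leader kept its role.
pkill -CONT etcd
```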
> for simple cases I am quite confident that it works
Awesome! Any particular edge cases you are concerned about that you think might need some additional testing? We talked about the 403 cases earlier, but I'm not concerned about that. I think the implementation as it is right now is doing the correct thing.
@CyberDem0n thoughts on when you would like to land this PR? Waiting on a few other reviews?
@thedodd it should be more sophisticated than just stopping the control plane. For example:
- slow network (see the netem sketch after this list)
- overloaded control-plane
- leader not being able to access the control plane, with replicas not having this problem
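For the slow-network case, traffic shaping is one way to degrade rather than cut the path to the DCS; a sketch with netem, assuming the DCS is reached via eth0 (needs root, and it delays all traffic on that interface):

```bash
# Add latency/jitter so DCS requests start running into retry_timeout...
tc qdisc add dev eth0 root netem delay 800ms 200ms
# ...exercise the cluster, then remove the impairment again.
tc qdisc del dev eth0 root netem
```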
Hi @CyberDem0n, any updates on this feature 🤗 ?
@OlleLarsson the more people test it, the faster it will be merged and released. Feel free to participate :)
> @OlleLarsson the more people test it, the faster it will be merged and released. Feel free to participate :)
Perhaps I can find some time. Is there any specific scenario that's extra interesting to test?
@OlleLarsson ideally it should be something close to real problems with DCS.
Hi, I had some problems connecting to etcd. Does this mean the problem was resolved too quickly?
2022-10-27 20:18:44,385 INFO: Lock owner: patroni_hostname; I am patroni_hostname
2022-10-27 20:18:47,724 ERROR: Request to server http://etcd_hostname:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='etcd_hostname', port=2379): Read timed out. (read timeout=3.333218111967047)",)
2022-10-27 20:18:47,724 INFO: Reconnection allowed, looking for another server.
2022-10-27 20:18:47,724 INFO: Retrying on http://etcd_hostname_2:2379
2022-10-27 20:18:47,907 INFO: Selected new etcd server http://etcd_hostname_2:2379
2022-10-27 20:18:47,908 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
2022-10-27 20:18:48,401 INFO: no action. I am (patroni_hostname), the leader with the lock
2022-10-27 20:18:49,577 ERROR: Request to server http://etcd_hostname_2:2379 failed: MaxRetryError("HTTPConnectionPool(host='etcd_hostname_2', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f48d2f91c50>, 'Connection to etcd_hostname_2 timed out. (connect timeout=1.6666666666666667)'))",)
2022-10-27 20:18:49,577 INFO: Reconnection allowed, looking for another server.
2022-10-27 20:18:49,577 INFO: Retrying on http://etcd_hostname:2379
2022-10-27 20:18:49,755 INFO: Selected new etcd server http://etcd_hostname:2379
2022-10-27 20:18:55,924 INFO: no action. I am (patroni_hostname), the leader with the lock
👍