DCS failsafe mode
If enabled, it allows Patroni to cope with DCS outages. In case of a DCS outage the leader tries to call all remaining cluster members via the REST API, and if all of them respond successfully the leader will not be demoted.
The failsafe_mode can be enabled by running
`patronictl edit-config -s failsafe_mode=true`
or by calling the `/config` REST API endpoint.
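The REST API route can look roughly like this; a minimal sketch, assuming a locally reachable REST API on port 8008 (add authentication options if your setup requires them):

```bash
# Enable failsafe_mode via the /config endpoint; PATCH accepts a JSON document
# containing only the keys you want to change in the dynamic configuration.
curl -s -XPATCH -d '{"failsafe_mode": true}' http://localhost:8008/config
```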
Pull Request Test Coverage Report for Build 3910374889
- 133 of 135 (98.52%) changed or added relevant lines in 5 files are covered.
- No unchanged relevant lines lost coverage.
- Overall coverage increased (+0.002%) to 99.844%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| patroni/ha.py | 109 | 111 | 98.2% |
| Total: | 133 | 135 | 98.52% |

| Totals | |
|---|---|
| Change from base Build 3900371470: | 0.002% |
| Covered Lines: | 11506 |
| Relevant Lines: | 11524 |
💛 - Coveralls
Nice! Happy to see this. Should obviate the need for https://github.com/zalando/patroni/pull/2318 as well.
@thedodd yes and no. This implementation tries to distinguish between the reasons why the update of the leader lock failed: the failsafe mechanism triggers only when we get a network communication exception or some sort of InternalError exception from the DCS. Right now it is hard to guarantee that all possible corner cases are covered.
#2318 simply considers any failed attempt to update the leader lock as a failsafe trigger and should not bring many surprises. Also, it is a very common request to have Patroni just handle configuration management (yeah, I know, it defeats the purpose of Patroni :) ).
@CyberDem0n just some quick feedback. I've created a Docker image to test these changes, and my setup is as follows:
- All code is present as of this branch's latest commit 772fe19.
- I've configured a 2-member cluster, with failsafe_mode=true, running in a K8s cluster.
- When I query their /failsafe endpoint, they return `"member0": "http://10.22.27.41:8008/patroni", "member1": "http://10.22.10.123:8008/patroni"` (a curl sketch follows this list).
- The cluster is stable and has a master & a replica.
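For reference, that /failsafe check can be reproduced with a plain GET against any member; the address below is just the one from this setup:

```bash
# Returns the map of members the leader would try to contact during a DCS outage.
curl -s http://10.22.27.41:8008/failsafe
```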
If I then delete the K8s RoleBinding which grants Patroni access to the K8s API (for updating endpoints/services/labels &c), it looks like the primary is still demoting.
Technically, the K8s API will have returned a 403 (which I see in the Patroni logs). Does this circumvent the failsafe logic? Upon reviewing the code, it is not immediately obvious to me.
Hi,
Not claiming a review, just a note after playing with it locally (`python3 ./patroni.py postgres0.yaml` etc.), and it works like a charm!
One note: since the primary does `POST /failsafe` to all known members, relying on authentication, we probably want to verify that the primary can POST to all other nodes before enabling failsafe. Otherwise there will be a nasty surprise when the actual failsafe is triggered, as seen when adding `postgres2.yaml` to the cluster, since it is the only one defining authentication credentials (a rough pre-flight probe is sketched after the log below):
2022-08-31 14:48:16,287 INFO: Got response from postgresql2 http://127.0.0.1:8010/patroni: no auth header received
2022-08-31 14:48:16,287 INFO: Got response from postgresql1 http://127.0.0.1:8009/patroni: Accepted
2022-08-31 14:48:16,288 INFO: demoting self because DCS is not accessible and I was a leader
2022-08-31 14:48:16,288 INFO: Demoting self (offline)
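A rough pre-flight probe for this could look like the sketch below. URLs and credentials are placeholders, and the exact response to a hand-crafted empty POST isn't pinned down here; the point is only to tell an auth rejection apart from an accepted request:

```bash
# Hypothetical check: send an authenticated POST /failsafe to every member the
# primary would contact. A 401 / "no auth header received" style reply is the
# failure mode shown in the log above.
for member in http://127.0.0.1:8009 http://127.0.0.1:8010; do
  printf '%s -> ' "$member"
  curl -s -o /dev/null -w '%{http_code}\n' -XPOST --user admin:admin "$member/failsafe"
done
```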
> If I then delete the K8s RoleBinding which grants Patroni access to the K8s API (for updating endpoints/services/labels &c), it looks like the primary is still demoting. Technically, the K8s API will have returned a 403 (which I see in the Patroni logs). Does this circumvent the failsafe logic? Upon reviewing the code, it is not immediately obvious to me.
Hi @thedodd. The failsafe logic is triggered when the DCS doesn't respond in time or when it asks to repeat the request after some timeout, but retrying may hit the same issue.
In both cases Patroni gives up after `retry_timeout` seconds and the failsafe mode is triggered. The 403 status code only indicates that the DCS rejects requests from the current node while in fact it is healthy. I can't immediately tell whether it would be safe to use 403 to trigger the failsafe mechanism.
@alexeyklyukin yeah, nodes must be able to communicate with each other using credentials and/or client certificates.
Sure, we can call `POST /failsafe` in order to make sure that authentication succeeds, but there are still many questions:
- Should we completely ignore the failsafe mode in this case or just not add the node to the list?
- On the next iteration the check will be executed again; it will not only spam the logs but also cause increased resource usage.
- What if the check succeeds initially but starts failing later? For example, someone changed credentials on the other node (which is not easy to detect), or some certificates expired.
@CyberDem0n ok, quick update: by way of manually crippling the K8s API server pods of the testing cluster, I was able to induce the needed downtime scenario. Failsafe mode appears to have held the Patroni cluster steady (1 master, 1 replica). I could connect, reconnect, and continue to use the master as expected. No cluster topology changes during the control plane downtime.
Hi, I tried this PR and got an error communicating with DCS, but I can't see that this feature is working:
```
2022-09-01 01:03:28,018 INFO: Reconnection allowed, looking for another server.
2022-09-01 01:03:28,018 ERROR: Error communicating with DCS
2022-09-01 01:03:28,019 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
2022-09-01 01:03:28,020 INFO: demoting self because DCS is not accessible and I was a leader
2022-09-01 01:03:28,020 INFO: Demoting self (offline)
```
What is my mistake? I turned it on in patroni.yaml:
dcs:
  ttl: 30
  loop_wait: 10
  failsafe_mode: true
  retry_timeout: 10
  maximum_lag_on_failover: 1048576
> What is my mistake?
@anyafit for existing clusters one should use `patronictl edit-config` to enable it: https://github.com/zalando/patroni/blob/e54f736651743a2521bdf393f067ea191ae08e19/docs/dcs_failsafe_mode.rst
You can also check whether the feature is enabled using `patronictl show-config`.
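Put together, a sketch of that workflow for an existing cluster (the configuration path is just an example):

```bash
# Enable the feature in the dynamic (DCS) configuration...
patronictl -c /etc/patroni/patroni.yaml edit-config -s failsafe_mode=true
# ...then confirm it shows up in the cluster-wide configuration.
patronictl -c /etc/patroni/patroni.yaml show-config | grep failsafe_mode
```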
> by way of manually crippling the K8s API server pods of the testing cluster, I was able to induce the needed downtime scenario. Failsafe mode appears to have held the Patroni cluster steady
@thedodd for simple cases I am quite confident that it works. We have behave tests for that, which send SIGSTOP to the localkube process in order to simulate an outage.
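A manual variant of that kind of test is to pause the DCS process and resume it later; a sketch, assuming a local etcd (process name and timing are placeholders, and the pause should exceed ttl):

```bash
# Freeze the DCS long enough for the leader to hit the failsafe path...
pkill -STOP etcd
sleep 40   # longer than ttl (30s in the config shown earlier)
# ...then resume it and check whether the leader kept its role.
pkill -CONT etcd
```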
> for simple cases I am quite confident that it works
Awesome! Any particular edge cases you are concerned about that you think might need some additional testing? We talked about the 403 cases earlier, but I'm not concerned about that. I think the implementation as it is right now is doing the correct thing.
@CyberDem0n thoughts on when you would like to land this PR? Waiting on a few other reviews?
@thedodd it should be more sophisticated than just stopping the control plane. For example:
- slow network (see the netem sketch after this list)
- overloaded control-plane
- leader not being able to access the control plane, with replicas not having this problem
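For the slow-network case, traffic shaping is one way to degrade rather than cut the path to the DCS; a sketch with netem, assuming the DCS is reached via eth0 (needs root, and it delays all traffic on that interface):

```bash
# Add latency/jitter so DCS requests start running into retry_timeout...
tc qdisc add dev eth0 root netem delay 800ms 200ms
# ...exercise the cluster, then remove the impairment again.
tc qdisc del dev eth0 root netem
```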
Hi @CyberDem0n, any updates on this feature 🤗 ?
@OlleLarsson the more people test it, the faster it will be merged and released. Feel free to participate :)
> @OlleLarsson the more people test it, the faster it will be merged and released. Feel free to participate :)
Perhaps I can find some time. Is there any specific scenario that's extra interesting to test?
@OlleLarsson ideally it should be something close to real problems with DCS.
Hi, I had some problems connecting to etcd. Does this mean the problem was resolved too quickly?
2022-10-27 20:18:44,385 INFO: Lock owner: patroni_hostname; I am patroni_hostname
2022-10-27 20:18:47,724 ERROR: Request to server http://etcd_hostname:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='etcd_hostname', port=2379): Read timed out. (read timeout=3.333218111967047)",)
2022-10-27 20:18:47,724 INFO: Reconnection allowed, looking for another server.
2022-10-27 20:18:47,724 INFO: Retrying on http://etcd_hostname_2:2379
2022-10-27 20:18:47,907 INFO: Selected new etcd server http://etcd_hostname_2:2379
2022-10-27 20:18:47,908 ERROR: watchprefix failed: ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
2022-10-27 20:18:48,401 INFO: no action. I am (patroni_hostname), the leader with the lock
2022-10-27 20:18:49,577 ERROR: Request to server http://etcd_hostname_2:2379 failed: MaxRetryError("HTTPConnectionPool(host='etcd_hostname_2', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f48d2f91c50>, 'Connection to etcd_hostname_2 timed out. (connect timeout=1.6666666666666667)'))",)
2022-10-27 20:18:49,577 INFO: Reconnection allowed, looking for another server.
2022-10-27 20:18:49,577 INFO: Retrying on http://etcd_hostname:2379
2022-10-27 20:18:49,755 INFO: Selected new etcd server http://etcd_hostname:2379
2022-10-27 20:18:55,924 INFO: no action. I am (patroni_hostname), the leader with the lock
👍