
Handle etcd hiccups more gracefully

Open mbanck opened this issue 4 years ago • 3 comments

(maybe consul has the same behaviour)

Right now, if vip-manager cannot get a positive answer from etcd within one pass of its main loop (the configured interval), it deconfigures the VIP, resulting in application downtime. This also happens during etcd leader elections, possibly because one etcd node is too slow to respond (while the PostgreSQL leader is perfectly fine and available).

The admin could increase the interval to make vip-manager more resilient against etcd issues, but that also increases the time it takes for vip-manager to pick up Patroni failovers/switchovers.

As a practical (contrived) example, if I have one external etcd server, and a 3-node vip-manager-1.0.1/patroni-2.0.1 cluster with interval, loop_wait and retry_timeout all set to 10, and I SIGSTOP the etcd server, then I see the following:

Dez 07 21:41:07 pg1 vip-manager[27203]: 2020/12/07 21:41:07 IP address 10.0.0.2/24 state is true, desired true
Dez 07 21:41:08 pg1 patroni@13-test3[26334]: 2020-12-07 21:41:08,088 INFO: Lock owner: pg1; I am pg1
Dez 07 21:41:08 pg1 patroni@13-test3[26334]: 2020-12-07 21:41:08,099 INFO: no action.  i am the leader with the lock
Dez 07 21:41:27 pg1 vip-manager[27203]: 2020/12/07 21:41:27 IP address 10.0.0.2/24 state is true, desired true
Dez 07 21:41:28 pg1 patroni@13-test3[26334]: 2020-12-07 21:41:28,088 INFO: Lock owner: pg1; I am pg1
Dez 07 21:41:28 pg1 patroni@13-test3[26334]: 2020-12-07 21:41:28,099 INFO: no action.  i am the leader with the lock

Everything is fine so far; now I SIGSTOP etcd.

Dez 07 21:41:34 pg1 vip-manager[27203]: 2020/12/07 21:41:34 etcd error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://192.168.122.189:2379 exceeded header timeout
Dez 07 21:41:34 pg1 vip-manager[27203]: 2020/12/07 21:41:34 IP address 10.0.0.2/24 state is true, desired false
Dez 07 21:41:34 pg1 vip-manager[27203]: 2020/12/07 21:41:34 Removing address 10.0.0.2/24 on eth0

6 seconds later, vip-manager's loop wakes up, sees that etcd is down and immediately deconfigures the VIP. (EDIT: I wonder why that did not happen at the next scheduled interval tick, Dez 07 21:41:37?)

Dez 07 21:41:34 pg1 vip-manager[27203]: 2020/12/07 21:41:34 IP address 10.0.0.2/24 state is false, desired false
Dez 07 21:41:37 pg1 vip-manager[27203]: 2020/12/07 21:41:37 IP address 10.0.0.2/24 state is false, desired false
Dez 07 21:41:41 pg1 patroni@13-test3[26334]: 2020-12-07 21:41:41,422 WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='192.168.122.189', port=2379): Read timed out. (read timeout=3.333155213321637)")': /v2/keys/postgresql-common/13-test3/?recursive=true

Now Patroni realizes something is wrong (though I'm not sure why it took another 3 seconds from the scheduled ping timestamp of Dez 07 21:41:38 - maybe Patroni has much longer timeouts to wait for an etcd response than vip-manager?).

Dez 07 21:41:44 pg1 patroni@13-test3[26334]: 2020-12-07 21:41:44,759 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='192.168.122.189', port=2379): Read timed out. (read timeout=3.3322158453423376)")': /v2/keys/postgresql-common/13-test3/?recursive=true
Dez 07 21:41:45 pg1 vip-manager[27203]: 2020/12/07 21:41:45 etcd error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://192.168.122.189:2379 exceeded header timeout
Dez 07 21:41:47 pg1 vip-manager[27203]: 2020/12/07 21:41:47 IP address 10.0.0.2/24 state is false, desired false
Dez 07 21:41:48 pg1 patroni@13-test3[26334]: 2020-12-07 21:41:48,096 ERROR: Request to server http://192.168.122.189:2379 failed: MaxRetryError('HTTPConnectionPool(host=\'192.168.122.189\', port=2379): Max retries exceeded with url: /v2/keys/postgresql-common/13-test3/?recursive=true (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'192.168.122.189\', port=2379): Read timed out. (read timeout=3.3330249423355176)"))')
Dez 07 21:41:48 pg1 patroni@13-test3[26334]: 2020-12-07 21:41:48,096 INFO: Reconnection allowed, looking for another server.
Dez 07 21:41:48 pg1 patroni@13-test3[26334]: 2020-12-07 21:41:48,097 ERROR: get_cluster
Dez 07 21:41:48 pg1 patroni@13-test3[26334]: Traceback (most recent call last):
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:   File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 590, in _load_cluster
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:     result = self.retry(self._client.read, self.client_path(''), recursive=True)
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:   File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 443, in retry
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:     return retry(*args, **kwargs)
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:   File "/usr/lib/python3/dist-packages/patroni/utils.py", line 333, in __call__
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:     return func(*args, **kwargs)
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:   File "/usr/lib/python3/dist-packages/etcd/client.py", line 595, in read
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:     response = self.api_execute(
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:   File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 271, in api_execute
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:     raise ex
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:   File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 255, in api_execute
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:     response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:   File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 232, in _do_http_request
Dez 07 21:41:48 pg1 patroni@13-test3[26334]:     raise etcd.EtcdConnectionFailed('No more machines in the cluster')
Dez 07 21:41:48 pg1 patroni@13-test3[26334]: etcd.EtcdConnectionFailed: No more machines in the cluster
Dez 07 21:41:48 pg1 patroni@13-test3[26334]: 2020-12-07 21:41:48,098 ERROR: Error communicating with DCS
Dez 07 21:41:48 pg1 patroni@13-test3[26334]: 2020-12-07 21:41:48,237 INFO: closed patroni connection to the postgresql cluster
Dez 07 21:41:48 pg1 patroni@13-test3[26334]: 2020-12-07 21:41:48,501 INFO: postmaster pid=27452
Dez 07 21:41:48 pg1 patroni@13-test3[27453]: /var/run/postgresql/:5435 - no response
Dez 07 21:41:48 pg1 patroni@13-test3[27452]: 2020-12-07 21:41:48.510 UTC [27452] LOG:  redirecting log output to logging collector process
Dez 07 21:41:48 pg1 patroni@13-test3[27452]: 2020-12-07 21:41:48.510 UTC [27452] HINT:  Future log output will appear in directory "/var/log/postgresql".
Dez 07 21:41:49 pg1 patroni@13-test3[27459]: /var/run/postgresql/:5435 - accepting connections
Dez 07 21:41:49 pg1 patroni@13-test3[27461]: /var/run/postgresql/:5435 - accepting connections
Dez 07 21:41:49 pg1 patroni@13-test3[26334]: 2020-12-07 21:41:49,575 INFO: demoted self because DCS is not accessible and i was a leader

Only now (around 10 seconds after vip-manager) does Patroni demote itself. If I SIGCONT etcd between 10 and 20 seconds after the SIGSTOP, Patroni marches on without user-visible changes, while vip-manager takes down the VIP.

So I think it would be good to have a tight main-loop interval in order to notice leader key changes quickly, combined with a second retry interval that decides whether the DCS is down or not. I assumed the retry-after and retry-num configuration parameters were meant for that, but that does not seem to be the case (see #68).
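To put made-up numbers on that idea: with interval set to 1 second and a separate DCS-failure grace period of, say, 10 seconds (roughly Patroni's retry_timeout), a switchover would still be picked up within about a second, while an etcd outage shorter than 10 seconds would leave the VIP untouched. The numbers are purely illustrative.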

mbanck avatar Dec 07 '20 21:12 mbanck

OK, there are quite a few things to go through here...

  1. The time between vip-manager log entries is not in the neighbourhood of the interval configured for vip-manager:
  • This is probably caused by the fact that the "state is true, desired true" message is only printed after verifying that the VIP is still registered locally. Not only can this check itself take some time, but the checker and the manager parts communicate asynchronously through a channel, and the manager runs on an entirely different loop cycle, which can also contribute to the discrepancy (see the sketch after this list). The etcd error is logged almost at the beginning of the checker loop, while "state is true, desired true" is logged at the end of the manager's VIP verification.
  2. Why it took another 3 seconds from the scheduled ping timestamp of Dez 07 21:41:38 - maybe Patroni has much longer timeouts to wait for an etcd response than vip-manager:
  • The answer is right in the log line: Patroni waited for read timeout=3.333155213321637 seconds, whereas vip-manager is hard-coded to a timeout of 1 second.
  3. vip-manager deconfigures the VIP before Patroni stops the database:
  • At the moment vip-manager employs no "hysteresis" to bridge disruptions within the DCS, or to give Patroni time to shut down the database and terminate any still-open connections. That was never really the design goal of vip-manager - it was primarily created to cover switchovers/failovers as quickly as possible. I can only agree that your example is contrived - I personally consider a continuously healthy DCS cluster an invariant for operating Patroni and thus vip-manager. I've never noticed such issues with regular leader changes within the etcd cluster (where all etcd members keep running and only the leader role is transferred). But I can imagine that such issues occur when the cluster member used by vip-manager dies and the other members fail to provide an answer within the one second that we allow the k/v-store get to take.
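To make the checker/manager split from point 1 more concrete, here is a heavily simplified sketch. It is not vip-manager's actual code: the endpoint, key name, leader value and the use of the etcd v3 client are illustrative assumptions, and the 10 second interval / 1 second request timeout simply mirror the numbers from this issue.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://192.168.122.189:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	desired := make(chan bool, 1) // checker -> manager, asynchronous hand-off

	// Checker: every interval, ask the DCS who currently holds the leader key.
	// Each request gets its own short timeout (reportedly hard-coded to
	// 1 second in vip-manager); any error is mapped to "desired: false".
	go func() {
		ticker := time.NewTicker(10 * time.Second) // interval: 10
		defer ticker.Stop()
		for range ticker.C {
			ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
			resp, err := cli.Get(ctx, "/service/13-test3/leader") // hypothetical key
			cancel()
			if err != nil {
				log.Printf("etcd error: %v", err)
				desired <- false
				continue
			}
			isLeader := len(resp.Kvs) == 1 && string(resp.Kvs[0].Value) == "pg1"
			desired <- isLeader
		}
	}()

	// Manager: consumes desired-state updates asynchronously, reconciles the
	// actual VIP state against them, and logs only after that check, so its
	// log lines do not necessarily line up with the checker's ticks.
	state := false
	for want := range desired {
		if want != state {
			// (de)configure the VIP here; omitted in this sketch
			state = want
		}
		log.Printf("IP address 10.0.0.2/24 state is %v, desired %v", state, want)
	}
}
```

In this sketch the manager only acts and logs when a message arrives, which is why its timestamps need not match the checker's interval exactly; the real code differs in detail, but the asynchronous hand-off is the point.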

But I can understand some of your pain, as I've witnessed a few cases where vip-manager continuously deregistered and then re-registered the VIP because the leader key's TTL expired "too fast" in the k/v store. Moments later, Patroni detected this and set the leader key again. Because Patroni instantly succeeds in setting the key, it does not demote the primary. This behaviour can occur when the clocks of the etcd cluster members are not adequately in sync - and keeping them in sync is, again, something I consider an invariant for the successful operation of Patroni and vip-manager. However, it could still be useful to cover such cases a little more gracefully.

Here are some ideas to cover some of these issues:

  • Make vip-manager log the desired state as soon as it is picked up (in the checker part), not only once it reaches the manager part, which runs asynchronously.
  • Add a reasonably quick retry option to cover intermittent connectivity issues. For this, I deem the retry_num and retry_after parameters to be in the right ballpark and would like to reuse them. (ad #68)
  • Make the timeout for DCS interactions configurable.
  • Add some hysteresis (see the sketch after this list) to cover cases in which the leader key goes away (the primary loses its lock with no replica stepping up, or clock-sync issues lead to premature key expiry) or in which the DCS is still unable to serve an answer within retry_num*retry_after. Setting this hysteresis to the TTL of the leader key used by Patroni should cover all cases. However, when the leader key actually changes (switchover/failover), we still need to react as fast as possible, without applying the hysteresis. There might be a corner case hidden here: the host running the database primary could lose connectivity to all other hosts (including etcd) on its primary interface while still being reachable through the VIP on a secondary interface. In this (contrived) corner case, the VIP could then be in use by two different machines for up to the hysteresis timeout.
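For the retry and hysteresis items, a minimal sketch of the direction I have in mind - the names (getLeader, retryNum, retryAfter, keyTTL) are placeholders, not existing vip-manager code:

```go
package vipsketch

import (
	"context"
	"errors"
	"time"
)

// getLeader is a hypothetical stand-in for the real k/v lookup; it returns the
// current holder of the leader key, or an error if the DCS did not answer.
func getLeader(ctx context.Context) (string, error) {
	return "", errors.New("not implemented in this sketch")
}

// getLeaderWithRetry retries a failed DCS request, waiting retryAfter between
// attempts, to bridge intermittent connectivity issues (retry_num/retry_after).
func getLeaderWithRetry(retryNum int, retryAfter time.Duration) (string, error) {
	var lastErr error
	for attempt := 0; attempt <= retryNum; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		leader, err := getLeader(ctx)
		cancel()
		if err == nil {
			return leader, nil
		}
		lastErr = err
		if attempt < retryNum {
			time.Sleep(retryAfter)
		}
	}
	return "", lastErr
}

// desiredVIPState adds the hysteresis on top: a definite answer from the DCS
// is acted on immediately (so switchovers/failovers stay fast), while an
// unreachable DCS only leads to dropping the VIP once the outage has lasted
// longer than the leader key's TTL. lastConfirmed is the time of the last
// successful lookup that reported this node as the leader.
func desiredVIPState(leader string, dcsErr error, nodeName string,
	lastConfirmed time.Time, keyTTL time.Duration) bool {
	if dcsErr == nil {
		return leader == nodeName
	}
	return time.Since(lastConfirmed) < keyTTL
}
```

The caller would refresh lastConfirmed on every successful lookup that reports this node as the leader, so an actual leader change still takes effect immediately, while an etcd blip shorter than the key TTL never touches the VIP.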

I look forward to your feedback, though I can't promise any time frame for when I'll be able to tackle this. I think most of the work needs to go into carefully considering the implications of introducing hysteresis. I would be fine with implementing the first three items and leaving the hysteresis as a nice-to-have. I think that would be a decent middle ground that could cover most issues.

markwort avatar Dec 15 '20 21:12 markwort

Having similar issues.

May 24 09:46:21 dm1-b patroni: 2022-05-24 09:46:21,311 INFO: Lock owner: postgresql2; I am postgresql2
May 24 09:46:21 dm1-b patroni: 2022-05-24 09:46:21,319 INFO: no action. i am the leader with the lock
May 24 09:46:25 dm1-b bash: 2022/05/24 09:46:25 etcd error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.5.25.87:2379 exceeded header timeout
May 24 09:46:25 dm1-b bash: 2022/05/24 09:46:25 IP address 10.5.25.80/24 state is true, desired false
May 24 09:46:25 dm1-b bash: 2022/05/24 09:46:25 Removing address 10.5.25.80/24 on ens192
May 24 09:46:25 dm1-b bash: 2022/05/24 09:46:25 IP address 10.5.25.80/24 state is false, desired false
May 24 09:46:27 dm1-b bash: 2022/05/24 09:46:27 etcd error: client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint http://10.5.25.87:2379 exceeded header timeout
May 24 09:46:28 dm1-b bash: 2022/05/24 09:46:28 IP address 10.5.25.80/24 state is false, desired true
May 24 09:46:28 dm1-b bash: 2022/05/24 09:46:28 Configuring address 10.5.25.80/24 on ens192

Sometimes it is clear that etcd cannot keep up (probably slow VM disk I/O), for example:

May 24 06:33:31 dm1-b etcd: sync duration of 2.463669091s, expected less than 1s

But it is not always visible like this (see the log above). In some cases I also see such sync warnings on physical NVMe drives (which should not have I/O issues).

I cannot fully pinpoint the disk I/O issue - or maybe I can play with some timeouts to handle such cases better?

etcd settings: --heartbeat-interval 1000 --election-timeout 5000

Patroni settings:

dcs:
  ttl: 30
  loop_wait: 10
  retry_timeout: 10
  maximum_lag_on_failover: 1048576
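(If I remember the Patroni documentation correctly, it recommends ttl >= loop_wait + 2 * retry_timeout; with these values that is 10 + 2 * 10 = 30, so the ttl sits exactly at that minimum.)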

sebek72 avatar May 24 '22 11:05 sebek72

📅 This issue has been automatically marked as stale because of a lack of recent activity. It will be closed if no further activity occurs. ♻️ If you think there is new information allowing us to address the issue, please reopen it and provide us with updated details. 🤝 Thank you for your contributions.

github-actions[bot] avatar Aug 10 '22 11:08 github-actions[bot]