python-etcd
When all machines are down, the client will not refresh the cache until the one saved in base_uri comes back up
I found a bug when using the library with an etcd cluster and `allow_reconnect = True`.
When more than half of the machines in the cluster go down, the whole cluster stops responding (this is expected behavior).
In this case `api_execute` will try each machine from `_machines_cache`, and after all machines fail it will leave `_machines_cache` empty.
On every subsequent call it will try to send the request to `_base_uri`, and if that does not respond, it will try to get a new endpoint from the empty `_machines_cache`.
This way, until `_base_uri` comes back up, every request will fail, even if all the other machines are up.
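A hypothetical setup that hits this (the hostnames are placeholders): once `host-2` and `host-3` have each failed once, `_machines_cache` is drained and every later call only retries the machine stored in `_base_uri`.

```python
import etcd

# Placeholder hostnames; allow_reconnect=True enables the failover path
# described above.
client = etcd.Client(
    host=(('host-1', 2379), ('host-2', 2379), ('host-3', 2379)),
    allow_reconnect=True,
)

# While the cluster has quorum this fails over between the three hosts.
# After host-2 and host-3 have been popped from _machines_cache, only
# the machine saved in _base_uri is ever tried again.
client.read('/some/key')
```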
+1, we also had the same problem.
I guess the only possible way to solve this issue is to refresh `_machines_cache` periodically, for example every 5 minutes.
I'm going to prepare a pull request.
Maybe we could try to refresh `_machines_cache` if it is empty before a call (or just after `_base_uri` does not respond).
In both cases we need to handle the situation where refreshing fails (the whole cluster is down).
So, let's reason about what we expect to happen if the cluster goes down; please tell me if you think what I'm describing is acceptable behaviour:
- If the cluster doesn't respond, we raise exceptions
- if the cluster recovers, we want the library to be able to reconnect without needing a new client to be instantiated
The way to implement this is probably to separate the cache of machines we get from the server from their failure state. If all machines have failed, we want to raise an exception, but upon the next command, if all are marked as down, we want them all to be tested again.
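A minimal sketch of that separation, with hypothetical names (nothing here is the library's current implementation):

```python
import etcd


class MachinePool(object):
    """Keep the full machine list separate from per-machine failure marks."""

    def __init__(self, machines):
        self.machines = list(machines)   # what the cluster told us
        self.failed = set()              # what we have observed failing

    def mark_failed(self, machine):
        self.failed.add(machine)

    def pick(self):
        alive = [m for m in self.machines if m not in self.failed]
        if alive:
            return alive[0]
        # Everything is marked down: raise now, but clear the failure
        # marks so the next command re-tests every machine again.
        self.failed.clear()
        raise etcd.EtcdConnectionFailed("all machines are currently down")
```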
Does that seem reasonable, @pax0r @CyberDem0n?
@lavagetto, I guess you didn't get the problem correctly. I'll try to explain:
Let's imagine that we have a cluster of three nodes: node1, node2, node3.
- In the beginning `_base_uri` points to node1 and `_machines_cache` contains node2 and node3.
- After a while node2 dies and the cluster gets node4 as the replacement.
- Later the same happens with node3, and it is replaced by node5.
- At this point we have the cluster: node1, node4, node5.
- And finally node1 fails.
- We try to execute the request on node1 and it fails.
- We retry with node2 and node3 and give up.
- At this point `_machines_cache` is empty and `_base_uri` points to node3, which is not a member of the cluster any more.

Such a situation can't be recovered from...
The only way to work around it is to periodically refresh `_machines_cache`, even if no requests have failed.
In the current version, any situation in which `_machines_cache` is empty and `_base_uri` does not respond cannot be recovered from. We use a static server list, so in our case refreshing `_machines_cache` with the starting values is enough, but I agree with @CyberDem0n that it may not be enough in every case.
Ok so, I think I owe you all a design statement here:
python-etcd will never try to go beyond being a simple library. Doing things like scheduling periodic tasks is well beyond the scope of this library. The interval would also be arbitrary and would not work in every case.
It's pretty easy to create a derived class that will refresh the machines list every N seconds anyway.
I will NOT implement periodic refreshes of the machines cache.
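For example, a rough sketch of such a derived class (it assumes the public `machines` property can rebuild the cache; the `refresh_interval` keyword and `_last_refresh` attribute are made up here, and internal attribute names may differ between versions):

```python
import time
import etcd


class RefreshingClient(etcd.Client):
    """Sketch: re-read the cluster's machine list every
    `refresh_interval` seconds before issuing a request."""

    def __init__(self, *args, **kwargs):
        self.refresh_interval = kwargs.pop('refresh_interval', 300)
        super(RefreshingClient, self).__init__(*args, **kwargs)
        self._last_refresh = time.time()

    def api_execute(self, path, method, **kwargs):
        if time.time() - self._last_refresh > self.refresh_interval:
            try:
                # `machines` asks the cluster for its current members.
                self._machines_cache = self.machines
                self._last_refresh = time.time()
            except etcd.EtcdException:
                pass  # cluster unreachable right now; keep the old cache
        return super(RefreshingClient, self).api_execute(path, method, **kwargs)
```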
What we can do, and keep our implementation clean, is what follows:
- SRV-based DNS discovery will store the DNS TTL.
- If all servers in the machines cache do not respond and the TTL has expired, re-perform the DNS query.
- Re-perform your actions with the new DNS-provided list.
This will allow anyone to phase out servers in a non-harmful way, without the need to respawn your python-etcd instance in the meantime.
AFAIK, no other client library is even remotely as resilient to failures as python-etcd, so while I understand your concerns, I'd advise you to start using DNS-based discovery of the cluster; it really does help keep things running without the need for reconfigurations/restarts.
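A rough sketch of the TTL-aware re-query part, assuming dnspython; the class, the SRV record name, and the method names are placeholders, not anything that exists in the library today:

```python
import time
import dns.resolver


class SrvCache(object):
    """Sketch: cache the SRV answer together with its TTL, and only
    re-query DNS once the TTL has expired and all cached machines failed."""

    def __init__(self, domain, protocol='http'):
        self.domain = domain
        self.protocol = protocol
        self.machines = []
        self.expires_at = 0.0

    def refresh(self):
        # Placeholder record name; use whatever SRV record the cluster publishes.
        answer = dns.resolver.query('_etcd._tcp.' + self.domain, 'SRV')
        self.machines = ['%s://%s:%d' % (self.protocol,
                                         rr.target.to_text(omit_final_dot=True),
                                         rr.port)
                         for rr in answer]
        self.expires_at = time.time() + answer.rrset.ttl

    def machines_after_total_failure(self):
        # Called when every machine in the cache has failed: re-ask DNS
        # only if the stored TTL has run out, then retry with the new list.
        if time.time() >= self.expires_at:
            self.refresh()
        return self.machines
```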
For my use case I believe it would be enough to recreate `_machines_cache`. In the current implementation, when `_machines_cache` is empty and the whole cluster is up again except for one machine (the one set as `_base_uri`), the whole lib will not work until that one machine is up.
This is kind of confusing, as in the case of any other failure the lib will try to reconnect to other machines.
@pax0r your problem is with a very specific scenario where all your original machines fail except the one you initially connected to. I do get this is annoying, but I don't think we should ever refresh the machines list without a failure happening by default. I have a few ideas about this; I'll give it a shot when I have the time for it.
This isn't so specific: imagine a case where the whole cluster goes down (the client raises an error while trying to connect to the last node) and after that the cluster comes back up without that last machine. I have a project-specific hack in our project: just remember the initial machines list and use it when `_machines_cache` is empty and `_base_uri` is not responding, but this is kind of project-specific, as we have a static machines list; if a peer goes down, it will come back up with the same IP.
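For reference, the hack is essentially this kind of wrapper (a sketch with made-up class and attribute names; it only makes sense for a static member list, and the internals it touches may change between versions):

```python
import etcd


class StaticFallbackClient(etcd.Client):
    """Sketch: remember the initially configured hosts and fall back to
    them when the discovered machine cache has been exhausted."""

    def __init__(self, *args, **kwargs):
        super(StaticFallbackClient, self).__init__(*args, **kwargs)
        # The hosts known at construction time, kept as full URIs.
        self._initial_machines = list(self._machines_cache) + [self._base_uri]

    def api_execute(self, path, method, **kwargs):
        if not self._machines_cache:
            # Cache drained: start over from the static configuration.
            self._machines_cache = list(self._initial_machines)
            self._base_uri = self._machines_cache.pop(0)
        return super(StaticFallbackClient, self).api_execute(path, method, **kwargs)
```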
Howdy!
So, I have a question about this: let's say I started with 4 machines in `_machines_cache` (I'm going to call them 1-4 instead of 0-3 here since it might read easier). `machine_[1-3]` go down, so the code eventually pops `machine_4` from `_machines_cache`. `_machines_cache` is now empty, and we're running on `machine_4` as the `_base_uri`. But then `machine_4` goes down before `machine_[1-3]` come back up. All nodes in the cluster are down. `_machines_cache` is empty. Not much we can do here.
But maybe a minute or so later, let's say `machine_2` comes back online, then `machine_1`, then `machine_3`, but never `machine_4`. It looks like, the way things are written, the client will never recover from this, since it's using `machine_4` as the `_base_uri` to get an updated machines list. It's impossible to notice that `machine_[1-3]` are back online?
Now, I ran into a scenario today where the client thought there were no machines left in `_machines_cache`, and was basically caught in a tight loop yelling about `Machines cache is empty, no machines to try.` I'm not quite sure why it never recovered, since my `machine_4` did come back up. In my case, I resolved it by restarting my job, and everything was fine.
There is that code in `_wrap_request` where, if `some_request_failed`, it will do `self._machines_cache = self.machines`, so I would think that eventually, when `machine_4` came back up, it would have been fine.
I feel like I'm kind of babbling now, sorry! I'm just trying to figure out why things were unable to recover. I feel like if I give it a list of machines, all machines go down, and eventually one machine comes back up, it should be able to recover. I mean, I'm happy to write some code to deal with it, but I'm not sure if I'm doing something wrong, or if there's some room for improvement within the library. :)