python-etcd
When all machines are down, the client will not refresh the cache until the one saved in base_uri comes back up
I found a bug when using the library with an etcd cluster and `allow_reconnect = True`.
When more than half of the machines in the cluster go down, the whole cluster stops responding (this is expected behavior).
In this case `api_execute` will try each machine from `_machines_cache`, and after all machines fail it will leave `_machines_cache` empty.
On every subsequent call it will try to send the request to `_base_uri`, and if that does not respond, it will try to get a new endpoint from the empty `_machines_cache`.
This way, until `_base_uri` comes back up, every request will fail, even if all the other machines are up.
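A hypothetical setup that hits this (the hostnames are placeholders): once `host-2` and `host-3` have each failed once, `_machines_cache` is drained and every later call only retries the machine stored in `_base_uri`.

```python
import etcd

# Placeholder hostnames; allow_reconnect=True enables the failover path
# described above.
client = etcd.Client(
    host=(('host-1', 2379), ('host-2', 2379), ('host-3', 2379)),
    allow_reconnect=True,
)

# While the cluster has quorum this fails over between the three hosts.
# After host-2 and host-3 have been popped from _machines_cache, only
# the machine saved in _base_uri is ever tried again.
client.read('/some/key')
```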
+1, we also had the same problem.
I guess the only possible way to solve this issue is to refresh `_machines_cache` periodically, for example every 5 minutes.
I'm going to prepare a pull request.
Maybe we could try to refresh `_machines_cache` if it is empty before a call (or just after `_base_uri` does not respond).
In both cases we need to handle the situation where refreshing fails (the whole cluster is down).
So, let's reason about what we expect to happen if the cluster goes down; please tell me if you think what I'm describing is acceptable behaviour:
- If the cluster doesn't respond, we raise exceptions
- if the cluster recovers, we want the library to be able to reconnect without needing a new client to be instantiated
The way to implement this is probably to separate the cache of machines we get from the server from their failure state. If all machines have failed, we want to raise an exception, but upon the next command, if all are marked as down, we want them all to be tested again.
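A minimal sketch of that separation, with hypothetical names (nothing here is the library's current implementation):

```python
import etcd


class MachinePool(object):
    """Keep the full machine list separate from per-machine failure marks."""

    def __init__(self, machines):
        self.machines = list(machines)   # what the cluster told us
        self.failed = set()              # what we have observed failing

    def mark_failed(self, machine):
        self.failed.add(machine)

    def pick(self):
        alive = [m for m in self.machines if m not in self.failed]
        if alive:
            return alive[0]
        # Everything is marked down: raise now, but clear the failure
        # marks so the next command re-tests every machine again.
        self.failed.clear()
        raise etcd.EtcdConnectionFailed("all machines are currently down")
```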
Does that seem reasonable, @pax0r @CyberDem0n?
@lavagetto, I guess you didn't get the problem correctly. I'll try to explain:
Let's imagine that we have a cluster of three nodes: node1, node2, node3.
- In the beginning `_base_uri` points to node1 and `_machines_cache` contains node2 and node3.
- After a while node2 dies and the cluster gets node4 as the replacement.
- Later the same happens with node3, and it is replaced by node5.
- At this point we have the cluster: node1, node4, node5.
- And finally node1 fails.
- We try to execute the request on node1 and it fails.
- We retry with node2 and node3 and give up.
- At this point `_machines_cache` is empty and `_base_uri` points to node3, which is not a member of the cluster any more.

Such a situation can't be recovered from...
The only way to work around it is to periodically refresh `_machines_cache`, even if no requests have failed.
In the current version, any situation in which `_machines_cache` is empty and `_base_uri` does not respond cannot be recovered from. We use a static server list, so in our case refreshing `_machines_cache` with the starting values is enough, but I agree with @CyberDem0n that it may not be enough in every case.
Ok so, I think I owe you all a design statement here:
python-etcd will never try to go beyond being a simple library. Doing things like scheduling periodic tasks is well beyond the scope of this library. The interval would also be arbitrary and would not work in every case.
It's pretty easy to create a derived class that will refresh the machines list every N seconds anyway.
I will NOT implement periodic refreshes of the machines cache.
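For example, a rough sketch of such a derived class (it assumes the public `machines` property can rebuild the cache; the `refresh_interval` keyword and `_last_refresh` attribute are made up here, and internal attribute names may differ between versions):

```python
import time
import etcd


class RefreshingClient(etcd.Client):
    """Sketch: re-read the cluster's machine list every
    `refresh_interval` seconds before issuing a request."""

    def __init__(self, *args, **kwargs):
        self.refresh_interval = kwargs.pop('refresh_interval', 300)
        super(RefreshingClient, self).__init__(*args, **kwargs)
        self._last_refresh = time.time()

    def api_execute(self, path, method, **kwargs):
        if time.time() - self._last_refresh > self.refresh_interval:
            try:
                # `machines` asks the cluster for its current members.
                self._machines_cache = self.machines
                self._last_refresh = time.time()
            except etcd.EtcdException:
                pass  # cluster unreachable right now; keep the old cache
        return super(RefreshingClient, self).api_execute(path, method, **kwargs)
```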
What we can do, and keep our implementation clean, is what follows:
- SRV-based DNS discovery will store the DNS TTL.
- If all servers in the machines cache do not respond and the TTL has expired, re-perform the DNS query.
- Re-perform your actions with the new DNS-provided list.
This will allow anyone to phase out servers in a non-harmful way, without the need to respawn your python-etcd instance in the meantime.
AFAIK, no other client library is even remotely as resilient to failures as python-etcd, so while I understand your concerns, I'd advise you to start using DNS-based discovery of the cluster; it really does help keep things running without the need for reconfigurations/restarts.
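A rough sketch of the TTL-aware re-query part, assuming dnspython; the class, the SRV record name, and the method names are placeholders, not anything that exists in the library today:

```python
import time
import dns.resolver


class SrvCache(object):
    """Sketch: cache the SRV answer together with its TTL, and only
    re-query DNS once the TTL has expired and all cached machines failed."""

    def __init__(self, domain, protocol='http'):
        self.domain = domain
        self.protocol = protocol
        self.machines = []
        self.expires_at = 0.0

    def refresh(self):
        # Placeholder record name; use whatever SRV record the cluster publishes.
        answer = dns.resolver.query('_etcd._tcp.' + self.domain, 'SRV')
        self.machines = ['%s://%s:%d' % (self.protocol,
                                         rr.target.to_text(omit_final_dot=True),
                                         rr.port)
                         for rr in answer]
        self.expires_at = time.time() + answer.rrset.ttl

    def machines_after_total_failure(self):
        # Called when every machine in the cache has failed: re-ask DNS
        # only if the stored TTL has run out, then retry with the new list.
        if time.time() >= self.expires_at:
            self.refresh()
        return self.machines
```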
For my use case I believe it would be enough to recreate `_machines_cache`. In the current implementation, when `_machines_cache` is empty and the whole cluster is up again except for one machine (the one set as `_base_uri`), the whole lib will not work until that one machine is up.
This is kind of confusing, as in the case of any other failure the lib will try to reconnect to other machines.
@pax0r your problem is with a very specific scenario where all your original machines fail except the one you initially connected to. I do get this is annoying, but I don't think we should ever refresh the machines list without a failure happening by default. I have a few ideas about this; I'll give it a shot when I have the time for it.
This isn't so specific: imagine a case where the whole cluster goes down (the client raises an error while trying to connect to the last node) and after that the cluster comes back up without that last machine. I have a project-specific hack in our project: just remember the initial machines list and use it when `_machines_cache` is empty and `_base_uri` is not responding, but this is kind of project-specific, as we have a static machines list; if a peer goes down, it will come back up with the same IP.
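For reference, the hack is essentially this kind of wrapper (a sketch with made-up class and attribute names; it only makes sense for a static member list, and the internals it touches may change between versions):

```python
import etcd


class StaticFallbackClient(etcd.Client):
    """Sketch: remember the initially configured hosts and fall back to
    them when the discovered machine cache has been exhausted."""

    def __init__(self, *args, **kwargs):
        super(StaticFallbackClient, self).__init__(*args, **kwargs)
        # The hosts known at construction time, kept as full URIs.
        self._initial_machines = list(self._machines_cache) + [self._base_uri]

    def api_execute(self, path, method, **kwargs):
        if not self._machines_cache:
            # Cache drained: start over from the static configuration.
            self._machines_cache = list(self._initial_machines)
            self._base_uri = self._machines_cache.pop(0)
        return super(StaticFallbackClient, self).api_execute(path, method, **kwargs)
```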
Howdy!
So, I have a question about this: let's say I started with 4 machines in `_machines_cache` (I'm going to call them 1-4 instead of 0-3 here since it might read easier). `machine_[1-3]` go down, so the code eventually pops `machine_4` from `_machines_cache`. `_machines_cache` is now empty, and we're running on `machine_4` as the `_base_uri`. But then `machine_4` goes down before `machine_[1-3]` come back up. All nodes in the cluster are down. `_machines_cache` is empty. Not much we can do here.
But maybe a minute or so later, let's say `machine_2` comes back online, then `machine_1`, then `machine_3`, but never `machine_4`. It looks like, the way things are written, the client will never recover from this, since it's using `machine_4` as the `_base_uri` to get an updated machines list. It's impossible to notice that `machine_[1-3]` are back online?
Now, I ran into a scenario today where the client thought there were no machines left in `_machines_cache`, and was basically caught in a tight loop yelling about `Machines cache is empty, no machines to try.` I'm not quite sure why it never recovered, since my `machine_4` did come back up. In my case, I resolved it by restarting my job, and everything was fine.
There is that code in `_wrap_request` where, if `some_request_failed`, it will do `self._machines_cache = self.machines`, so I would think that eventually, when `machine_4` came back up, it would have been fine.
I feel like I'm kind of babbling now, sorry! I'm just trying to figure out why things were unable to recover. I feel like if I give it a list of machines, all machines go down, and eventually one machine comes back up, it should be able to recover. I mean, I'm happy to write some code to deal with it, but I'm not sure if I'm doing something wrong, or if there's some room for improvement within the library. :)