eetcd
Connection to etcd cluster isn't resilient
Steps to reproduce:
- Connect to etcd;
- Interrupt the connectivity and wait for all connections to time out/disconnect;
- Re-establish connectivity.
Expected: eetcd reconnects to the etcd cluster.
Actual behavior: it doesn't reconnect. Instead, it loses the configuration initially passed to the eetcd:open function.
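For context, the connection in question is opened roughly like this (Elixir syntax; the connection name, endpoints and exact option format are illustrative assumptions, with the option names taken from this report):

```elixir
# Illustrative only: `mode` and `auto_sync_interval_ms` are the options
# discussed in this issue; everything else here is made up.
endpoints = ["etcd-0:2379", "etcd-1:2379", "etcd-2:2379"]

{:ok, _pid} =
  :eetcd.open(
    :my_etcd,
    endpoints,
    mode: :connect_all,            # default connect mode
    auto_sync_interval_ms: 30_000  # periodic member-list sync
  )
```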
If you use the auto_sync_interval_ms option, the following log message is generated repeatedly:
[warning] <Name> has no active connections to etcd cluster, cannot get member list
[warning] <Name> has no active connections to etcd cluster, cannot get member list
...
At the client app, the error :eetcd_conn_unavailable is constantly returned.
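Every request then fails in the same way; for illustration (the eetcd_kv call shape is an assumption, the error atom is the one reported above):

```elixir
# Sketch of what every client call looks like once all connections are gone.
case :eetcd_kv.get(:my_etcd, "some-key") do
  {:ok, response} ->
    response

  {:error, :eetcd_conn_unavailable} ->
    # Waiting does not help; eetcd never re-establishes the connections
    # until it is manually closed and reopened.
    :unavailable
end
```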
It can be a reasonable design choice for a client library to not perform connection recovery, in particular in Erlang, and let the developer do that.
I am not sure if that's the case here. @zhongwencool would you be interested in having this feature?
Attempted to address the problem in https://github.com/zhongwencool/eetcd/pull/50.
It can be a reasonable design choice for a client library to not perform connection recovery
Yes, it can, but in that case there must be a way for the client to discover that the connection is down, and a function to reconnect must be provided, which is not the case here. It would be worth comparing to the Go etcd/clientv3 implementation: it keeps reconnecting in the background.
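In the absence of that, one way to recover today is a hand-rolled watchdog along these lines (a sketch only: the probe call and the open/close shapes are assumptions, not eetcd's documented API, and the original configuration has to be kept by the caller because eetcd forgets it):

```elixir
defmodule EtcdWatchdog do
  @moduledoc """
  Sketch of a periodic "probe and reopen" process. The probe call and the
  open/close arities are assumptions.
  """
  use GenServer

  @check_interval 10_000

  def start_link(config), do: GenServer.start_link(__MODULE__, config)

  @impl true
  def init(%{name: _, endpoints: _, options: _} = config) do
    schedule_check()
    {:ok, config}
  end

  @impl true
  def handle_info(:check, %{name: name, endpoints: endpoints, options: options} = state) do
    case :eetcd_kv.get(name, "health-probe-key") do
      {:error, :eetcd_conn_unavailable} ->
        # All connections are gone and eetcd won't recover by itself, so
        # close and reopen with the configuration we saved ourselves.
        :eetcd.close(name)
        :eetcd.open(name, endpoints, options)

      _other ->
        :ok
    end

    schedule_check()
    {:noreply, state}
  end

  defp schedule_check, do: Process.send_after(self(), :check, @check_interval)
end
```

It would be started under the app's supervision tree with something like EtcdWatchdog.start_link(%{name: :my_etcd, endpoints: endpoints, options: [mode: :connect_all]}).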
Anyway, the PR solves the problem by reusing the health checks to refill the freeze_conns attribute from members_list and thus re-establish the connections. But this means the client has to enable the auto_sync_interval_ms option. Also, if a new node is discovered with the auto_sync_interval_ms option, the log output doesn't get flooded with repeated entries like:
[notice] [msg: 'Got removed endpoints', removed_endpoints: [{Host, Port}]]
[notice] [msg: 'Got removed endpoints', removed_endpoints: [{Host, Port}]]
...
I couldn't understand the design decision behind erasing active connections and freeze connections at the same time. This means that when both are empty we get a dangling eetcd connection... Is there a reason for that?
Confirming this issue. Our Etcd cluster ran out of memory, so we had to restart it. The app (which uses this library) did not recover on its own. We had to restart the app pods to get it working again.
I'm not sure how the gRPC stuff works, but most network clients that I write use the awesome Connection module/behaviour, which makes it super easy to manage reconnections for :gen_tcp sockets and whatnot.
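For comparison, a bare-bones client using that behaviour looks something like this (sketch only; it needs the connection hex package, and the host, port and backoff values are placeholders):

```elixir
defmodule ReconnectingTCP do
  @moduledoc """
  Minimal sketch of the Connection behaviour reconnecting a plain
  :gen_tcp socket with backoff.
  """
  use Connection

  def start_link({host, port}) do
    Connection.start_link(__MODULE__, {host, port})
  end

  # Simple synchronous send over the managed socket.
  def send_data(conn, data), do: Connection.call(conn, {:send, data})

  @impl true
  def init({host, port}) do
    # Connect asynchronously right after init.
    {:connect, :init, %{host: host, port: port, sock: nil}}
  end

  @impl true
  def connect(_info, %{host: host, port: port} = state) do
    case :gen_tcp.connect(String.to_charlist(host), port, [:binary, active: false], 5_000) do
      {:ok, sock} -> {:ok, %{state | sock: sock}}
      {:error, _reason} -> {:backoff, 1_000, state}  # retry in one second
    end
  end

  @impl true
  def disconnect(_info, %{sock: sock} = state) do
    if sock, do: :gen_tcp.close(sock)
    # Ask the behaviour to call connect/2 again.
    {:connect, :reconnect, %{state | sock: nil}}
  end

  @impl true
  def handle_call({:send, _data}, _from, %{sock: nil} = state) do
    {:reply, {:error, :closed}, state}
  end

  def handle_call({:send, data}, _from, %{sock: sock} = state) do
    case :gen_tcp.send(sock, data) do
      :ok -> {:reply, :ok, state}
      {:error, _} = error -> {:disconnect, error, error, state}
    end
  end
end
```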
@cjbottaro what version of this library do you use? It should have been addressed by https://github.com/zhongwencool/eetcd/commit/40acb27a6ebfb995975de996adf72be61d603da0 which shipped in 0.3.5.
mix deps reports 0.3.5:
...
* eetcd 0.3.5 (Hex package) (rebar3)
  locked at 0.3.5 (eetcd) af9d5158
Unfortunately, I didn't save the app logs when we restarted the Etcd cluster, but there were definitely messages from eetcd.
There will be messages from eetcd, since connection loss will result in either an error somewhere or just an error log message. The question is how we reproduce the failure to recover.
@cjbottaro can you put together a small app that demonstrates the non-recovery behavior and how your application supervises whatever processes use eetcd?
It's easy to reproduce. The steps were described at the top of this issue.
To interrupt the connectivity of the client running eetcd, just turn off the network interface, turn off the router, etc., and keep it that way until you see that all nodes have been disconnected. Then restore connectivity and you'll see that eetcd does not reconnect.
The moment you disable connectivity, given that you have enabled the keep-alive mechanism, the connections to the etcd nodes drop one by one. If you restore connectivity before all of them have dropped, the remaining connections are maintained and the others eventually reconnect (as the etcd nodes announce the cluster).
But the problem happens when all connections to the etcd nodes drop: the eetcd library enters a limbo state, and the only solution is to reconnect manually.
It isn't a reasonable design to reconnect by inspecting each request: imagine an app that reads from and writes to etcd. You would need to handle each error condition in every place you interact with eetcd and attempt a reconnection...
Also, what would be the purpose of the keep-alive if it neither keeps connections alive nor informs the client that the connections have dropped?
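Concretely, without library-level recovery every call site ends up wrapped in something like this (hypothetical helper; reopen/1 is made up and would also have to remember the original endpoints and options):

```elixir
defmodule EtcdRetry do
  @moduledoc """
  Hypothetical wrapper every read/write would need if recovery were left
  entirely to the caller.
  """

  def with_reconnect(conn, fun, retries \\ 1) do
    case fun.(conn) do
      {:error, :eetcd_conn_unavailable} when retries > 0 ->
        reopen(conn)
        with_reconnect(conn, fun, retries - 1)

      other ->
        other
    end
  end

  # Made-up helper: close and reopen with the original endpoints/options,
  # which the caller has to keep track of because eetcd loses them.
  defp reopen(_conn), do: :ok
end

# Every single interaction with etcd now needs the wrapper:
EtcdRetry.with_reconnect(:my_etcd, fn conn -> :eetcd_kv.get(conn, "some-key") end)
```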
Yup, super easy to reproduce with Docker. In our case Kubernetes... we just deleted our cluster:
00:31:53.147 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:53.159 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:53.276 [warning] Etcd failed to connect [etcd-2:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:53.313 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:53.318 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:53.326 [warning] Etcd failed to connect [etcd-2:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:53.487 [warning] Etcd failed to connect [etcd-1:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:54.488 [warning] Etcd failed to connect [etcd-2:2379] by <Gun Down> :timeout
00:31:54.493 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:54.496 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:55.497 [warning] Etcd failed to connect [etcd-2:2379] by <Gun Down> :timeout
00:31:56.498 [warning] Etcd failed to connect [etcd-1:2379] by <Gun Down> :timeout
00:31:58.300 [warning] Etcd failed to connect [etcd-1:2379] by <Gun Down> :timeout
00:31:59.301 [warning] Etcd failed to connect [etcd-2:2379] by <Gun Down> :timeout
00:31:59.305 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:32:00.905 [info] ETCD(Etcd, #PID<0.5673.0>)'s connections are ready.
That last log message is super suspect, considering the cluster (including DNS) doesn't exist anymore at all.
Gentle bump @michaelklishin @zhongwencool.
This has not been forgotten but I personally won't be able to look into this in the near term.
It can be a reasonable design choice for a client library to not perform connection recovery, in particular in Erlang, and let the developer do that.
I am not sure if that's the case here. @zhongwencool would you be interested in having this feature?
@michaelklishin I suppose it isn't a reasonable design choice. This problem only occurs when the connect mode is set to connect_all, which is the default. The connection recovery mechanism works if you set the connect mode to random.
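So, until this is fixed, switching the connect mode is a workaround (the exact option format is an assumption, as noted above):

```elixir
# Workaround: connection recovery works in `random` mode, so open the
# connection with that instead of the default `connect_all`.
{:ok, _pid} =
  :eetcd.open(:my_etcd, ["etcd-0:2379", "etcd-1:2379", "etcd-2:2379"], mode: :random)
```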
Good to know that it only affects one mode, thanks for digging in.
Good to know that it only affects one mode, thanks for digging in.
@michaelklishin I fixed it in #54.
0.3.6 is ready now.
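For Mix users, picking up the fix is just a dependency bump (the version constraint below is only an example):

```elixir
# mix.exs
defp deps do
  [
    {:eetcd, "~> 0.3.6"}
  ]
end
```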