
Connection to etcd cluster isn't resilient

Open balena opened this issue 2 years ago • 10 comments

Steps to reproduce:

  1. Connect to etcd;
  2. Interrupt the connectivity, and wait for all connections to timeout/disconnect;
  3. Re-establish connectivity.

Expected: eetcd reconnects to etcd cluster.

Behavior: it doesn't reconnect.

Instead, it loses the configuration initially passed to the eetcd:open function.

If you use the auto_sync_interval_ms option, the following logs are generated repeatedly:

[warning] <Name> has no active connections to etcd cluster, cannot get member list
[warning] <Name> has no active connections to etcd cluster, cannot get member list
...

In the client app, the error :eetcd_conn_unavailable is constantly returned.
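
For reference, the setup that triggers this looks roughly like the sketch below. The connection name and endpoints are made up, and the eetcd:open options form (third argument as a proplist) and the eetcd_kv:get/2 return values are assumptions based on the docs, so treat it as an illustration rather than the exact API:

    # Illustrative only: :my_etcd and the endpoints are made up, and the
    # open/3 options form is assumed from the eetcd documentation.
    {:ok, _pid} =
      :eetcd.open(:my_etcd, ["etcd-0:2379", "etcd-1:2379", "etcd-2:2379"],
        mode: :connect_all,
        auto_sync_interval_ms: 15_000
      )

    # Works while connectivity is up:
    {:ok, _resp} = :eetcd_kv.get(:my_etcd, "foo")

    # After the outage described in steps 2-3, every call keeps failing,
    # even though connectivity is back:
    {:error, :eetcd_conn_unavailable} = :eetcd_kv.get(:my_etcd, "foo")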

balena avatar Apr 13 '22 18:04 balena

It can be a reasonable design choice for a client library to not perform connection recovery, in particular in Erlang, and let the developer do that.

I am not sure if that's the case here. @zhongwencool would you be interested in having this feature?

michaelklishin avatar Apr 13 '22 21:04 michaelklishin

Attempted to address the problem in https://github.com/zhongwencool/eetcd/pull/50.

It can be a reasonable design choice for a client library to not perform connection recovery

Yes, it can, but in that case there must be a way for the client to discover that the connection is down, plus a function to reconnect, and neither exists here. It would be worth comparing with the Go etcd/clientv3 implementation: it keeps reconnecting in the background.

Anyway, the PR solves the problem by reusing the health checks to refill the freeze_conns attribute from members_list and thus re-establish the connections. But this means the client has to enable the auto_sync_interval_ms option. Also, when a new node is discovered via the auto_sync_interval_ms option, the log output doesn't get flooded with repeated messages like:

[notice] [msg: 'Got removed endpoints', removed_endpoints: [{Host, Port}]]
[notice] [msg: 'Got removed endpoints', removed_endpoints: [{Host, Port}]]
...

I couldn't understand the design decision behind erasing the active connections and the frozen connections at the same time. It means that when both are empty we end up with a dangling eetcd connection... Is there a reason for that?

balena avatar Apr 14 '22 00:04 balena

Confirming this issue. Our Etcd cluster ran out of memory, so we had to restart it. The app (which uses this library) did not recover on its own. We had to restart the app pods to get it working again.

I'm not sure how gRPC stuff works, but most network clients that I write use the awesome Connection module/behaviour which makes it super easy to manage reconnections for :gen_tcp sockets and what not.
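
For illustration, a reconnecting wrapper built on that behaviour might look something like the sketch below. The module name, registered connection name and endpoints are all made up, and it assumes :eetcd.open/2 returns {:ok, pid} and that :eetcd.close/1 exists, so it is a pattern to copy rather than a drop-in fix:

    defmodule EtcdConn do
      # Hypothetical reconnecting wrapper around eetcd using the
      # Connection behaviour from the `connection` hex package.
      use Connection

      @conn :my_etcd
      @endpoints ["etcd-0:2379", "etcd-1:2379", "etcd-2:2379"]

      def start_link(_opts \\ []) do
        Connection.start_link(__MODULE__, nil, name: __MODULE__)
      end

      @impl true
      def init(_), do: {:connect, :init, %{}}

      @impl true
      def connect(_info, state) do
        case :eetcd.open(@conn, @endpoints) do
          {:ok, pid} ->
            # Watch the eetcd connection process so we notice when it dies.
            {:ok, Map.put(state, :ref, Process.monitor(pid))}

          {:error, _reason} ->
            # Could not connect: retry with a 5 second backoff.
            {:backoff, 5_000, state}
        end
      end

      @impl true
      def disconnect(_info, state) do
        # Close whatever is left, then go back to connecting with backoff.
        _ = :eetcd.close(@conn)
        {:backoff, 1_000, Map.delete(state, :ref)}
      end

      @impl true
      def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
        {:disconnect, :conn_down, state}
      end

      def handle_info(_msg, state), do: {:noreply, state}
    end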

cjbottaro avatar Apr 30 '22 21:04 cjbottaro

@cjbottaro what version of this library do you use? It should have been addressed by https://github.com/zhongwencool/eetcd/commit/40acb27a6ebfb995975de996adf72be61d603da0 which shipped in 0.3.5.

michaelklishin avatar Apr 30 '22 21:04 michaelklishin

mix deps reports 0.3.5...

* eetcd 0.3.5 (Hex package) (rebar3)
  locked at 0.3.5 (eetcd) af9d5158

Unfortunately, I didn't save the app logs when we restarted the Etcd cluster, but there were definitely messages from eetcd.

cjbottaro avatar May 01 '22 00:05 cjbottaro

There will be messages from eetcd since connection loss will result either in an error somewhere or just an error log message. The question is, how do we reproduce the failure to recover?

@cjbottaro can you put together a small app that demonstrates the non-recovery behavior and how your application supervises whatever processes use eetcd?

michaelklishin avatar May 01 '22 12:05 michaelklishin

It's easy to reproduce. The steps were described at the top of this issue.

To interrupt connectivity on the client running eetcd, just turn off the network interface, turn off the router, etc., and keep it that way until you see that all nodes have disconnected. Then restore connectivity; you'll see that eetcd does not reconnect.

The moment you disable connectivity, assuming you have enabled the keep-alive mechanism, the connections to the etcd nodes drop one by one. If you restore connectivity before all of them have dropped, the remaining connections are maintained and the others eventually reconnect (as the etcd nodes announce the cluster).

But the problem happens when all connections to the etcd nodes drop. The eetcd library enters a limbo, and the only solution is to reconnect manually.

It isn't a reasonable design to trigger reconnection by inspecting each request: imagine an app that reads from and writes to etcd. You would need to handle the error condition in every place you interact with eetcd and attempt a reconnection...
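
To make that objection concrete, the per-request approach would mean wrapping every single call in something like the hypothetical helper below (module name, helper name and the crude close/open retry strategy are invented for the example):

    defmodule EtcdCalls do
      # Hypothetical per-call wrapper: every interaction with eetcd has to
      # detect the dead connection and rebuild it itself.
      def get_with_reconnect(conn, key, endpoints) do
        case :eetcd_kv.get(conn, key) do
          {:ok, resp} ->
            {:ok, resp}

          {:error, :eetcd_conn_unavailable} ->
            # Tear the whole connection down, reopen it, and retry once.
            _ = :eetcd.close(conn)
            _ = :eetcd.open(conn, endpoints)
            :eetcd_kv.get(conn, key)

          {:error, _} = error ->
            error
        end
      end
    end

Multiply that by every read, write, watch and lease call and the boilerplate adds up quickly.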

Also, what is the purpose of the keep-alive if it neither keeps the connections alive nor informs the client that the connections have dropped?

balena avatar May 01 '22 14:05 balena

Yup, super easy to reproduce with Docker. In our case Kubernetes... we just deleted our cluster:

00:31:53.147 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:53.159 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:53.276 [warning] Etcd failed to connect [etcd-2:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:53.313 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:53.318 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:53.326 [warning] Etcd failed to connect [etcd-2:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:53.487 [warning] Etcd failed to connect [etcd-1:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:54.488 [warning] Etcd failed to connect [etcd-2:2379] by <Gun Down> :timeout
00:31:54.493 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:54.496 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:31:55.497 [warning] Etcd failed to connect [etcd-2:2379] by <Gun Down> :timeout
00:31:56.498 [warning] Etcd failed to connect [etcd-1:2379] by <Gun Down> :timeout
00:31:58.300 [warning] Etcd failed to connect [etcd-1:2379] by <Gun Down> :timeout
00:31:59.301 [warning] Etcd failed to connect [etcd-2:2379] by <Gun Down> :timeout
00:31:59.305 [warning] Etcd failed to connect [etcd-0:2379] by <Gun Down> {:shutdown, :econnrefused}
00:32:00.905 [info] ETCD(Etcd, #PID<0.5673.0>)'s connections are ready.

That last log message is super suspect, considering the cluster (including DNS) doesn't exist anymore at all.

cjbottaro avatar May 08 '22 00:05 cjbottaro

Gentle bump @michaelklishin @zhongwencool.

balena avatar Jun 21 '22 18:06 balena

This has not been forgotten but I personally won't be able to look into this in the near term.

michaelklishin avatar Jun 22 '22 11:06 michaelklishin

It can be a reasonable design choice for a client library to not perform connection recovery, in particular in Erlang, and let the developer do that.

I am not sure if that's the case here. @zhongwencool would you be interested in having this feature?

@michaelklishin I suppose it isn't a reasonable design choice. This problem only occurs when the connect mode is set to connect_all (the default). The connection recovery mechanism works if you set the connect mode to random.
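
For anyone affected before a fix ships, the workaround implied here would be to open the connection in random mode, along the lines of the sketch below (connection name and endpoints are illustrative; the option name follows the eetcd README, though the exact open signature may differ between versions):

    # Workaround sketch: use the random connect mode instead of the
    # default connect_all so connection recovery keeps working.
    :eetcd.open(:my_etcd, ["etcd-0:2379", "etcd-1:2379", "etcd-2:2379"],
      mode: :random
    )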

gilbertwong96 avatar Aug 09 '22 07:08 gilbertwong96

Good to know that it only affects one mode, thanks for digging in.

michaelklishin avatar Aug 09 '22 08:08 michaelklishin

Good to know that it only affects one mode, thanks for digging in.

@michaelklishin I fixed it in #54.

gilbertwong96 avatar Aug 09 '22 09:08 gilbertwong96

0.3.6 is ready now.

zhongwencool avatar Aug 09 '22 09:08 zhongwencool