gravity
Planet agent serf connection hanging
Description
What happened: We have observed at least these symptoms with the long-lived planet agent serf connection:
- sometimes the underlying package will invalidate the connection on its own. We used to detect this in the serf client itself and implicitly reestablish the connection.
- possibly after suspend/hibernate, the existing serf connection goes into a state where the remote side is not responding properly and the agent's side continuously fails with an i/o timeout error. Since the connection is not invalidated in this case, the agent is effectively unable to communicate with serf until it is restarted.
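To illustrate the second symptom, here is a minimal Go sketch of one way to handle it: treat a timeout error as a signal to drop the connection, so that the next call dials a fresh one. This is not the actual serf client API and not necessarily the fix that was merged. The names `conn`, `reconnectingClient`, and `fakeConn` are hypothetical and exist only for this example.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"sync"
)

// conn is a hypothetical stand-in for a long-lived serf RPC connection.
type conn interface {
	Ping() error
	Close() error
}

// reconnectingClient drops its connection after a timeout so that the
// next call dials a fresh one instead of reusing a stale connection.
type reconnectingClient struct {
	mu   sync.Mutex
	dial func() (conn, error) // hypothetical factory for new connections
	c    conn
}

// isTimeout reports whether err is a network timeout such as "i/o timeout".
func isTimeout(err error) bool {
	var nerr net.Error
	return errors.As(err, &nerr) && nerr.Timeout()
}

// Ping uses the current connection, dialing first if there is none.
// On a timeout the connection is closed and forgotten; the error is
// still returned to the caller.
func (r *reconnectingClient) Ping() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.c == nil {
		c, err := r.dial()
		if err != nil {
			return err
		}
		r.c = c
	}
	err := r.c.Ping()
	if isTimeout(err) {
		r.c.Close()
		r.c = nil
	}
	return err
}

// fakeConn simulates a connection; when stale is true every Ping times out.
type fakeConn struct{ stale bool }

func (f *fakeConn) Ping() error {
	if f.stale {
		return &net.OpError{Op: "write", Net: "tcp", Err: os.ErrDeadlineExceeded}
	}
	return nil
}

func (f *fakeConn) Close() error { return nil }

func main() {
	dials := 0
	client := &reconnectingClient{dial: func() (conn, error) {
		dials++
		// Only the first connection is stale in this simulation.
		return &fakeConn{stale: dials == 1}, nil
	}}
	fmt.Println("first ping:", client.Ping())  // times out, connection is dropped
	fmt.Println("second ping:", client.Ping()) // redials and succeeds
	fmt.Println("dials:", dials)
}
```

The sketch returns the timeout to the caller rather than retrying inside the same call, which keeps the wrapper simple and leaves the retry policy to whatever periodic check is driving it.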
Also merged information from another ticket about the same issue (https://github.com/gravitational/gravity/issues/1215)
It has been observed a couple of times that after the cluster encounters some sort of networking issues and recovers, planet agent can't contact local serf server:
Mar 06 19:55:18 ip-172-31-0-5.ec2.internal /usr/bin/planet[719]: ERRO [PING-CHEC] write tcp 127.0.0.1:55480->127.0.0.1:7373: i/o timeout monitoring/ping.go:141
Mar 06 19:55:36 ip-172-31-0-5.ec2.internal /usr/bin/planet[719]: WARN Timed out collecting node statuses: context deadline exceeded. agent/agent.go:481
Mar 06 19:55:37 ip-172-31-0-5.ec2.internal /usr/bin/planet[719]: WARN Timed out collecting test results: context deadline exceeded. agent/agent.go:338
Mar 06 19:56:08 ip-172-31-0-5.ec2.internal /usr/bin/planet[719]: ERRO [PING-CHEC] write tcp 127.0.0.1:55480->127.0.0.1:7373: i/o timeout monitoring/ping.go:141
Serf itself is actually accessible, e.g. serf members (which calls the same 127.0.0.1:7373 RPC endpoint) works fine. So something might be happening with the Go serf client that prevents it from recovering after networking issues.
What you expected to happen: Agent is able to communicate with serf (subject to serf availability) and does not show symptoms as listed above.
How to reproduce it (as minimally and precisely as possible):
Environment
- Gravity version [e.g. 7.0.11]: 5.5.x
- OS [e.g. Redhat 7.4]: redhat 7
- Platform [e.g. Vmware, AWS]: aws
As of Nov 2nd, still needs to be ported to 7.0. Seems to be on all other active release branches.
Hey @knisbet - did you add the relevant commit to your latest 7.0.28 release? couldn't find it :)
We're actually moving in the direction of dropping the use of serf entirely for cluster membership: https://github.com/gravitational/satellite/pull/284
Not on 7.0 yet though.
Hey @knisbet, is this already part of official Gravity release?
It's in the 7.1 alpha: https://goteleport.com/gravity/docs/changelog/#internal-changes