Planet agent serf connection hanging

Open a-palchikov opened this issue 4 years ago • 5 comments

Description

What happened: We have observed at least the following symptoms with the long-lived planet agent serf connection:

  1. Sometimes the underlying package invalidates the connection on its own. We used to detect this in the serf client itself and implicitly reestablish the connection.
  2. Possibly after suspend/hibernate, the existing serf connection enters a state where the remote side does not respond properly and the agent's side continuously fails with an i/o timeout error. Since the connection is not invalidated in this case, the agent cannot communicate with serf until it is restarted.

Also merged information from another ticket about the same issue (https://github.com/gravitational/gravity/issues/1215)

It has been observed a couple of times that after the cluster encounters networking issues and recovers, the planet agent can't contact the local serf server:

Mar 06 19:55:18 ip-172-31-0-5.ec2.internal /usr/bin/planet[719]: ERRO [PING-CHEC] write tcp 127.0.0.1:55480->127.0.0.1:7373: i/o timeout monitoring/ping.go:141
Mar 06 19:55:36 ip-172-31-0-5.ec2.internal /usr/bin/planet[719]: WARN             Timed out collecting node statuses: context deadline exceeded. agent/agent.go:481
Mar 06 19:55:37 ip-172-31-0-5.ec2.internal /usr/bin/planet[719]: WARN             Timed out collecting test results: context deadline exceeded. agent/agent.go:338
Mar 06 19:56:08 ip-172-31-0-5.ec2.internal /usr/bin/planet[719]: ERRO [PING-CHEC] write tcp 127.0.0.1:55480->127.0.0.1:7373: i/o timeout monitoring/ping.go:141

Serf itself is actually accessible: for example, serf members (which calls the same 127.0.0.1:7373 RPC endpoint) works fine. So something might be happening with the Go serf client that prevents it from recovering after networking issues.

What you expected to happen: The agent is able to communicate with serf (subject to serf availability) and does not show the symptoms listed above.

How to reproduce it (as minimally and precisely as possible):

Environment

  • Gravity version [e.g. 7.0.11]: 5.5.x
  • OS [e.g. Redhat 7.4]: redhat 7
  • Platform [e.g. Vmware, AWS]: aws

a-palchikov avatar Aug 13 '20 11:08 a-palchikov

As of Nov 2nd, still needs to be ported to 7.0. Seems to be on all other active release branches.

knisbet avatar Nov 02 '20 15:11 knisbet

As of Nov 2nd, still needs to be ported to 7.0. Seems to be on all other active release branches.

Hey @knisbet - did you add the relevant commit to your latest 7.0.28 release? couldn't find it :)

snirkatriel avatar Dec 09 '20 07:12 snirkatriel

We're actually moving in the direction of dropping the use of serf entirely for cluster membership: https://github.com/gravitational/satellite/pull/284

Not on 7.0 yet though.

knisbet avatar Dec 11 '20 15:12 knisbet

Hey @knisbet, is this already part of an official Gravity release?

guykanyvision avatar Apr 13 '21 08:04 guykanyvision

It's in the 7.1 alpha: https://goteleport.com/gravity/docs/changelog/#internal-changes

knisbet avatar Apr 13 '21 13:04 knisbet