
Missing Keepalive Timeout Events for Agents stopped during a full-cluster outage

Open c-kruse opened this issue 2 years ago • 0 comments

Our keepalive reconstruction strategy (which reschedules timers for agent keepalives on backend startup) assumes that agent state does not change during a full cluster outage. Because of this, if any agents die during a full backend cluster outage (or within their keepalive threshold of one), the recovered cluster will not reschedule keepalive warnings, critical errors, or agent deregistration for those agents until they reconnect.

Expected Behavior

The keepalives recover and do the keepalive things.

Current Behavior

When the issue occurs, the entity shows as not seen since it was last connected, and no keepalive events are generated to alert operators that the entity is absent.

Possible Solution

Steps to Reproduce (for bugs)

  1. Start a sensu-go backend with an external etcd cluster (or single instance)
  2. Connect two agents to the backend using the default keepalive configuration (AgentA and AgentB)
  3. See the keepalive event from your agents
  4. Stop AgentA, and wait ~2 minutes for the default keepalive warning threshold to expire. You should see a new failing keepalive event for AgentA.
  5. Kill the backend, but leave etcd running. Stop AgentB while the backend is down.
  6. Wait ~2 minutes for the default keepalive warning threshold to expire. If you watch the etcd store with etcdctl watch --prefix /sensu.io/switchsets/ you should see the records corresponding to AgentA and AgentB deleted.
  7. Start the backend back up
  8. Observe that failing keepalive events continue to be produced for AgentA, but none are produced for AgentB

Your Environment

Sensu 6.*, external etcd (although it should be sporadically reproducible with embedded etcd)

c-kruse, Aug 03 '22 01:08