sensu-go
Missing Keepalive Timeout Events for Agents stopped during a full-cluster outage
Our keepalive reconstruction strategy (which reschedules timers for agent keepalives on backend startup) assumes that agent state does not change during a full cluster outage. Because of this, if any agents die during a full backend cluster outage (or within their keepalive threshold of one), the recovered cluster will not reschedule keepalive warnings, critical errors, or agent deregistration for those agents until they reconnect.
Expected Behavior
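Keepalives recover after the outage: when the cluster comes back, keepalive warnings, critical errors, and agent deregistration are rescheduled for every agent, including agents that stopped during the outage.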
Current Behavior
When the issue occurs, the entity has not been seen since it was last connected, yet no keepalive events are generated to alert operators that the entity is absent.
Possible Solution
Steps to Reproduce (for bugs)
- Start a sensu-go backend with an external etcd cluster (or single instance)
- Connect two agents to the backend using the default keepalive configuration (AgentA and AgentB)
- Confirm that a keepalive event appears for each agent
- Stop AgentA, and wait ~2 minutes for the default keepalive timeout to expire. You should see a new failing keepalive event for AgentA.
- Kill the backend, but leave etcd running
- Stop AgentB while the backend is down
- Wait ~2 minutes for the default keepalive timeout to expire. If you watch the etcd store, you should see the records corresponding to AgentA and AgentB deleted:
etcdctl watch --prefix /sensu.io/switchsets/
- Start the backend back up
- Observe that failing keepalive events continue to be produced for AgentA, but not AgentB
Your Environment
- Sensu 6.*
- External etcd (although it should be sporadically reproducible with embedded etcd)