Add agent-listen-wait-time backend configuration
Adds an --agent-listen-wait-time flag (default 0) to allow users to configure a time in seconds to wait on startup before accepting agent connections.
When the backend is started, agentd will bind to the agent port (8081 by default). When an agent-listen-wait-time is set, the agent websocket route / should return an HTTP 503 until the wait time has expired. After the wait time has expired, agentd will begin accepting agent connections as usual. At startup the /health route should begin serving traffic as usual.
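To make the intended behavior concrete, here is a minimal sketch in Go (illustrative only, not the actual agentd code; handler and field names are made up) of a listener that serves /health immediately but answers the agent websocket route with a 503 until the configured wait time has elapsed:

```go
package main

import (
	"flag"
	"log"
	"net/http"
	"time"
)

// agentListener gates the agent websocket route behind a startup wait time
// while leaving /health available as soon as the port is bound.
// All names here are illustrative, not sensu-go's real types.
type agentListener struct {
	ready time.Time // startup time plus the configured wait
}

func newAgentListener(wait time.Duration) *agentListener {
	return &agentListener{ready: time.Now().Add(wait)}
}

func (l *agentListener) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	switch r.URL.Path {
	case "/health":
		// /health serves traffic as usual from the moment the port is bound.
		w.WriteHeader(http.StatusOK)
	case "/":
		if time.Now().Before(l.ready) {
			// Wait time has not expired yet: reject the websocket handshake.
			http.Error(w, "backend is starting", http.StatusServiceUnavailable)
			return
		}
		// Wait time expired: accept the agent connection as usual
		// (the websocket upgrade is elided in this sketch).
		w.WriteHeader(http.StatusOK)
	default:
		http.NotFound(w, r)
	}
}

func main() {
	// Hypothetical wiring of the proposed flag: an integer number of seconds,
	// defaulting to 0 (accept agent connections immediately).
	wait := flag.Int("agent-listen-wait-time", 0,
		"seconds to wait on startup before accepting agent connections")
	flag.Parse()

	// Bind to the agent port (8081 by default) right away; the wait only
	// affects the websocket route, not /health.
	log.Fatal(http.ListenAndServe(":8081",
		newAgentListener(time.Duration(*wait)*time.Second)))
}
```

In the real backend the websocket upgrade and agent session handling would replace the placeholder 200 response; the point is only that the port is bound at startup and the gate is purely time-based.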
https://github.com/sensu/sensu-go/issues/4827 proposes exposing details of this state through a health endpoint so users aren't left in the dark.
Why
Provides built-in support for the most successful approach to mitigating unstable backend startups, documented here: https://github.com/sensu/sensu-enterprise-go/issues/2373
What’s the downside of not binding vs. responding with a 503? My concern is whether enough load could be generated just by thousands of agents hitting the websocket port and then retrying over and over.
I am not particularly worried about handling the load with canned 503 responses. Agents have had a pretty reasonable default backoff strategy (starting at 1s and doubling each retry until 2 minutes) since 6.4.1. Before that it was a little busier (starting at 10ms and 10xing until 10s). So as long as folks aren't running quite old agent fleets, we're likely to peak at no higher than 1 connection per agent per second. Older agent fleets may hit 3 connections per agent per second in a worst-case scenario. That is definitely okay at 10k agents on recommended hardware.
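For a sense of the numbers, here is a small Go sketch (illustrative, not the agent's actual implementation) of the post-6.4.1 backoff described above, starting at 1s and doubling each retry up to a 2-minute cap:

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule returns the retry delays an agent would use under the
// defaults described above: start at `start` and double each retry, capped
// at `max`. This is an illustration, not the agent's code.
func backoffSchedule(start, max time.Duration, attempts int) []time.Duration {
	delays := make([]time.Duration, 0, attempts)
	d := start
	for i := 0; i < attempts; i++ {
		delays = append(delays, d)
		d *= 2
		if d > max {
			d = max
		}
	}
	return delays
}

func main() {
	// Prints 1s, 2s, 4s, 8s, ... capped at 2m0s.
	for i, d := range backoffSchedule(time.Second, 2*time.Minute, 10) {
		fmt.Printf("retry %d after %s\n", i+1, d)
	}
}
```

After the first couple of retries the per-agent attempt rate falls well below one per second, which is why a canned 503 during the wait window is cheap to serve.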
Pros:
- It matches the way the API needs to work (since it serves the /health endpoint). agentd also serves the same /health endpoint on the agent port.
- It's more explicit. Agents today should log a "handshake failed with status 503" error, so it will be clear that they were able to connect to a starting Sensu backend that is temporarily unavailable.
- I could be off here, but I think it'll help avoid "wait, so what's your load balancer config?" sorts of situations. Direct agent connections and pretty much any functioning LB configuration should all result in a relatively similar experience on the agent side.
Cons:
- One could configure agents to DDoS backends using mean retry settings.
- More time spent getting agents reconnected through a well-configured layer 4 proxy. Instead of the LB routing traffic only to ready backends, it would include a 503ing backend in its server pool. *This could subjectively be a pro; the behavior could be desirable in busy clusters.