Make Nomad's HTTP server KeepAlive settings configurable
Why?
We are introducing a proxy in front of Nomad's server HTTP API. While load testing this new layer of our infrastructure we hit constant, reproducible `EOF` and `read: connection reset by peer` errors whenever the load test reached its peak of ~68 concurrent requests.
Because this is an undocumented, hardcoded setting, we went from tuning the TCP stack of our ECS cluster (where the proxy runs) all the way down to tcpdumps and straces trying to figure out why this was happening.
TL;DR
https://github.com/hashicorp/nomad/blob/main/command/agent/http.go#L311 defines a default `keepalive` of 30s, while our proxy defaults to 2 min (we tuned it down to 60s during our debugging, but that didn't help). This means Nomad was closing connections prematurely while our proxy still considered them alive. The health check then starts getting random 502s and marks the upstream as unhealthy.
When we set the `keepalive` value in our proxy config to 15s, we could no longer reproduce the issue.
Proposal
Can the Nomad HTTP server `keepalive` settings be made configurable, or at least match the `wait` time default of 5m defined for Blocking Queries? If not, can this be documented with a warning somewhere to spare someone else days of hair pulling?