Make Nomad's HTTP server KeepAlive settings configurable

Open danlsgiga opened this issue 1 year ago • 0 comments

Why?

We are introducing a proxy in front of Nomad's Server HTTP API. While load testing this new layer of our infrastructure, we kept hitting constant and reproducible "EOF" and "read: connection reset by peer" errors whenever the load test reached its peak of ~68 concurrent requests.

Because this setting is undocumented and hardcoded, we went from tuning the TCP stack of our ECS cluster (where the proxy is running) all the way down to tcpdump and strace sessions trying to figure out why this was happening.

TL;DR

https://github.com/hashicorp/nomad/blob/main/command/agent/http.go#L311 defines a default keepalive of 30s, while our proxy has a 2 min default (we tuned it down to 60s while debugging, but that didn't help, presumably because 60s is still longer than Nomad's 30s).

This means that Nomad was closing connections while our proxy still considered them alive. The proxy's health check then started getting random 502s and marked the upstream as unhealthy.

Once we set the keepalive value in our proxy config to 15s, we could no longer reproduce the issue.

Proposal

Can the Nomad HTTP server keepalive settings be made configurable, or at least match the 5m default wait time defined for Blocking Queries? If not, can this be documented with a warning somewhere to save someone else days of hair pulling?

danlsgiga, Dec 16 '22 16:12