Leader memory leak when metrics are exposed but not consumed
Nomad version
1.10.3
Operating system and Environment details
AlmaLinux 9
Issue
Nomad servers (leaders) leak memory over time when metrics aren't consumed
Reproduction steps
Enable telemetry in the server leader config:

```hcl
telemetry {
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
```
Stop scraping metrics (our Prometheus node went down for 24 hours)
Expected Result
No memory change
Actual Result
We started seeing leaders fill up on memory and fail. As soon as Prometheus started scraping again, memory usage stabilized.
(The node was elected leader at 18:00; Prometheus came back online and started scraping at 21:30.)
Note this is documented in https://developer.hashicorp.com/nomad/docs/configuration/telemetry#prometheus_metrics
Nomad's Prometheus client retains metrics in memory unless scraped, so you should not enable this field unless you are collecting metrics via Prometheus.
But go-metrics uses the official Go Prometheus client under the hood. I'm pretty sure it supports an expiry, so we could simply drop metrics once they're sufficiently stale.
D'oh! Understood. Definitely a bit of a footgun; it's been a while since I've worked in this area, and since the trigger was an outage of the Prometheus cluster, I wasn't paying close attention here. Will note for the future.
Hi @MattJustMatt and thanks for raising this issue. As Tim mentioned, go-metrics uses the official Prometheus client and includes an expiry value, so old metrics are deleted after a certain point. Nomad calls into go-metrics via the prometheus.NewPrometheusSink method, which sets the default Prometheus options, including an expiration of 60 seconds.
On the surface it therefore looks like the metrics should expire and limit memory growth in situations like the one you experienced. However, I am unsure at this moment whether the expiration is only applied when a scrape is performed. I'll mark this for further investigation so we can look deeper into the logic.
For future readers, the collectAtTime function holds the expiration logic.