
Leader memory leak when metrics are exposed but not consumed

Open MattJustMatt opened this issue 4 months ago • 3 comments

Nomad version

1.10.3

Operating system and Environment details

AlmaLinux 9

Issue

Nomad servers (leaders) leak memory over time when metrics aren't consumed

Reproduction steps

Enable telemetry in the server leader config:

```hcl
telemetry {
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
```

Stop scraping metrics (our Prometheus node went down for 24 hours)

Expected Result

No memory change

Actual Result

We started seeing leaders fill up on memory and fail. As soon as Prometheus started scraping again, memory usage stabilized.

(The node was elected leader at 18:00; Prometheus came back online and started scraping at 21:30.)

[Image: leader memory usage graph]

MattJustMatt avatar Oct 29 '25 15:10 MattJustMatt

Note this is documented in https://developer.hashicorp.com/nomad/docs/configuration/telemetry#prometheus_metrics

> Nomad's Prometheus client retains metrics in memory unless scraped, so you should not enable this field unless you are collecting metrics via Prometheus.

But this is the official Go Prometheus client under the hood of go-metrics. I'm pretty sure it supports an expiry and we could just drop the metrics once they're sufficiently stale.

tgross avatar Oct 29 '25 15:10 tgross

> Note this is documented in https://developer.hashicorp.com/nomad/docs/configuration/telemetry#prometheus_metrics
>
> > Nomad's Prometheus client retains metrics in memory unless scraped, so you should not enable this field unless you are collecting metrics via Prometheus.
>
> But this is the official Go Prometheus client under the hood of go-metrics. I'm pretty sure it supports an expiry and we could just drop the metrics once they're sufficiently stale.

D'oh! Understood. Definitely a bit of a footgun; it's been a while since I've been in this area, and since this was an outage of the Prometheus cluster, I wasn't paying close attention here. Will note for the future.

MattJustMatt avatar Oct 29 '25 16:10 MattJustMatt

Hi @MattJustMatt and thanks for raising this issue. As Tim mentioned, go-metrics uses the official Prometheus client, which includes an expiry value, so old metrics are deleted after a certain point. Nomad calls into go-metrics via the prometheus.NewPrometheusSink method, which in turn sets the default Prometheus options, including an expiration of 60 seconds.

On the surface it therefore looks like the metrics should expire and keep memory growth under control in situations like the one you experienced. However, I am unsure at this moment whether the expiration logic only runs when a scrape is performed. I'll mark this for further investigation so we can look deeper into the logic.

For future readers, the collectAtTime function holds the expiration logic.

jrasell avatar Oct 30 '25 07:10 jrasell