Potential memory leak on servers since 1.10.0
Hi! The investigation into this issue is still ongoing, but I wanted to report problems with the latest release as soon as I noticed them.
Nomad version
Nomad v1.10.0 BuildDate 2025-04-09T16:40:54Z Revision e26a2bd2acac2dcdcb623f4d293bac096beef478
Operating system and Environment details
Ubuntu 24.04.2 LTS
Issue
Memory usage steadily increasing for Nomad Server instances.
Context to this screenshot:
- The upgrade to v1.10.0 was yesterday between 12:00 and 16:00.
- Regular SIGHUP-reloads are happening due to certificate renewals by consul-template.
- The cluster has been electing a new leader quite often since the upgrade. Easily once every 10 minutes.
All three server nodes seem affected, with regular spikes up to 3 GB and then back down to 300 MB.
Reproduction steps
- Have a 3-node Server cluster
- Probably unrelated: we have 3 clients, plus another 3 server nodes federated in a second region. The problematic region is the authoritative one.
- Memory usage in the federated (non-authoritative) region seems unaffected: stable and low.
- Wait a day
Expected Result
Memory usage to stay roughly the same.
Actual Result
Memory usage increasing.
Job file (if appropriate)
Not applicable.
Nomad Server logs (if appropriate)
From these logs I noticed that my cluster has become very unstable since the 1.10 upgrade.
Also worth noting: network latency between nodes is less than 1 ms (same rack), but the nomad.raft.leader.lastContact metric showed values between 200 and 600 ms (±200 ms). Not great. The same metric in the non-authoritative region is around 1-100 ms (±26 ms), which is great (despite running on slower hardware). According to Consul:
# round-trip time
Minimum  0.43 ms
Median   0.70 ms
Maximum  0.80 ms
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:05.332364Z","peer":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3"}
{"@level":"warn","@message":"appendEntries rejected, sending older logs","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:05.338291Z","next":154265,"peer":{"Suffrage":1,"ID":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","Address":"172.16.1.236:4647"}}
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:05.400207Z","peer":{"Suffrage":1,"ID":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","Address":"172.16.1.236:4647"}}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.187027Z","server-id":"607347df-65b5-95b8-2ab3-6e0138a30db7","time":508388073}
{"@level":"warn","@message":"failed to contact quorum of nodes, stepping down","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.187102Z"}
{"@level":"info","@message":"entering follower state","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.187137Z","follower":{},"leader-address":"","leader-id":""}
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.249755Z","peer":{"Suffrage":1,"ID":"607347df-65b5-95b8-2ab3-6e0138a30db7","Address":"172.16.1.235:4647"}}
{"@level":"info","@message":"cluster leadership lost","@module":"nomad","@timestamp":"2025-04-11T13:19:09.323127Z"}
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.560093Z","peer":{"Suffrage":1,"ID":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","Address":"172.16.1.236:4647"}}
{"@level":"warn","@message":"heartbeat timeout reached, starting election","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.787542Z","last-leader-addr":"","last-leader-id":""}
{"@level":"info","@message":"entering candidate state","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.787590Z","node":{},"term":495}
{"@level":"info","@message":"pre-vote successful, starting election","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.852815Z","refused":0,"tally":2,"term":495,"votesNeeded":2}
{"@level":"info","@message":"election won","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.906849Z","tally":2,"term":495}
{"@level":"info","@message":"entering leader state","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.906884Z","leader":{}}
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.906904Z","peer":"607347df-65b5-95b8-2ab3-6e0138a30db7"}
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.906916Z","peer":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3"}
{"@level":"info","@message":"cluster leadership acquired","@module":"nomad","@timestamp":"2025-04-11T13:19:10.907793Z"}
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.933525Z","peer":{"Suffrage":0,"ID":"607347df-65b5-95b8-2ab3-6e0138a30db7","Address":"172.16.1.235:4647"}}
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.933853Z","peer":{"Suffrage":1,"ID":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","Address":"172.16.1.236:4647"}}
{"@level":"info","@message":"eval broker status modified","@module":"nomad","@timestamp":"2025-04-11T13:19:10.956148Z","paused":false}
{"@level":"info","@message":"blocked evals status modified","@module":"nomad","@timestamp":"2025-04-11T13:19:10.956327Z","paused":false}
{"@level":"info","@message":"Promoting server","@module":"nomad.autopilot","@timestamp":"2025-04-11T13:19:20.958465Z","address":"172.16.1.236:4647","id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","name":"nomad03-bre.q-mex.net.bremen"}
{"@level":"info","@message":"updating configuration","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:20.958561Z","command":0,"server-addr":"172.16.1.236:4647","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","servers":"[{Suffrage:Voter ID:84f3b5a8-d979-5d41-c659-badc9ecce162 Address:172.16.1.234:4647} {Suffrage:Voter ID:607347df-65b5-95b8-2ab3-6e0138a30db7 Address:172.16.1.235:4647} {Suffrage:Voter ID:69c7c3cc-256f-e12e-e82c-68bb3f29e6a3 Address:172.16.1.236:4647}]"}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:20:05.373354Z","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","time":505378093}
{"@level":"warn","@message":"failed retrieving server health","@module":"nomad.stats_fetcher","@timestamp":"2025-04-11T13:20:05.963132Z","error":"context deadline exceeded","server":"nomad01-bre.q-mex.net.bremen"}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:21:40.059996Z","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","time":501376848}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:22:09.619104Z","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","time":519836348}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:22:10.196529Z","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","time":506603638}
Nomad Client logs (if appropriate)
Not applicable.
After rebooting all server nodes, all seems normal again 😕
I have no idea what to make of it. All metrics normal:
- No memory spikes
- lastContact is at around 20 ms (±30 ms)
- No more constant leader elections / no more `failed to contact` logs
Yikes! Thanks for the report @EtienneBruines, and I'm glad a restart seems to have fixed it. I'm going to leave this open for now so we can do some investigation on our end. If you see anything like this again emailing cpu, memory, and goroutine profiles to [email protected] is greatly appreciated. That's a private list that goes to Nomad engineering at HashiCorp.
Taking a couple profiles a few minutes apart is useful for finding memory leaks as well. The nomad operator debug command should help or hit the profile endpoints directly: https://developer.hashicorp.com/nomad/api-docs/agent#agent-runtime-profiles
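Here is a minimal sketch (not HashiCorp-provided tooling) of grabbing two heap snapshots a few minutes apart via the agent pprof API linked above. `NOMAD_ADDR` and the 5-minute interval are assumptions; your cluster may also require `enable_debug = true` or an ACL token on these endpoints.

```python
# Sketch: capture Nomad agent runtime profiles for leak hunting.
# Assumes an unauthenticated, debug-enabled agent at NOMAD_ADDR.
import os
import urllib.request

NOMAD_ADDR = os.environ.get("NOMAD_ADDR", "http://127.0.0.1:4646")


def profile_url(profile, seconds=None):
    """Build the agent pprof endpoint URL (heap, goroutine, profile, ...)."""
    url = f"{NOMAD_ADDR}/v1/agent/pprof/{profile}"
    if seconds is not None:
        url += f"?seconds={seconds}"
    return url


def capture(profile, outfile, seconds=None):
    """Download one snapshot to a local file, suitable for `go tool pprof`."""
    with urllib.request.urlopen(profile_url(profile, seconds)) as resp:
        with open(outfile, "wb") as f:
            f.write(resp.read())


# Example usage: two heap snapshots a few minutes apart; diffing them with
# `go tool pprof -base heap-0.pprof heap-1.pprof` highlights leak growth.
# capture("heap", "heap-0.pprof")
# ... wait ~5 minutes ...
# capture("heap", "heap-1.pprof")
```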
The fact that Raft was having trouble is worrying as well. Any chance you have CPU metrics for the servers? I wonder if something got stuck in a tight loop (potentially creating a lot of garbage which is why the memory does drop back down periodically). The Raft-related goroutines could get starved for CPU leading to what appears to be "network" latency.
Not sure if related, but I'm also seeing a memory leak since upgrading to v1.9.7. @EtienneBruines, what version were you using before v1.10.0?
The memory increase is also only really visible on 1 node, a follower node, not the leader.
1 April is when we started to upgrade.
CPU usage is also a bit higher. It has two peaks, which don't seem related to the memory increase at the time, but they are really high.
@42wim We had upgraded from 1.9.7 to 1.10.0
@schmichael Thank you for your reply! We will keep an eye on it and send those profiles once it reoccurs.
The fact that Raft was having trouble is worrying as well. Any chance you have CPU metrics for the servers? I wonder if something got stuck in a tight loop (potentially creating a lot of garbage which is why the memory does drop back down periodically). The Raft-related goroutines could get starved for CPU leading to what appears to be "network" latency.
We thought the high CPU load was a 'symptom', caused by the constant leader switching. Guess we were wrong there.
Hmm. Last Friday I noticed all of our Filebeat instances with the add_nomad_metadata processor had been malfunctioning. For every allocation (of which we have about 600, before they get GC'ed after 24 h), it would request all allocations of a certain Nomad client. It did so every 5 seconds or so. I guess with that call being non-stale, all those requests ended up on the leader, overwhelming it and causing the starvation that you described.
@schmichael Would that be a reasonable theory?
None of that Filebeat is supported/created by HashiCorp, but since there's a chance others are also using this setup, I thought I'd mention it here.
For every allocation (of which we have about 600, before they get GC'ed after 24 h), it would request all allocations of a certain Nomad client. It did so every 5 seconds or so. I guess with that call being non-stale, all those requests ended up on the leader, overwhelming it and causing the starvation that you described.
Reads should be very cheap, but this seems possible. I would expect msgpack and/or json serialization to feature heavily in a CPU profile in that case since allocation queries do very little processing otherwise.
Any idea what the total requests/second would have been and what the CPU capacity of your Nomad Servers are? With that info we might be able to reproduce the behavior.
@schmichael About 150 requests per second on a mere 2 core @ 2.7GHz VM, requesting /v1/node/NODE_ID/allocations?wait=5000.
I could also see some CPU overhead for requests made to non-leader server nodes, since they'd have to forward the request to the leader over an encrypted connection. I'm not sure of the specifics (whether a new TLS handshake would be needed, or whether it's basically symmetric encryption at this point), but there's probably overhead. 🤔
Perhaps (this is just a wild guess, not based on any evidence) some blocking queries are not being cleaned up if they were made during a leader change?
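A leader-friendlier variant of that polling pattern can be sketched as follows: passing the last seen `X-Nomad-Index` makes it a true blocking query instead of a fixed 5-second poll, and allowing stale reads lets follower servers answer rather than funneling every request to the leader. The `index`, `wait`, and `stale` parameters come from Nomad's HTTP API docs; the helper itself is illustrative, not Filebeat code.

```python
# Sketch: build a follower-servable blocking query for a node's allocations.
# The stale flag is per Nomad's HTTP API; verify against your version.
import urllib.parse


def allocations_url(base, node_id, index=0, wait="5s", stale=True):
    """Blocking-query URL for /v1/node/<id>/allocations."""
    params = {"index": index, "wait": wait}
    if stale:
        params["stale"] = "true"  # any server may serve the read
    return f"{base}/v1/node/{node_id}/allocations?" + urllib.parse.urlencode(params)
```

A caller would feed the `X-Nomad-Index` response header back in as `index` on the next request, so the query only returns when something actually changed.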
Was only the server leader ballooning? Could this be related to this issue? https://github.com/hashicorp/nomad/issues/18113
We have noticed all server nodes ballooning. However, with a leader election and transfer every 10-20 seconds, they all were leaders regularly.
Since this does not appear to be trivially or universally reproducible, I think we need more reports/data to proceed.
My $0.02:
What is the cluster's job profile? For me this manifested in a huge way, as I have a lot of batch and cron jobs: many (many) jobs running every minute and some every five minutes.
Also, for me it helped to narrow down the potential issue by enabling debug and taking a profile dump. BTW, the fact that Prometheus metrics were causing this was an educated guess, made only after looking at the call graph generated from the profile (as guided by @tgross).
Hi mate,
I think I have the same situation. My Nomad is hosted on GCP on an e2-small (shared-core, 2 vCPU, 4 GB memory + 6 GB swap). It's only used to run some cron jobs for web crawling every minute.
This is my GCP dashboard for its metrics. You can see that for almost two days after I restart the machine, memory utilization goes up constantly, and finally, when the swap is also fully occupied, the whole instance goes down.
So I guess there might be a memory leak in Nomad v1.10.x?
If you need more information to dig into it, please let me know.
- System Info:
Ubuntu 24.04.3 LTS (GNU/Linux 6.14.0-1018-gcp x86_64)
- nomad version
Nomad v1.10.5 BuildDate 2025-09-09T14:36:45Z Revision a3b86c697f38ab032e1acaae8503ed10815bc4a2
Last 2 days:
Last week:
@payt0nc is Prometheus metrics enabled on the Nomad servers (and clients)? If enabled, is it being scraped properly?
Can you disable Prometheus metrics to see if that changes anything?
@payt0nc is Prometheus metrics enabled on the Nomad servers (and clients)? If enabled, is it being scraped properly?
Can you disable Prometheus metrics to see if that changes anything?
OK, now disabled. Give me some time to capture the data.
Give me some time to capture the data.
Although the graphs you shared are great it would really help us to figure out the exact problem if you were to get a heap profile (via /debug/pprof/heap) while the problem is happening. You can email that to [email protected] which is a write-only mailing list.
That being said:
My Nomad is hosted on GCP with an e2-small (share core, 2vCPU, 4GB memory + 6GB Swap). It's only used to run some cronjobs for web crawling every minute. ... And, at last, when the swap is also fully occupied, whole instance would be down.
Keep in mind that Nomad servers have an in-memory state store of all their jobs, nodes, etc. So if you're hitting swap everything is going to crawl pretty badly and if it OOMs that's it for the instance. From what you're describing this looks like you're sharing a single instance between server and client, so it's vital that you've reserved enough memory on the client to account for the server's in-memory state.
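For a combined server+client node, one way to reserve that headroom (the 1024 MB figure is purely illustrative, not a recommendation) is the client `reserved` block in the agent configuration:

```hcl
client {
  enabled = true

  # Keep memory free for the co-located server's in-memory state store
  # and the OS itself, so workloads can't push the server into swap.
  reserved {
    memory = 1024 # MB
  }
}
```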
Got a heap dump from @payt0nc via the mailing list which looks like this:
This looks to be the issue that we fixed in https://github.com/hashicorp/nomad/pull/26737 which we shipped in Nomad 1.11.0. I'm not seeing it backported though and I can't remember why. Let me investigate that, but in the meantime it would be helpful to know which Nomad version produced the heap dump labelled heap-nomad-202511060323 in that email.
Edit: it wasn't backported because it was a breaking metrics change as noted here:
But this is a breaking metrics change, which is why it's not getting backported.