
Potential memory leak on servers since 1.10.0

Open EtienneBruines opened this issue 10 months ago • 16 comments

Hi! The investigation on this issue is still ongoing, but I wanted to report this issue with the latest release as soon as I noticed it.

Nomad version

Nomad v1.10.0 BuildDate 2025-04-09T16:40:54Z Revision e26a2bd2acac2dcdcb623f4d293bac096beef478

Operating system and Environment details

Ubuntu 24.04.2 LTS

Issue

Memory usage steadily increasing for Nomad Server instances.

Context to this screenshot:

  • The upgrade to v1.10.0 was yesterday between 12:00 and 16:00.
  • Regular SIGHUP-reloads are happening due to certificate renewals by consul-template.
  • The cluster has been electing a new leader quite often since the upgrade, easily once every 10 minutes.

All three server nodes seem affected. For example, memory regularly spikes up to 3GB and then drops back to 300MB.

Image

Reproduction steps

  • Have a 3-node Server cluster
    • Probably unrelated: we also have 3 clients, and another 3 Server nodes federated in a second region. The problematic region is the authoritative one.
    • Memory usage in the federated (non-authoritative) region seems unaffected: stable and low.
  • Wait a day

Expected Result

Memory usage to stay roughly the same.

Actual Result

Memory usage increasing.

Job file (if appropriate)

Not applicable.

Nomad Server logs (if appropriate)

Here I noticed that my cluster has become very unstable since the 1.10 upgrade.

Also worth noting: network latency between nodes is less than 1ms (same rack), but the nomad.raft.leader.lastContact metric showed values between 200 and 600ms (± 200ms). Not great. This metric in the non-authoritative region is about 1-100ms (± 26ms), which is great (despite running on slower hardware). According to Consul:

# roundtrip time
Minimum  0.43ms
Median 0.7ms
Maximum 0.8ms
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:05.332364Z","peer":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3"}
{"@level":"warn","@message":"appendEntries rejected, sending older logs","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:05.338291Z","next":154265,"peer":{"Suffrage":1,"ID":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","Address":"172.16.1.236:4647"}}
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:05.400207Z","peer":{"Suffrage":1,"ID":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","Address":"172.16.1.236:4647"}}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.187027Z","server-id":"607347df-65b5-95b8-2ab3-6e0138a30db7","time":508388073}
{"@level":"warn","@message":"failed to contact quorum of nodes, stepping down","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.187102Z"}
{"@level":"info","@message":"entering follower state","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.187137Z","follower":{},"leader-address":"","leader-id":""}
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.249755Z","peer":{"Suffrage":1,"ID":"607347df-65b5-95b8-2ab3-6e0138a30db7","Address":"172.16.1.235:4647"}}
{"@level":"info","@message":"cluster leadership lost","@module":"nomad","@timestamp":"2025-04-11T13:19:09.323127Z"}
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.560093Z","peer":{"Suffrage":1,"ID":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","Address":"172.16.1.236:4647"}}
{"@level":"warn","@message":"heartbeat timeout reached, starting election","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.787542Z","last-leader-addr":"","last-leader-id":""}
{"@level":"info","@message":"entering candidate state","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.787590Z","node":{},"term":495}
{"@level":"info","@message":"pre-vote successful, starting election","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.852815Z","refused":0,"tally":2,"term":495,"votesNeeded":2}
{"@level":"info","@message":"election won","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.906849Z","tally":2,"term":495}
{"@level":"info","@message":"entering leader state","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.906884Z","leader":{}}
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.906904Z","peer":"607347df-65b5-95b8-2ab3-6e0138a30db7"}
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.906916Z","peer":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3"}
{"@level":"info","@message":"cluster leadership acquired","@module":"nomad","@timestamp":"2025-04-11T13:19:10.907793Z"}
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.933525Z","peer":{"Suffrage":0,"ID":"607347df-65b5-95b8-2ab3-6e0138a30db7","Address":"172.16.1.235:4647"}}
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.933853Z","peer":{"Suffrage":1,"ID":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","Address":"172.16.1.236:4647"}}
{"@level":"info","@message":"eval broker status modified","@module":"nomad","@timestamp":"2025-04-11T13:19:10.956148Z","paused":false}
{"@level":"info","@message":"blocked evals status modified","@module":"nomad","@timestamp":"2025-04-11T13:19:10.956327Z","paused":false}
{"@level":"info","@message":"Promoting server","@module":"nomad.autopilot","@timestamp":"2025-04-11T13:19:20.958465Z","address":"172.16.1.236:4647","id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","name":"nomad03-bre.q-mex.net.bremen"}
{"@level":"info","@message":"updating configuration","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:20.958561Z","command":0,"server-addr":"172.16.1.236:4647","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","servers":"[{Suffrage:Voter ID:84f3b5a8-d979-5d41-c659-badc9ecce162 Address:172.16.1.234:4647} {Suffrage:Voter ID:607347df-65b5-95b8-2ab3-6e0138a30db7 Address:172.16.1.235:4647} {Suffrage:Voter ID:69c7c3cc-256f-e12e-e82c-68bb3f29e6a3 Address:172.16.1.236:4647}]"}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:20:05.373354Z","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","time":505378093}
{"@level":"warn","@message":"failed retrieving server health","@module":"nomad.stats_fetcher","@timestamp":"2025-04-11T13:20:05.963132Z","error":"context deadline exceeded","server":"nomad01-bre.q-mex.net.bremen"}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:21:40.059996Z","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","time":501376848}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:22:09.619104Z","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","time":519836348}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:22:10.196529Z","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","time":506603638}

Nomad Client logs (if appropriate)

Not applicable.

EtienneBruines avatar Apr 11 '25 13:04 EtienneBruines

After rebooting all server nodes, all seems normal again 😕

I have no idea what to make of it. All metrics normal:

  • No memory spikes
  • lastContact is at 20ms ± 30ms
  • No constant leader elections / no more failed to contact logs

EtienneBruines avatar Apr 11 '25 13:04 EtienneBruines

Yikes! Thanks for the report @EtienneBruines, and I'm glad a restart seems to have fixed it. I'm going to leave this open for now so we can do some investigation on our end. If you see anything like this again emailing cpu, memory, and goroutine profiles to [email protected] is greatly appreciated. That's a private list that goes to Nomad engineering at HashiCorp.

Taking a couple of profiles a few minutes apart is useful for finding memory leaks as well. The nomad operator debug command should help, or you can hit the profile endpoints directly: https://developer.hashicorp.com/nomad/api-docs/agent#agent-runtime-profiles
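As a concrete sketch, assuming the agent's debug endpoints are reachable (via enable_debug or a management token) and a hypothetical default address, the profile URLs look like this:

```shell
# Hypothetical agent address; adjust for your cluster.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"

# Agent runtime profile endpoints: heap, goroutine, and a 30s CPU profile.
for p in "pprof/heap" "pprof/goroutine" "pprof/profile?seconds=30"; do
  echo "GET ${NOMAD_ADDR}/v1/agent/${p}"
done

# Fetch one for real with, for example:
#   curl -s -o heap.prof "${NOMAD_ADDR}/v1/agent/pprof/heap"
# or capture a full debug bundle with:
#   nomad operator debug -duration=2m -server-id=all
```

Grabbing the same profile twice, a few minutes apart, makes steady growth obvious when the two snapshots are compared.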

The fact that Raft was having trouble is worrying as well. Any chance you have CPU metrics for the servers? I wonder if something got stuck in a tight loop (potentially creating a lot of garbage which is why the memory does drop back down periodically). The Raft-related goroutines could get starved for CPU leading to what appears to be "network" latency.

schmichael avatar Apr 11 '25 18:04 schmichael

Not sure if related, but I'm also seeing a memory leak since upgrading to v1.9.7. @EtienneBruines, what version were you using before v1.10.0?

The memory increase is also only really visible on one node, a follower, not the leader.

Image

April 1 is when we started the upgrade.

CPU usage is also a bit higher. It has two peaks that don't seem related to the memory increase at the time, but they are really high.

Image

Image

42wim avatar Apr 13 '25 20:04 42wim

@42wim We had upgraded from 1.9.7 to 1.10.0

EtienneBruines avatar Apr 13 '25 20:04 EtienneBruines

@schmichael Thank you for your reply! We will keep an eye on it and send those profiles once it reoccurs.

The fact that Raft was having trouble is worrying as well. Any chance you have CPU metrics for the servers? I wonder if something got stuck in a tight loop (potentially creating a lot of garbage which is why the memory does drop back down periodically). The Raft-related goroutines could get starved for CPU leading to what appears to be "network" latency.

We thought the high CPU load was a 'symptom', caused by the constant leader switching. Guess we were wrong there.

Hmm. Last Friday I noticed all of our Filebeat instances with the add_nomad_metadata processor had been 'malfunctioning'. For every allocation (of which we have about 600, before they get GC'ed after 24h), it'd request all allocations of a certain Nomad Client. It did so every 5 seconds or so. I guess with that call being non-stale, all those requests ended up on the leader, overwhelming it and causing the starvation you described.

@schmichael Would that be a reasonable theory?

Filebeat is not supported or created by HashiCorp, but since there's a chance others are also using this setup, I thought I'd mention it here.

EtienneBruines avatar Apr 14 '25 08:04 EtienneBruines

For every allocation (of which we have about 600, before they get GC'ed after 24h), it'd request all allocations of a certain Nomad Client. It did so every 5 seconds or so. I guess with that call being non-stale, all those requests ended up on the leader, overwhelming it and causing the starvation you described.

Reads should be very cheap, but this seems possible. I would expect msgpack and/or json serialization to feature heavily in a CPU profile in that case since allocation queries do very little processing otherwise.

Any idea what the total requests/second would have been and what the CPU capacity of your Nomad Servers are? With that info we might be able to reproduce the behavior.
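Back-of-the-envelope, the polling pattern quoted above works out to roughly:

```shell
# ~600 allocations, each with a watcher polling every ~5 seconds:
echo "$((600 / 5)) requests/second"
# prints: 120 requests/second
```

That is in the same ballpark as the ~150 requests/second measured in the follow-up reply.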

schmichael avatar Apr 14 '25 18:04 schmichael

@schmichael About 150 requests per second on a mere 2 core @ 2.7GHz VM, requesting /v1/node/NODE_ID/allocations?wait=5000.

I could also see some CPU overhead for all requests being made on non-leader server nodes, since they'd have to use encryption to communicate the request to the leader. I'm not sure on the specifics (whether a new TLS handshake would be needed, or whether it's basically symmetric encryption at this point), but probably overhead. 🤔

Perhaps (this is just a wild guess, not based on any evidence) there are some blocking queries not being cleaned up if they were made during a leader change?
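As an aside, this kind of high-volume polling can usually be kept off the leader by allowing stale reads. A sketch, with a hypothetical address and a placeholder node ID:

```shell
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
NODE_ID="00000000-node-id-placeholder"   # hypothetical; substitute a real node ID

# stale=true lets any server answer from its local replicated state instead
# of forwarding the read to the leader; index/wait keep it a blocking query.
URL="${NOMAD_ADDR}/v1/node/${NODE_ID}/allocations?stale=true&index=0&wait=5s"
echo "GET ${URL}"
# Fetch for real with, for example:  curl -s "${URL}"
```

The trade-off is that a stale read can lag the leader slightly, which is usually acceptable for metadata enrichment like add_nomad_metadata.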

EtienneBruines avatar Apr 14 '25 18:04 EtienneBruines

Was only the server leader ballooning? Could this be related to this issue?

https://github.com/hashicorp/nomad/issues/18113

shantanugadgil avatar Apr 16 '25 09:04 shantanugadgil

We have noticed all server nodes ballooning. However, with a leader election and transfer every 10-20 seconds, they all were leaders regularly.

EtienneBruines avatar Apr 16 '25 09:04 EtienneBruines

Since this does not appear to be trivially or universally reproducible, I think we need more reports/data to proceed.

schmichael avatar Apr 23 '25 20:04 schmichael

My $0.02:

What is the cluster job profile? For me this manifested in a huge way, as I have a lot of batch and cron jobs: many (many) jobs running every minute and some every five minutes.

Also, for me it helped to narrow down the potential issue by enabling debug and taking a profile dump. BTW, the fact that Prometheus metrics were causing this was an educated guess, made only after looking at the call graph generated by the profile (as guided by @tgross).
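For reference, the debug endpoints used for those profile dumps are gated by a top-level agent option; an illustrative fragment (best enabled only while troubleshooting, since it exposes the pprof endpoints without an ACL token):

```hcl
# Agent configuration fragment (illustrative).
enable_debug = true
```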

shantanugadgil avatar Apr 24 '25 08:04 shantanugadgil

Hi mate,

I think I have the same situation. My Nomad is hosted on GCP on an e2-small (shared-core, 2 vCPUs, 4GB memory + 6GB swap). It's only used to run some cron jobs for web crawling every minute.

This is my GCP dashboard for its metrics. You can see that, over almost two days after I restart the machine, memory utilization goes up constantly. Finally, when the swap is also fully occupied, the whole instance goes down.

So, I guess there might be a memory leak in Nomad v1.10.x?

If you need more information to dig into it, please let me know.

  • System Info:

Ubuntu 24.04.3 LTS (GNU/Linux 6.14.0-1018-gcp x86_64)

  • nomad version

Nomad v1.10.5 BuildDate 2025-09-09T14:36:45Z Revision a3b86c697f38ab032e1acaae8503ed10815bc4a2

Last 2 days:

Image

Last week:

Image

payt0nc avatar Oct 30 '25 13:10 payt0nc

@payt0nc are Prometheus metrics enabled on the Nomad servers (and clients)? If enabled, are they being scraped properly?

Can you disable Prometheus metrics to see if that changes anything?
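For anyone following along, turning the Prometheus endpoint off is a one-line telemetry change in the agent configuration; an illustrative fragment:

```hcl
# Agent configuration fragment (illustrative).
telemetry {
  prometheus_metrics = false
}
```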

shantanugadgil avatar Oct 30 '25 16:10 shantanugadgil

@payt0nc are Prometheus metrics enabled on the Nomad servers (and clients)? If enabled, are they being scraped properly?

Can you disable Prometheus metrics to see if that changes anything?

OK, now disabled. Give me some time to capture the data.

payt0nc avatar Oct 31 '25 04:10 payt0nc

Give me some time to capture the data.

Although the graphs you shared are great it would really help us to figure out the exact problem if you were to get a heap profile (via /debug/pprof/heap) while the problem is happening. You can email that to [email protected] which is a write-only mailing list.

That being said:

My Nomad is hosted on GCP with an e2-small (share core, 2vCPU, 4GB memory + 6GB Swap). It's only used to run some cronjobs for web crawling every minute. ... And, at last, when the swap is also fully occupied, whole instance would be down.

Keep in mind that Nomad servers have an in-memory state store of all their jobs, nodes, etc. So if you're hitting swap everything is going to crawl pretty badly and if it OOMs that's it for the instance. From what you're describing this looks like you're sharing a single instance between server and client, so it's vital that you've reserved enough memory on the client to account for the server's in-memory state.
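A sketch of what that reservation looks like in the client stanza; the number is illustrative and should be sized to the server's observed state-store footprint:

```hcl
# Combined server+client agent: hold memory back from the scheduler so
# allocations can't crowd out the server's in-memory state store.
client {
  enabled = true
  reserved {
    memory = 1024 # MB kept out of the scheduler's hands
  }
}
```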

tgross avatar Oct 31 '25 12:10 tgross

Got a heap dump from @payt0nc via the mailing list which looks like this:

Image

This looks to be the issue that we fixed in https://github.com/hashicorp/nomad/pull/26737 which we shipped in Nomad 1.11.0. I'm not seeing it backported though and I can't remember why. Let me investigate that, but in the meantime it would be helpful to know which Nomad version produced the heap dump labelled heap-nomad-202511060323 in that email.

Edit: it wasn't backported because it was a breaking metrics change as noted here:

But this is a breaking metrics change, which is why it's not getting backported.

tgross avatar Dec 01 '25 16:12 tgross