Memory leak on nomad servers
Nomad version
Nomad v1.3.1 (bc49012743edc0e2adb565e0fbcbc36c184f8ce4+CHANGES)
Operating system and Environment details
3 Nomad servers:
- Ubuntu 20.04.1 LTS (focal)
- 8 cores, 32 GiB RAM

~10 large Nomad clients:
- Ubuntu 20.04.1 LTS (focal)
- 96 cores, 384 GiB RAM
- 8 x NVIDIA T4 (no NVIDIA plugin installed)

~20 Nomad clients of various sizes
Issue
Memory leak on nomad servers
We have a memory leak on our Nomad servers. The cluster has around 30 clients, and the leak shows when we run a lot of short-lived jobs (a parameterized job with ~100 new allocs per hour), but memory seems to build up even if we lower the load, just more slowly.
The only way for us to fix this is to run a nomad system gc on one of the servers. Otherwise it uses up all memory on the server until the server needs a reboot to be usable again.
When this happens, the only way to start Nomad without it immediately using 100% memory is to remove the data_dir and start over.
Right now in production we automatically run nomad system gc each time memory usage goes over 30%, and this is how it looks.
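The check itself is roughly like the following (a simplified sketch of the cron job, not our exact script):

#!/usr/bin/env bash
# Simplified sketch: force Nomad's internal GC when memory usage passes 30%.
used_pct=$(free | awk '/^Mem:/ { printf "%d", $3 / $2 * 100 }')
if [ "$used_pct" -gt 30 ]; then
  nomad system gc
fi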
Is there any way of analyzing what actually takes up this amount of memory? Can I analyze the data_dir in any meaningful way?
This log line shows up every now and then; could it have anything to do with it?
2022-10-07T03:53:13.369Z [ERROR] worker: error waiting for Raft index: worker_id=599a4aeb-925e-0764-362f-844469f18e71 error="timed out after 5s waiting for index=23307547" index=23307547
Configuration
Server
data_dir   = "/var/lib/nomad"
bind_addr  = "0.0.0.0"
region     = "{{region}}"
datacenter = "dc-master"

advertise {
  rpc = "{{ipify_public_ip}}"
}

server {
  enabled          = true
  bootstrap_expect = 3
  encrypt          = "{{gossipkey}}"

  server_join {
    retry_join     = ["{{ipa}}", "{{ipb}}", "{{ipc}}"]
    retry_max      = 0
    retry_interval = "15s"
  }

  default_scheduler_config {
    scheduler_algorithm = "spread"

    preemption_config {
      batch_scheduler_enabled   = false
      system_scheduler_enabled  = true
      service_scheduler_enabled = false
    }
  }
}

client {
  enabled = false
}

tls {
  http                   = false
  rpc                    = true
  ca_file                = "--"
  cert_file              = "--"
  key_file               = "--"
  verify_server_hostname = true
  verify_https_client    = true
}
Clients
data_dir   = "/var/lib/nomad"
bind_addr  = "0.0.0.0"
region     = "{{region}}"
datacenter = "dc-edge-node"

plugin "raw_exec" {
  config {
    enabled = true
  }
}

plugin "docker" {
  config {
    volumes {
      enabled = true
    }
    gc {
      image = true
      dangling_containers {
        period = "12h"
      }
    }
  }
}

client {
  cpu_total_compute    = {{ 1170000 if num_gpus else 0 }}
  servers              = ["{{ipa}}", "{{ipb}}", "{{ipc}}"]
  enabled              = true
  node_class           = "{{nodeclass}}"
  ip_resolver_endpoint = "{{ipresolverendpoint}}"

  meta {
    gpu.numcores                        = {{num_gpus}}
    features                            = "{{features}}"
    region                              = "{{nodeclass}}"
    service                             = "{{service}}"
    edge_node.environment               = "{{edgenodeenvironment}}"
    edge_node.datacenter_cache_endpoint = "{{datacentercacheendpoint}}"
  }

  options {
    gc_interval = "1m",
    gc_max_allocs = 50,
    docker.cleanup.image = "false"
  }
}

tls {
  http                   = true
  rpc                    = true
  ca_file                = "--"
  cert_file              = "--"
  key_file               = "--"
  verify_server_hostname = true
  verify_https_client    = true
}
Reproduction steps
Expected Result
Actual Result
Job file (if appropriate)
We have many jobs running, but this is the main one, used with Levant: job.hcl
Nomad Server logs (if appropriate)
These are some sample logs from the server.
Nomad Client logs (if appropriate)
Hi @D4GGe and thanks for raising this issue. I don't believe you are experiencing a memory leak on the servers because, as you mention, running Nomad's internal GC process brings the memory usage back down. This therefore seems to be normal behaviour on Nomad's part, and I will try to explain in a little more detail below what is happening, along with some configuration options that might help.
"run a lot of short lived jobs (parameterized job ~100 new allocs per hour)"
Each job registration results in a number of objects (jobspec, evaluation, allocation, deployment, etc.) being created and subsequently stored within the Nomad state store. This state store is held in memory and replicated to each server via Raft from the leader. I would therefore expect Nomad's memory usage to grow as more jobs and allocations are added to the cluster.
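For illustration (hypothetical job name), each dispatch of a parameterized job registers a new child job, along with its evaluation and allocation, and all of these live in the state store until they are garbage collected:

nomad job dispatch my-param-job   # hypothetical parameterized job
nomad job status                  # child jobs accumulate as my-param-job/dispatch-<timestamp>-<id>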
"fix this is to run a nomad system gc on one of the servers"
The Nomad leader runs a number of internal routines to clean up its state store at regular intervals; this interval usually defaults to 5 minutes. At each interval the leader will identify state objects such as jobs, evaluations, allocations, and deployments which have reached a terminal state, are no longer referenced by other objects, and have been in that terminal state for a configurable threshold time.
The nomad system gc command forces a run of all internal garbage collection processes and bypasses the threshold time for deletion. This results in any object that is considered terminal and collectable being deleted from state no matter how old it is.
The GC intervals and thresholds can be modified via the server configuration block to be more aggressive to account for high job turnover. There are a number of options; the job_gc_interval and job_gc_threshold would be good places to start.
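A minimal sketch of what that might look like in the server block (the values here are illustrative, not recommendations; the defaults are noted in comments):

server {
  enabled = true

  # Collect terminal jobs (and their evals/allocs) more aggressively.
  # Illustrative values only; tune them to your job turnover rate.
  job_gc_interval   = "1m"  # default 5m
  job_gc_threshold  = "1h"  # default 4h
  eval_gc_threshold = "30m" # default 1h
}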
"Is there any way of analyzing what actually takes up this amount of memory? Can I analyze the data_dir in any meaningful way?"
The Nomad API can provide some information about what objects are being stored within state, such as jobs, allocations, and evaluations. Another option would be to use the operator debug CLI tool to dump out information about your system for further investigation. This is a common practice we use internally and will provide a wide range of data.
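For example (a rough sketch; adjust the address, TLS, and ACL settings to your environment):

# Count the objects currently held in server state via the HTTP API.
curl -s "$NOMAD_ADDR/v1/jobs"        | jq length
curl -s "$NOMAD_ADDR/v1/allocations" | jq length
curl -s "$NOMAD_ADDR/v1/evaluations" | jq length

# Capture a debug bundle from all servers for offline analysis.
nomad operator debug -duration=2m -server-id=all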
The data directory on the server most importantly holds the BoltDB data containing the Raft log entries. This allows the Nomad process to restart and reload its data without any loss. It is possible to view this data with BoltDB CLI tools; however, I do not believe it would be useful in this case.
"timed out after 5s waiting"
This could be an indication of disk contention and performance issues, or of general Raft contention, but I can't determine that from the information presented. It is not related to the memory consumption you are seeing.
Hopefully this explanation makes sense and is useful. Please let me know if you have any follow-up questions; otherwise I will close this issue after a few days.
Thank you for the really good explanation! I will experiment with job_gc_interval and job_gc_threshold and come back!
Would it also help to increase the number of servers, or to increase their performance? How much performance is it common to need on the Nomad servers in a case like ours?
"Would it also help to increase the number of servers, or to increase their performance?"
I don't believe increasing the number of servers will have any effect, as each server stores a replica of the state store in memory. It has the potential to make the situation worse, as the data needs to be replicated more times from the leader. Depending on how tuning the GC parameters goes, it could be beneficial to increase the available RAM on the servers, but I am confident the GC modifications are the correct solution.
"How much performance is it common to need on the Nomad servers in a case like ours?"
We don't have exact figures or data, and each environment is different. That being said, I have certainly seen this pattern before in clusters that have high turnover rates of batch workloads. As above, I would expect the GC tuning to make this a situation where the current specs are enough.
Hi, I changed the parameters to this:
job_gc_interval = "30s"   // default 5m
job_gc_threshold = "30m"  // default 4h
eval_gc_threshold = "15m" // default 1h
Memory consumption now seems to be going up more slowly, but we can still see a memory curve that doesn't stop climbing. This is over 3 days, during which we automatically run nomad system gc whenever memory reaches 30%.
Do you recommend testing with an even lower job_gc_threshold, or are there other parameters I should play around with?
Also, the plan is for the load to go up, meaning a higher turnover of jobs (in the 1000s per hour). Is that at all sustainable with Nomad?
Hello, I am not sure this will be helpful, but I will share some observations from our side regarding a very similar issue that could be related. We are using an operator that applies jobs in a loop based on a Git repository.
Each time a job is applied, it "re-registers" the job (we can see it in the evaluations history) but does nothing, as there is no change or version bump in the job. However, we notice that the JSON output of https://nomad.myenvironment.com/v1/job/mysuperjob-batch/evaluations slowly grows over time (the more replicas of the task in the job, the faster it grows). After some weeks, this JSON can reach 30 MB, and loading the job page in the UI takes several seconds (very slow) or can even cause a memory spike and an OOM on the Nomad server.
Forcing a nomad system gc doesn't change anything, because it seems the evaluation entries won't be GC'd as long as the job exists. However, removing the job and running nomad system gc before the operator recreates the job with the same name does reset the evaluations and instantly reduces the average memory used from 5 GB to 780 MB.
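For reference, a quick way to see how large that evaluations payload has grown (a rough sketch; the hostname is from our environment, and you may need to add an ACL token):

# Size of the evaluations JSON in bytes, and the number of evaluations retained.
curl -s "https://nomad.myenvironment.com/v1/job/mysuperjob-batch/evaluations" | wc -c
curl -s "https://nomad.myenvironment.com/v1/job/mysuperjob-batch/evaluations" | jq length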
So, in our case, the fix could be:
- not using the operator as is, and avoiding the re-registration when it is not necessary
- having the GC clean up evaluations even for existing, running jobs, instead of keeping them until the job changes name or is removed
I hope this little observation is useful for understanding this behavior and finding a common denominator. Thanks!
EDIT: It seems a ticket about this eval cleanup behavior already exists: https://github.com/hashicorp/nomad/issues/10788
Hi @D4GGe and @arsiesys, thanks for all the additional information. It seems like there are potentially two related issues being observed here (thanks for linking those), so I will mark this as requiring further investigation and add it to our backlog.
If you find any additional information, please feel free to add it to this issue.
Hey there!
I was writing another bug report when I was pointed to this one. I think I might have root-caused the problem. Please see the linked issue above.
Fixed in #15097, which will ship in Nomad 1.5.0 (with backports).