
Memory leak on nomad servers

Open D4GGe opened this issue 1 year ago • 6 comments

Nomad version

Nomad v1.3.1 (bc49012743edc0e2adb565e0fbcbc36c184f8ce4+CHANGES)

Operating system and Environment details

3 nomad servers

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.1 LTS
Release:	20.04
Codename:	focal

8 cores
32 GiB ram

~10 large nomad clients

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.1 LTS
Release:	20.04
Codename:	focal

96 Cores
384 GiB ram
8 x nvidia t4

No nvidia plugin installed

~20 nomad clients of various sizes

Issue

Memory leak on nomad servers

We have a memory leak on our Nomad servers. The cluster has around 30 clients, and the leak shows when we run a lot of short-lived jobs (a parameterized job, ~100 new allocs per hour), but it seems to build up even if we lower the load, just more slowly.

The only way for us to fix this is to run a nomad system gc on one of the servers. Otherwise it uses up all memory on the server until the server needs a reboot to be usable again. When this happens, the only way to start Nomad without it using 100% of memory is to remove the data_dir and start over.

Right now in production we automatically run nomad system gc each time memory goes over 30%, and this is how it looks.

screenshot

Is there any way of analyzing what actually takes up this amount of memory? Can I analyze the data_dir in any meaningful way? This log shows up every now and then; could it have anything to do with it?

2022-10-07T03:53:13.369Z [ERROR] worker: error waiting for Raft index: worker_id=599a4aeb-925e-0764-362f-844469f18e71 error="timed out after 5s waiting for index=23307547" index=23307547

Configuration

Server

data_dir = "/var/lib/nomad"
bind_addr = "0.0.0.0"
region = "{{region}}"
datacenter = "dc-master"
advertise {
	rpc = "{{ipify_public_ip}}"
}
server {
  enabled = true
  bootstrap_expect = 3
  encrypt = "{{gossipkey}}"
   server_join {
    retry_join = ["{{ipa}}", "{{ipb}}", "{{ipc}}"]
    retry_max = 0
    retry_interval = "15s"
  }
   default_scheduler_config {
    scheduler_algorithm = "spread"

    preemption_config {
      batch_scheduler_enabled   = false
      system_scheduler_enabled  = true
      service_scheduler_enabled = false
    }
  }
}

client {
  enabled = false
}


tls {
  http = false
  rpc  = true

  ca_file   = "--"
  cert_file = "--"
  key_file  = "--"

  verify_server_hostname = true
  verify_https_client    = true
}

Clients

data_dir = "/var/lib/nomad"
bind_addr = "0.0.0.0"
region = "{{region}}"
datacenter = "dc-edge-node"

plugin "raw_exec" {
  config {
    enabled = true
  }
}



plugin "docker" {
  config {
    volumes {
      enabled = true
    }
    gc {
      image       = true
      dangling_containers {
        period         = "12h"
      }
    }
  }
}

client {
  cpu_total_compute = {{ 1170000 if num_gpus else 0 }}
  servers = ["{{ipa}}", "{{ipb}}", "{{ipc}}"]
  enabled = true
  node_class = "{{nodeclass}}"
  ip_resolver_endpoint = "{{ipresolverendpoint}}"
  meta {
    gpu.numcores = {{num_gpus}}
    features = "{{features}}"
    region = "{{nodeclass}}"
    service = "{{service}}"
    edge_node.environment = "{{edgenodeenvironment}}"
    edge_node.datacenter_cache_endpoint = "{{datacentercacheendpoint}}"
  }
  options {
    gc_interval = "1m",
    gc_max_allocs = 50,
    docker.cleanup.image = "false"
  }
}
tls {
  http = true
  rpc  = true

  ca_file   = "--"
  cert_file = "--"
  key_file  = "--"

  verify_server_hostname = true
  verify_https_client    = true
}

Reproduction steps

Expected Result

Actual Result

Job file (if appropriate)

We have many jobs running, but this is the main one, used with Levant: job.hcl

Nomad Server logs (if appropriate)

Here are some sample logs from the server:

server-logs.txt

Nomad Client logs (if appropriate)

D4GGe avatar Oct 07 '22 08:10 D4GGe

Hi @D4GGe and thanks for raising this issue. I don't believe you are experiencing a memory leak on the servers because, as you mention, running the Nomad internal GC process reduces the memory usage. This therefore seems like normal behaviour on Nomad's part, and I will try to explain in a little more detail below what is happening, along with some configuration options that might help.

run a lot of short-lived jobs (a parameterized job, ~100 new allocs per hour)

Each job registration results in a number of objects (job spec, evaluation, allocation, deployment, etc.) being created and subsequently stored within the Nomad state store. This state store is held in memory and replicated to each server via Raft from the leader. I would therefore expect Nomad's memory usage to grow as more jobs and allocations are added to the cluster.

fix this is to run a nomad system gc on one of the servers

The Nomad leader runs a number of internal routines to clean up its state store at regular intervals; this interval usually defaults to 5 minutes. At each interval the leader will identify state objects such as jobs, evaluations, allocations, and deployments which have reached a terminal state, are no longer referenced by other objects, and have been in that terminal state for longer than a configurable threshold.

The nomad system gc command forces a run of all internal garbage collection processes and bypasses the threshold time for deletion. This results in any object that is considered terminal and collectable being deleted from state no matter how old it is.

The GC intervals and thresholds can be modified via the server configuration block to be more aggressive, to account for high job turnover. There are a number of options; job_gc_interval and job_gc_threshold would be good places to start.
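As a minimal sketch (the values here are only starting points to experiment with, not recommendations), that tuning would sit alongside the options you already have in the server block:

server {
  // How often the leader runs its job GC routine (default 5m).
  job_gc_interval = "1m"

  // How long a job must be terminal before it is eligible for GC (default 4h).
  job_gc_threshold = "1h"

  // Equivalent threshold for evaluations (default 1h).
  eval_gc_threshold = "30m"
}

Lowering these means terminal objects leave the in-memory state sooner, at the cost of their history disappearing from the API and UI earlier.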

Is there any way of analyzing what actually takes up this amount of memory? Can I analyze the data_dir in any meaningful way?

The Nomad API will give you some information about which objects are being stored within state, such as jobs, allocations, and evaluations. Another option would be to use the operator debug CLI tool to dump out information about your system for further investigation. This is a common practice we use internally, and it provides a wide range of data.

The data directory on the server most importantly holds the BoltDB data, which contains the Raft log entries. This allows the Nomad process to restart and reload its data without any loss. It is possible to view this data with BoltDB CLI tools; however, I do not believe it would be useful in this case.

timed out after 5s waiting

This could be an indication of disk contention and performance issues, or of general Raft contention, but I can't identify that from the information presented. It is not related to the memory consumption you are seeing.

Hopefully this explanation makes sense and is useful. Please let me know if you have any follow up questions, otherwise I will close this issue after a few days.

jrasell avatar Oct 07 '22 09:10 jrasell

Thank you for the really good explanation! I will experiment with job_gc_interval and job_gc_threshold and come back!

Would it also help to increase the number of servers and/or their performance? How much performance do the Nomad servers commonly need in a case like ours?

D4GGe avatar Oct 07 '22 09:10 D4GGe

Would it also help to increase the number of servers and/or their performance?

I don't believe increasing the number of servers will have any effect, as each server stores a replica of the state store in memory. It has the potential to make the situation worse, as the data needs to be replicated more times from the leader. Depending on how tuning the GC parameters goes, it could be beneficial to increase the available RAM on the servers, but I am confident the GC modifications are the correct solution.

How much performance do the Nomad servers commonly need in a case like ours?

We don't have exact figures or data, and each environment is different. That being said, I have certainly seen this pattern before in clusters which have high turnover rates of batch workloads. As above, I would expect the GC tuning to make this a situation where the current specs are enough.

jrasell avatar Oct 07 '22 12:10 jrasell

Hi, I changed the parameters to this:

  job_gc_interval = "30s" // defult 5m
  job_gc_threshold = "30m"  // defult 4h
  eval_gc_threshold = "15m" // defult 1h

Memory consumption seems to be going up more slowly, but we can still see a memory curve that does not seem to stop climbing. This is over 3 days, during which we automatically run nomad system gc when it reaches 30% memory.

image

Do you recommend testing with an even lower job_gc_threshold, or are there other parameters I should play around with? Also, the plan is for the load to go up, meaning a higher turnover of jobs (in the thousands per hour); is this at all sustainable with Nomad?

D4GGe avatar Oct 10 '22 12:10 D4GGe

Hello, I am not sure it will be helpful, but I will share some observations from our side regarding a very similar issue that could be related. We are using an operator that applies jobs in a loop based on a Git repository.

Each time a job is applied, the operator "re-registers" the job (we can see it in the evaluation history) but does nothing, as there is no change or version bump to the job. However, we noticed that the JSON output of https://nomad.myenvironment.com/v1/job/mysuperjob-batch/evaluations slowly grows over time (the more replicas of the task in the job, the faster it grows). After some weeks this JSON can reach 30 MB, and loading the job page in the UI takes several seconds (very slow) or can even cause a memory spike and an OOM on the Nomad server.

Forcing a "nomad system gc" doesn't change anything because it seems that it won't GC the evaluations entries as long as the job exist. However, removing the job and applying a "nomad system gc" before the operator recreate the job with the same name do reset the evaluations and reduce the average used memory from 5Go to 780Mo instantly.

So, in our case, the fix could be:

  • not using the operator as-is, and avoiding the re-register when it is not necessary
  • having the GC clean up evaluations even for existing, running jobs, instead of keeping them until the job changes name or is removed

I hope this little observation will be useful in understanding this behavior and finding a common denominator. Thanks!

EDIT: It seems a ticket about this eval cleaning behavior does exist: https://github.com/hashicorp/nomad/issues/10788

arsiesys avatar Oct 11 '22 20:10 arsiesys

Hi @D4GGe and @arsiesys, thanks for all the additional information. It seems like there are potentially two related issues being observed here (thanks for linking them), so I will mark this as requiring further investigation and add it to our backlog.

If you find any additional information, please feel free to add it to this issue.

jrasell avatar Oct 12 '22 07:10 jrasell

Hey there!

I was writing another bug report when I was pointed to this one. I think I might have root-caused the problem. Please see the linked issue above.

stswidwinski avatar Oct 31 '22 19:10 stswidwinski