Documentation Enhancement - Garbage Collection
Proposal
I propose a new document under nomad/website/content/docs/internal for documenting Nomad's garbage collection behavior.
Use-cases
The existing documentation lists the various configurations associated with GC, but because these are spread across both server and client configuration, there is no single place to learn how Nomad implements GC and what ways are available to tune it.
Some questions that I think would be useful for such a document to answer are:
- What are the events that explicitly trigger a GC run? For example, for allocations, the code suggests that the only triggers for GC are (1) the gc_interval elapsing, or (2) allocations being created/terminated, or (3) server-side removals triggered by API/CLI calls (nomad system gc) or GCs of evaluations (see cascading bullet below). The existing documentation IMO could be misinterpreted to mean that GC runs are also triggered by the disk/inode thresholds being surpassed (e.g. as if the Nomad client were watching/polling its host's disk usage continuously), which is not the case. Trigger (2) is also not mentioned in any of the docs, which can lead a reader to mistakenly believe alloc GCs are only triggered when gc_interval elapses.
- Once an allocation GC is triggered, how many allocations will be destroyed and how are they chosen? The code suggests that for triggers (1) and (2) above, allocations are removed in termination-time order until no disk/inode/max-alloc thresholds are surpassed, and that only in case (3) are allocations destroyed even when none of these thresholds are surpassed. I wasn't able to find this information from the docs alone.
- What are the triggers for non-allocation GC runs?
- What are available configurations for how non-alloc resources are chosen for GC? This is obvious from reading server stanza documentation; but what's not immediately obvious, without reading through all of client gc_* and server *_gc_* stanza configurations, is that server configs are only for non-alloc resources (except on cascading GC's -- see below), and client configs are only for alloc resources. Similarly, not all resources have the same available controls -- e.g. allocations do not have something like an alloc_gc_threshold configuration similar to (job|deployment|eval)_gc_threshold. Having one place that lists out the GC'able resources and their associated controls across client & server config would be useful.
- What sorts of cascading GC (if any) exist? For example, if a job resource is GC'd, does that include all of its deployments/evals/allocations as well? Or, if an eval is GC'd on the server, are the terminal allocations also GC'd? This latter case appears to be true, but I wasn't able to find it mentioned in the existing docs (save for this comment block), and as this effectively makes the server's eval_gc_threshold config an implicit age threshold on terminal allocations, it'd be useful to document this so that it can be tuned accordingly alongside the client-side alloc parameters.
I understand that committing some details to documentation could make future optimizations more difficult (e.g. the order of alloc termination might be left unspecified so that a future implementation could, for example, destroy allocs based on disk usage if that's the threshold the client is trying to drop below). I think it'd be reasonable to leave out anything that there isn't a desire to commit to in the docs, and maybe to call out that anything not described explicitly is subject to change.
Attempted Solutions
It's possible to glean a full picture by searching for gc across the client & server stanza configuration docs, and by reading client/gc.go for the client-side alloc GC algorithm, client/client.go for the various non-timer GC triggers, and the server-side code (e.g. nomad/core_sched.go) for other cases that can trigger GC. This is not ideal, though, and it is toilsome to share with non-Go-developer audiences within an organization who have a stake in GC configuration.
(Apologies in advance if this documentation exists in other forms and I wasn't able to find it; if that's the case, I propose linking to those docs from the client/server config docs.)