
Terminal allocations are incorrectly GCed in LIFO order

Open lukepalmer opened this issue 1 year ago • 0 comments

Nomad version

1.6.3, but the bug exists at tip

Operating system and Environment details

Rocky Linux 8

Issue

The Nomad allocation garbage collector collects allocations in the wrong order: it collects the newest terminal allocations first and may never collect the oldest ones.

This matters because it is helpful for containers to remain on disk for some non-zero time after their allocations become terminal. One application is log shipping, where an external shipper must look into the container to find logs. Because of this bug, a terminal allocation can vanish immediately under GC pressure.

Reproduction steps

Submit enough jobs that a GC limit is hit. The easiest limit to hit is gc_max_allocs, which defaults to 50, so just submit more jobs than that onto one node.

Expected Result

When faced with GC pressure, the Nomad client GCs the oldest terminal allocations.

Actual Result

When faced with GC pressure, the Nomad client GCs the newest terminal allocations.

Here is the bug: https://github.com/hashicorp/nomad/blob/f3de47e63dfd14971ce8acfacf36203e00bc7364/client/gc.go#L380

The data structure "GCAllocPQImpl" is not a priority queue at all; in reality it is implemented as a stack!
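To illustrate the difference in eviction order, here is a minimal, self-contained Go sketch. It is not Nomad's code: termAlloc, oldestFirst, drainStack, drainHeap and sampleAllocs are hypothetical names made up for this example, and the assumption that "oldest" means "earliest finish time" is mine. It only contrasts popping from a plain stack with popping from a min-heap built on the standard container/heap package.

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// termAlloc is a hypothetical stand-in for a terminal allocation.
// It is NOT a Nomad type.
type termAlloc struct {
	ID         string
	FinishedAt time.Time // assumed ordering key: when the alloc became terminal
}

// oldestFirst implements heap.Interface as a min-heap on FinishedAt,
// so heap.Pop always returns the allocation that finished earliest.
type oldestFirst []termAlloc

func (q oldestFirst) Len() int           { return len(q) }
func (q oldestFirst) Less(i, j int) bool { return q[i].FinishedAt.Before(q[j].FinishedAt) }
func (q oldestFirst) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *oldestFirst) Push(x any)        { *q = append(*q, x.(termAlloc)) }
func (q *oldestFirst) Pop() any {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

// drainHeap returns IDs in the order a real priority queue would evict them.
func drainHeap(allocs []termAlloc) []string {
	q := &oldestFirst{}
	for _, a := range allocs {
		heap.Push(q, a)
	}
	var out []string
	for q.Len() > 0 {
		out = append(out, heap.Pop(q).(termAlloc).ID)
	}
	return out
}

// drainStack returns IDs in LIFO order: the last one added is evicted first.
func drainStack(allocs []termAlloc) []string {
	stack := append([]termAlloc(nil), allocs...)
	var out []string
	for len(stack) > 0 {
		last := len(stack) - 1
		out = append(out, stack[last].ID)
		stack = stack[:last]
	}
	return out
}

// sampleAllocs builds three allocations that finished one minute apart,
// listed in the order they became terminal.
func sampleAllocs() []termAlloc {
	base := time.Date(2024, 7, 11, 15, 0, 0, 0, time.UTC)
	return []termAlloc{
		{ID: "alloc-1", FinishedAt: base},
		{ID: "alloc-2", FinishedAt: base.Add(time.Minute)},
		{ID: "alloc-3", FinishedAt: base.Add(2 * time.Minute)},
	}
}

func main() {
	allocs := sampleAllocs()
	fmt.Println("stack:", drainStack(allocs))
	fmt.Println("heap: ", drainHeap(allocs))
}
```

With allocations added in the order they finished, the stack evicts alloc-3 (the newest) first, while the heap evicts alloc-1 (the oldest) first, which is the behavior described under Expected Result.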

lukepalmer, Jul 11 '24 15:07