
[15090] Ensure no leakage of evaluations for batch jobs.

stswidwinski opened this issue 2 years ago • 14 comments

Prior to https://github.com/hashicorp/nomad/commit/2409f72d0d42a3225d59347d2604f005bc132636 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case, because allocs are re-parented onto the current job, the comparison trivially always evaluates to false, so evaluations are never garbage collected. This leads to the memory leakage reported here:

https://github.com/hashicorp/nomad/issues/15090 https://github.com/hashicorp/nomad/issues/14842 https://github.com/hashicorp/nomad/issues/4532

The comments within the code and tests are self-contradictory. Some state that evals of batch jobs should never be GCed; others claim that they should be GCed if a new job version is created (as determined by comparing the indexes). I believe that the latter is the expected behavior.

Thus, we compare the creation index of the allocation with the modification index of the job to determine whether an alloc belongs to the current job version.
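
For concreteness, here is a minimal sketch of the fixed check, assuming it sits in package nomad alongside core_sched.go (so structs is github.com/hashicorp/nomad/nomad/structs); the helper name and signature follow the review discussion below and may differ from the final diff:

```go
// olderVersionTerminalAllocs returns the IDs of terminal allocations that
// belong to older versions of a batch job and are therefore eligible for GC.
func olderVersionTerminalAllocs(allocs []*structs.Allocation, job *structs.Job, thresholdIndex uint64) []string {
	var gcAllocIDs []string
	for _, alloc := range allocs {
		// alloc.CreateIndex < job.JobModifyIndex: the alloc was created before
		// the job's last user-facing update, so it belongs to an older version.
		// alloc.ModifyIndex < thresholdIndex: the alloc is also old enough to GC.
		if alloc.CreateIndex < job.JobModifyIndex &&
			alloc.ModifyIndex < thresholdIndex &&
			alloc.TerminalStatus() {
			gcAllocIDs = append(gcAllocIDs, alloc.ID)
		}
	}
	return gcAllocIDs
}
```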

This pull request specifically:

  1. Removes references to alloc.Job, which are unsafe given that allocs deserialized from memDB may contain invalid pointers
  2. Fixes the logical inconsistency described above (compare the creation index with the modification index, not creation with creation)
  3. Fixes the test (the test broke the assumption of alloc re-parenting, thus testing the mock rather than production code)
  4. Adds more tests

Any and all feedback welcome.

For more details, please see the issues linked. Specifically https://github.com/hashicorp/nomad/issues/15090 describes the issue in detail.

EDIT: I am assuming here that an update to the job results in an eval, and consequently allocs, for the job. If this is not correct, then we must take into account the ordering of evals/allocs, which has not been done previously.

stswidwinski avatar Nov 01 '22 15:11 stswidwinski

CLA assistant check
All committers have signed the CLA.

hashicorp-cla avatar Nov 01 '22 15:11 hashicorp-cla

cc: @jrasell

stswidwinski avatar Nov 02 '22 17:11 stswidwinski

I need to test this a little bit more, but I believe that the current code actually references relatively random memory: the result of dereferencing pointers that are serialized and de-serialized while the underlying object is non-static (that is: relocatable).

In particular, the de-serialization proceeds by using the eval ID as an index (https://github.com/hashicorp/nomad/blob/main/nomad/core_sched.go#L294). The GC logic is quite careful not to address pointers in structs except for this one place.
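
As a sketch of the safer pattern (assuming the state-snapshot accessors core_sched.go already uses, namely AllocsByEval and JobByID, plus the go-memdb watch-set API), everything is resolved by ID rather than through embedded pointers:

```go
// Resolve allocs and the job from the state snapshot by ID instead of
// dereferencing alloc.Job, which may be a stale pointer after memDB
// serialization round-trips. Fragment only; assumes the gcEval signature
// discussed later in this thread.
ws := memdb.NewWatchSet()
allocs, err := c.snap.AllocsByEval(ws, eval.ID)
if err != nil {
	return false, nil, err
}
job, err := c.snap.JobByID(ws, eval.Namespace, eval.JobID)
if err != nil {
	return false, nil, err
}
```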

EDIT: It seems that my hypothesis was correct. I have deployed it to a test cluster and observed that the correct gc timeouts on evals are now applied to batch jobs and we do not start allocations when we shouldn't. I will modify the fix here and open up for review tomorrow.

stswidwinski avatar Nov 02 '22 19:11 stswidwinski

Any progress on merging this? Having huge problems in our environment with this!

D4GGe avatar Nov 29 '22 08:11 D4GGe

At this point we're waiting for the Nomad folks to reproduce the issue and reason about the original rationale for the existing code. Given that it's not uncommon for bugs to hide other bugs, I appreciate the scrutiny. I hope we will hear back soon.

Take a look at the issue attached to the PR for the discussion so far.


stswidwinski avatar Nov 29 '22 10:11 stswidwinski

Any chance of this making it into the 1.4.x series as well, in addition to the 1.5.0 release?

shantanugadgil avatar Dec 08 '22 12:12 shantanugadgil

Sorry for the delay on this... Luiz is on a well-deserved holiday so I'm going to pick up the review from here with Seth. We're juggling a couple other things at the moment but this is definitely important so expect we'll follow through soonish. Thanks for your patience.

@shantanugadgil this is a bug fix, so it'll get backported to two major versions (i.e. if we release it in 1.5.0 it'll get backported to 1.4.x and 1.3.x).

tgross avatar Dec 08 '22 13:12 tgross

@tgross Thanks for the update. :pray: Looking forward to the backported releases of 1.4.x. :+1:

shantanugadgil avatar Dec 09 '22 08:12 shantanugadgil

Sounds great. Acking the comment and I'll come back to fix up the code early next week.

On Fri, Dec 9, 2022, 2:52 PM Tim Gross commented on this pull request:

Hi @stswidwinski https://github.com/stswidwinski! Thanks for your patience on this. I've given it a review and I've left a few comments to tighten up the code that'll hopefully make it easier for the next folks who read this 😁

But I feel pretty good about this approach overall. I'm getting towards the end of my week here so I'm going to give this one more pass early next week and hopefully we can wrap it up then. Thanks!

In nomad/core_sched.go https://github.com/hashicorp/nomad/pull/15097#discussion_r1044764478:

```diff
+// olderVersionTerminalAllocs returns a tuplie ([]string, bool). The first element is the list of
+// terminal allocations which may be garbage collected for batch jobs. The second element indicates
+// whether or not the allocation itself may be garbage collected.
```

⬇️ Suggested change

```diff
-// olderVersionTerminalAllocs returns a tuplie ([]string, bool). The first element is the list of
-// terminal allocations which may be garbage collected for batch jobs. The second element indicates
-// whether or not the allocation itself may be garbage collected.
+// olderVersionTerminalAllocs returns a tuple ([]string, bool). The first element is the list of
+// terminal allocations which may be garbage collected for batch jobs. The second element indicates
+// whether or not the evaluation itself may be garbage collected.
```

Also, the caller can determine if it's safe to GC the eval because the length of the allocs parameter has to match the length of the return value, right? And if there are no allocs left, is it safe to just go ahead and GC the eval without calling this function?

In command/agent/config.go https://github.com/hashicorp/nomad/pull/15097#discussion_r1044772065:

```diff
 	EvalGCThreshold string `hcl:"eval_gc_threshold"`
+
+	// BatchEvalGCThreshold controls how "old" an evaluation must be to be eligible
+	// for GC if the eval belongs to a batch job.
+	BatchEvalGCThreshold string `hcl:"batch_eval_gc_threshold"`
```

There's a ParseDuration call we'll need to make in command/agent.go as well. See the equivalent for EvalGCThreshold at agent.go#L359-L365 https://github.com/hashicorp/nomad/blob/v1.4.3/command/agent/agent.go#L359-L365 for an example
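
A hedged sketch of that parsing, modeled directly on the linked EvalGCThreshold handling; BatchEvalGCThreshold is the new field from this PR, and the surrounding variable names are assumed to match the existing code:

```go
// In command/agent/agent.go, mirroring the EvalGCThreshold block linked above:
// parse the user-supplied duration string into the server config.
if gcThreshold := agentConfig.Server.BatchEvalGCThreshold; gcThreshold != "" {
	dur, err := time.ParseDuration(gcThreshold)
	if err != nil {
		return nil, err
	}
	conf.BatchEvalGCThreshold = dur
}
```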

In nomad/core_sched_test.go https://github.com/hashicorp/nomad/pull/15097#discussion_r1044778966:

```go
	if err != nil {
		t.Fatalf("err: %v", err)
	}
	if out == nil {
		t.Fatalf("bad: %v", out)
	}
```

We're migrating over to https://github.com/shoenig/test as we touch old tests and write new ones. For all the new assertions, let's use that.

⬇️ Suggested change

```diff
-	if err != nil {
-		t.Fatalf("err: %v", err)
-	}
-	if out == nil {
-		t.Fatalf("bad: %v", out)
-	}
+	must.NoError(t, err)
+	must.NotNil(t, out)
```

In nomad/core_sched_test.go https://github.com/hashicorp/nomad/pull/15097#discussion_r1044782426:

```diff
 	if err != nil {
 		t.Fatalf("err: %v", err)
 	}
+
+	// Update the time tables to make this work
+	tt := s1.fsm.TimeTable()
+	tt.Witness(2000, time.Now().UTC().Add(-1*s1.config.EvalGCThreshold))
+
+	// A little helper for assertions
+	assertCorrectEvalAlloc := func(
+		ws memdb.WatchSet,
+		eval *structs.Evaluation,
+		allocsShouldExist []*structs.Allocation,
+		allocsShouldNotExist []*structs.Allocation,
+	) {
```

This makes test outputs a lot more legible if a test fails:

⬇️ Suggested change

```diff
 	) {
+		t.Helper()
```

In nomad/core_sched_test.go https://github.com/hashicorp/nomad/pull/15097#discussion_r1044788233:

```go
	alloc3.Job = job2
	alloc3.JobID = job2.ID

	// Insert allocs with indexes older than job.ModifyIndex. Two cases:
```

For clarity, because of what we'll actually check:

⬇️ Suggested change

```diff
-	// Insert allocs with indexes older than job.ModifyIndex. Two cases:
+	// Insert allocs with indexes older than job.ModifyIndex and job.JobModifyIndex. Two cases:
```

In nomad/core_sched_test.go https://github.com/hashicorp/nomad/pull/15097#discussion_r1044796022:

```diff
+// An EvalGC should reap allocations from jobs with a newer modify index and reap the eval itself
+// if all allocs are reaped.
+func TestCoreScheduler_EvalGC_Batch_OldVersionReapsEval(t *testing.T) {
```

These tests have a lot of setup code (as I'm sure you noticed!) and the assertion is only subtly different than the previous one. It might make the whole test more understandable if we collapsed these into a single test and had this one be the final assertion that the eval is GC'd once the conditions are right.

In nomad/core_sched.go https://github.com/hashicorp/nomad/pull/15097#discussion_r1044763168:

```diff
 	for _, alloc := range allocs {
-		if alloc.Job != nil && alloc.Job.CreateIndex < job.CreateIndex && alloc.TerminalStatus() {
+		if alloc.CreateIndex < job.JobModifyIndex && alloc.ModifyIndex < thresholdIndex && alloc.TerminalStatus() {
```

I think so. (*Job).JobModifyIndex is only updated when the user-facing version of the job is updated, as opposed to (*Job).ModifyIndex, which is updated whenever the job is updated in raft for any reason (like being marked complete).
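
To make that distinction concrete, a small illustration (semantics as described above; not an exhaustive account of every code path that bumps these indexes):

```go
// JobModifyIndex advances only when a user submits a new version of the job.
// ModifyIndex advances on any raft write touching the job (e.g. being marked
// complete). So the check asks: was this alloc created before the user's last
// job update, i.e. does it belong to an older job version?
isOlderVersion := alloc.CreateIndex < job.JobModifyIndex
_ = isOlderVersion // illustration only
```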


stswidwinski avatar Dec 09 '22 21:12 stswidwinski

I've actually found a bug in the logic: we would apply the regular GC threshold first and thus fail to GC evaluations that should be GCed when BatchEvalGCPeriod < EvalGCPeriod. Fixed. Adding all of the changes soon.
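
The shape of the fix, sketched under the assumption that the threshold is selected by eval type before any comparison (this mirrors the approach settled on in the review below; the config field names are from this PR):

```go
// Pick the GC threshold by eval type up front, so a batch eval older than
// BatchEvalGCThreshold is collected even when the regular EvalGCThreshold
// has not elapsed yet.
threshold := c.srv.config.EvalGCThreshold
if eval.Type == structs.JobTypeBatch {
	threshold = c.srv.config.BatchEvalGCThreshold
}
```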

stswidwinski avatar Dec 12 '22 22:12 stswidwinski

@stswidwinski is attempting to deploy a commit to the HashiCorp Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] avatar Dec 12 '22 22:12 vercel[bot]

Please note that I cannot run the full suite of tests locally and must leave my computer at this time. If any tests are failing, expect that I will fix them before I ask for meaningful review and/or feedback.

stswidwinski avatar Dec 12 '22 22:12 stswidwinski

@tgross, please let me know if there is anything outstanding from my end. I want to make sure that I'm not blocking you.

stswidwinski avatar Dec 20 '22 08:12 stswidwinski

Thanks for the effort @stswidwinski, just FYI most folks are out until January; we'll likely get this merged first thing then and cut a release shortly after

shoenig avatar Dec 20 '22 14:12 shoenig

Happy new year! I'm wondering if we know when to expect another round of review?


stswidwinski avatar Jan 12 '23 13:01 stswidwinski

Hi @stswidwinski, happy new year 🎉

As my colleagues mentioned I was out on vacation last month. I will try to give it another round of review in the coming days.

lgfa29 avatar Jan 14 '23 02:01 lgfa29

Happy new year! Hope you had a good time away :-) Thanks for the heads up!


stswidwinski avatar Jan 14 '23 07:01 stswidwinski

All fair points, thank you. I have considered the configuration conflict you mention, but decided that the result is surprising in either case. Thank you for resolving the problem!

I'll make the requested changes early next week as I'm already out for the weekend. Have a good day!

On Fri, Jan 27, 2023, 8:01 PM Tim Gross requested changes on this pull request:

Hi @stswidwinski https://github.com/stswidwinski! Again, sorry for the delay in reviewing this. It is a very tricky problem and as you mentioned above it's easy for bugs to hide bugs.

Something that complicated testing this end-to-end, and which took me a little while to remember, is that the time table has a hard-coded 5min resolution. I've pulled down this PR locally and patched the time table to use something tighter to make testing faster. I've found one problem around what happens if you purge a job and the batch_eval_gc_threshold is lower than the eval_gc_threshold. But once that's fixed I think this is ready to go.

Once you've done that, I'll push a commit up to add a changelog entry and docs for the new config flag. But if you'd like to do that instead, there's a make cl command and the docs are at server.mdx https://github.com/hashicorp/nomad/blob/main/website/content/docs/configuration/server.mdx

In nomad/core_sched.go https://github.com/hashicorp/nomad/pull/15097#discussion_r1089368199:

```diff
 	if !collect {
-		// Find allocs associated with older (based on createindex) and GC them if terminal
-		oldAllocs := olderVersionTerminalAllocs(allocs, job)
-		return false, oldAllocs, nil
+		oldAllocs := olderVersionTerminalAllocs(allocs, job, batchThresholdIndex)
+		gcEval := (len(oldAllocs) == len(allocs))
+		return gcEval, oldAllocs, nil
 	}
```
    

If a job has been purged, we'll skip past this block but we'll still have a list of allocs for the eval (which won't have been GC'd yet). When we hit the section below we're passing the thresholdIndex into allocGCEligible. If batchThresholdIndex is less than thresholdIndex (as it was in my testing), this results in the eval avoiding garbage collection until the later thresholdIndex. Not a huge deal, but it means the new config value isn't getting respected in that case.

To fix this bug and to avoid having to check the type multiple times in this function, it might be a good idea to use a single threshold index and set it based on the type at the top of the gcEval function. So something like:

```go
func (c *CoreScheduler) gcEval(eval *structs.Evaluation, thresholdIndex uint64,
	batchThresholdIndex uint64, allowBatch bool) (bool, []string, error) {
	// Ignore non-terminal and new evaluations
	if !eval.TerminalStatus() {
		return false, nil, nil
	}
	if eval.Type == structs.JobTypeBatch {
		thresholdIndex = batchThresholdIndex
	}
```

Alternatively, we could decide that if the job is nil, it's safe to use the lesser of the two values. But that feels like we're getting fussy.

In nomad/core_sched.go https://github.com/hashicorp/nomad/pull/15097#discussion_r1089370456:

```diff
@@ -137,7 +137,7 @@ OUTER:
 	allEvalsGC := true
 	var jobAlloc, jobEval []string
 	for _, eval := range evals {
-		gc, allocs, err := c.gcEval(eval, oldThreshold, true)
+		gc, allocs, err := c.gcEval(eval, oldThreshold, oldThreshold, true)
```
    

I think you're passing the same threshold here for the batch threshold parameter because we want to use the same threshold for everything we're trying to collect in this same pass. But that was a little confusing at a glance, so it might be nice to have a comment about that here.
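
For instance, the requested comment might read something like this (a sketch, not the committed wording):

```go
// This path force-GCs a specific job's evals, so use the same threshold for
// both the regular and batch parameters: everything past oldThreshold goes.
gc, allocs, err := c.gcEval(eval, oldThreshold, oldThreshold, true)
```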

In nomad/config.go https://github.com/hashicorp/nomad/pull/15097#discussion_r1089375347:

```diff
@@ -444,6 +451,7 @@ func DefaultConfig() *Config {
 		ReconcileInterval: 60 * time.Second,
 		EvalGCInterval:    5 * time.Minute,
 		EvalGCThreshold:   1 * time.Hour,
+		BatchEvalGCThreshold: 168 * time.Hour,
```
    

In retrospect, this seems excessive. Let's set it to 24 hours by default, unless you've got a really strong reason to suggest otherwise.


stswidwinski avatar Jan 27 '23 20:01 stswidwinski

> I tested the suggestions I made and they do seem to fix the problem. The changes I made are in https://github.com/hashicorp/nomad/commit/0d80be04fa594201110383b1687734d5e555c8e5. Feel free to cherry-pick them if they look correct, but also feel free to fix the test with a different approach 🙂

This sounds great. I cherry-picked the changes on top.

stswidwinski avatar Jan 30 '23 18:01 stswidwinski

Note: I'll add the changelog entries soon :-)

stswidwinski avatar Jan 30 '23 18:01 stswidwinski

:fireworks: Thank you for bearing with me

stswidwinski avatar Jan 31 '23 13:01 stswidwinski

Sorry, just noticed a mistake in my changes 😅

I pushed a commit to fix it.

lgfa29 avatar Jan 31 '23 16:01 lgfa29