
eval broker: shed blocked evals older than acknowledged eval

tgross opened this issue 3 years ago

When a scheduler acknowledges an evaluation, the resulting plan is guaranteed to cover everything up to the ModifyIndex ("wait index") that the worker set from the most recent evaluation for that job in the state store. At that point, we no longer need to retain blocked evaluations in the broker that are older than that index.

Move these stale evals into a cancelable set. When the Eval.Ack RPC returns from the eval broker, it retrieves a batch of cancelable evals to write to raft. This paces the cancellations by how frequently the schedulers acknowledge evals, which should reduce the risk of cancellations overwhelming raft relative to scheduler progress.

Note that the evals will still need to be deleted during garbage collection, but there's not much we can do about that without preventing the evals from being created in the first place.
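
To make the mechanism concrete, here's a minimal Go sketch of the idea. It is not the real Nomad broker code; the `Eval`, `brokerSketch`, `ack`, and `nextCancelableBatch` names are hypothetical stand-ins for illustration only.

```go
// Hypothetical sketch only; names and structure are illustrative, not the
// real Nomad eval broker implementation.
package sketch

import "sync"

// Eval carries only the fields this sketch needs.
type Eval struct {
	ID          string
	JobID       string
	ModifyIndex uint64
}

// brokerSketch tracks blocked evals per job plus a shared cancelable set.
type brokerSketch struct {
	mu         sync.Mutex
	blocked    map[string][]*Eval // job ID -> blocked evals
	cancelable []*Eval
}

func newBrokerSketch() *brokerSketch {
	return &brokerSketch{blocked: make(map[string][]*Eval)}
}

// ack is called when a scheduler acknowledges an eval for jobID whose plan
// covered everything up to waitIndex. Blocked evals for the same job that
// are older than that index are redundant and move to the cancelable set.
func (b *brokerSketch) ack(jobID string, waitIndex uint64) {
	b.mu.Lock()
	defer b.mu.Unlock()

	var keep []*Eval
	for _, e := range b.blocked[jobID] {
		if e.ModifyIndex < waitIndex {
			b.cancelable = append(b.cancelable, e)
		} else {
			keep = append(keep, e)
		}
	}
	b.blocked[jobID] = keep
}

// nextCancelableBatch is what the Eval.Ack RPC handler would call after the
// broker ack returns: take up to limit evals to write to raft as canceled,
// so cancellation work is paced by scheduler acknowledgements.
func (b *brokerSketch) nextCancelableBatch(limit int) []*Eval {
	b.mu.Lock()
	defer b.mu.Unlock()

	if limit > len(b.cancelable) {
		limit = len(b.cancelable)
	}
	batch := b.cancelable[:limit]
	b.cancelable = b.cancelable[limit:]
	return batch
}
```

Because the cancelable set is only drained when an ack arrives, cancellation writes to raft are naturally throttled by how quickly the schedulers are making progress.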

original approach

When a node updates its fingerprint or status, we need to create new evaluations to ensure that jobs waiting for resources get a chance to be evaluated. But in a cluster with a large backlog of evaluations and flapping nodes, we can end up with many evaluations for the same job. Most of these will be canceled, but we still need to write each evaluation to raft and then write its deletion to raft.

This changeset proposes that we avoid creating evals for jobs that already have a blocked eval in the eval broker. A blocked eval means that the broker already has work in-flight and work waiting to be re-enqueued, so it's safe to drop the evaluation.
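
For illustration, the check in the node-update path would have looked roughly like the sketch below. It reuses the hypothetical `brokerSketch` type from the sketch in the description above and is not the real Node RPC code.

```go
// Hypothetical check in the node-update path; not the real implementation.
func shouldCreateEval(b *brokerSketch, jobID string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	// A blocked eval means the broker already has work in-flight and work
	// waiting to be re-enqueued, so a new eval for the same job would only
	// be created now and canceled later.
	return len(b.blocked[jobID]) == 0
}
```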

tgross commented on Sep 19 '22

Diagram for the original approach to this PR (no longer applicable):
sequenceDiagram
    Node RPC-->>Broker: node-update 1: has blocked eval? (no)
    Node RPC->>Raft: node-update 1
    Raft-->>Broker: enqueue eval
    Node RPC-->>Broker: node-update 2: has blocked eval? (no)
    Node RPC->>Raft: node-update 2
    Raft-->>Broker: can't enqueue eval (blocked)
    Node RPC-->>Broker: node-update 3: has blocked eval? (yes)    
    Broker-->>Scheduler: GetEval
    Node RPC-->>Broker: node-update 4: has blocked eval? (no)
    Node RPC->>Raft: node-update 4
    Raft-->>Broker: can't enqueue eval (blocked)

tgross commented on Sep 19 '22

@schmichael @lgfa29 @shoenig I've now got this into working shape (and green on CI except for the usual test flakes in unrelated areas of code). I'm reasonably comfortable with the idea of landing this in 1.4.0-rc1 for next week but it's probably worth having a discussion about how good we feel about this change coming so late.

tgross commented on Sep 23 '22

I've run a middling-scale performance test of this PR and the results look pretty good. I stood up a cluster of 3 t3.large instances on AWS (8 GiB RAM each).

  • Disable the schedulers on all 3 instances by setting server.num_schedulers = 0 and reloading.
  • Deploy 10 system jobs.
  • Deploy 5000 simulated client nodes via nomad-nodesim.
  • Wait until all 5000 nodes are registered.

At this point, there are 49730 pending evaluations in the state store and in the blocked queue on the broker. After commenting out server.num_schedulers = 0 and reloading, the schedulers restart and begin processing (19:30).

After 15s (19:30):

  • The blocked queue is empty.
  • Because some evals have been acked, we've already updated 16159 evals to canceled in raft, with 33694 evals in the cancelable queue on the broker.
  • At this point, the eval broker has no more work in-flight (in the ready queue).
  • No more acks are happening to drive cancellations, so we've fallen back to moving batches of evals out of the cancelable list and writing them to raft as canceled every 5 sec (sketched below).
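
For reference, that fallback path looks roughly like the sketch below. It builds on the hypothetical brokerSketch type from the sketch in the description (and assumes an `import "time"`); `raftApplyCanceled` and the batch size are stand-ins, with only the 5-second interval taken from the behavior described here.

```go
// Hypothetical fallback reaper: when no acks arrive to pace cancellations,
// periodically drain a bounded batch from the cancelable set anyway.
func reapCancelable(b *brokerSketch, stopCh <-chan struct{}, raftApplyCanceled func([]*Eval) error) {
	ticker := time.NewTicker(5 * time.Second) // interval from the observed behavior above
	defer ticker.Stop()

	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
			// Bound the batch so a single raft write stays small even when
			// the cancelable set is large (the size here is arbitrary).
			if batch := b.nextCancelableBatch(1000); len(batch) > 0 {
				_ = raftApplyCanceled(batch) // error handling elided in this sketch
			}
		}
	}
}
```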

After ~2m (19:32), there are still 16950 evals in the pending state (still on the cancelable queue).

At this point, I stopped the 5000 client nodes. Within 10s, all remaining evals on the cancelable queue had been written to raft as canceled.

As the 5000 nodes miss heartbeats, each spawns 10 evaluations. We can see from the broker metrics that at no point do we have any backlog on the blocked list, because we immediately move batches off to cancelable and clear them to canceled on the next ack. As long as new evals keep coming in, the cancelable list gets cleared out quickly.

[Screenshot: broker metrics during the test, taken 2022-11-09 2:58 PM]

tgross commented on Nov 09 '22

@tgross That test is phenomenal! Any chance you have the terraform or whatever laying around to run it again against 1.4.2? I'd love to compare the graphs.

schmichael commented on Nov 15 '22

I re-ran the test with 1.4.2 and it takes about 25m to reach the state we were able to reach in 5m previously.

Here's the chart for 1.4.2. A few notes:

  • We don't have the new Eval.Count here, and I didn't notice until a little while into the test, so I had to dump nomad eval list and post-process it to fill in that data. That's why some of the data points are covered by trendlines, but the broker_blocked line shows the general progress.
  • If you look at the source data, the number of client nodes isn't a steady 5001: doing this work loads the servers so much that we miss some heartbeats!
  • I stopped the nodes right around the same point as the original tests in terms of amount of outstanding work, but of course that was quite a bit later in terms of elapsed time.
[Screenshot: chart for the 1.4.2 baseline test, taken 2022-11-15 11:44 AM]

And here's the same chart from above, with the time normalized to an elapsed time so we can compare apples-to-apples with the 1.4.2 chart. Conclusion: we're handling the same work in 20% of the time.

[Screenshot: original test chart with time normalized to elapsed time, taken 2022-11-15 11:45 AM]

tgross commented on Nov 15 '22

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] commented on Mar 18 '23