Multiple concurrent job submission causes CPU oversubscription
Nomad version
Nomad v1.2.6 (a6c6b475db5073e33885377b4a5c733e1161020c)
Operating system and Environment details
Ubuntu 20.04
Issue
We've noticed that nodes in our cluster sometimes report oversubscription of CPU, or rather a negative value of unallocated CPU.
The value shown in the graph below is the raw value from the gauge nomad_client_unallocated_cpu.

We can also see this in the Nomad Topology view, where the bar for allocated CPU is overflowing the container:

Reproduction steps
We've only managed to observe this behavior when we start multiple jobs in a very short time (read milliseconds), leading us to believe that this is some kind of timing issue, where the calculation of available resources isn't able to take all the new allocations into account.
Expected Result
We never expect a node to report a negative value for unallocated CPU.
Actual Result
Multiple nodes in our cluster report negative values for unallocated CPU.
Hi @ostkrok, can you describe your environment a little more? Another possibility here could be that the CPU fingerprint changes over time, if its reported frequency fluctuates.
e.g.
nomad node status -self -verbose | grep cpu\.
Hi @shoenig,
I'd be happy to describe it a bit more, but I'm not really sure what you want to know :) Is there anything specific you'd like me to describe?
I don't think that the CPU fingerprint changes for us. It's a physical server using the performance CPU governor, and we've never seen it report anything other than 88 800 MHz.
> nomad node status -self -verbose | grep cpu\.
cpu.frequency = 3700
cpu.totalcompute = 88800
Here's a graph showing nomad_client_allocated_cpu + nomad_client_unallocated_cpu, which seems to always stay at 88 300 MHz.

I'm guessing that the difference of 500 MHz is related to us setting this in our Nomad config:
client {
  reserved {
    cpu = 500
  }
}
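For reference, the arithmetic lines up: 88 800 MHz of cpu.totalcompute minus the 500 MHz we reserve leaves 88 300 MHz available for allocation, which matches the plateau in the graph.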
And here's a graph showing both metrics individually:

Hi @shoenig,
Just checking back to see if there is any additional information we can provide that would help you identify the cause of this issue.
@ostkrok you mentioned:
> We've only managed to observe this behavior when we start multiple jobs in a very short time (read milliseconds).
It might be helpful to look at whether the overlapping amount is from allocations that are currently in the process of shutting down to be replaced by other allocations that are currently being placed (either from that job or another one run at the same time). I checked the code at client/client.go#L2986 where this measurement is being taken, and it looks like if you had allocations that were migrated during updates you could almost certainly see this during the migration process.
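In very rough terms, the calculation behind that gauge works like the sketch below. This is illustrative only, not the actual client code; Alloc and ServerTerminal are made-up stand-ins for the real types.

package main

import "fmt"

// Alloc is a stand-in for the real allocation struct; illustrative only.
type Alloc struct {
	CPU            int  // MHz requested by the allocation
	ServerTerminal bool // the server already considers this alloc terminal
}

// unallocatedCPU roughly mirrors how the client derives the
// nomad_client_unallocated_cpu gauge: sum the CPU of allocations that are
// not server-terminal, then subtract that plus the reserved amount from
// total compute.
func unallocatedCPU(totalCompute, reserved int, allocs []Alloc) int {
	allocated := 0
	for _, a := range allocs {
		if a.ServerTerminal {
			continue // terminal on the server: not counted towards usage
		}
		allocated += a.CPU
	}
	return totalCompute - reserved - allocated
}

func main() {
	allocs := []Alloc{{CPU: 400}, {CPU: 400, ServerTerminal: true}}
	fmt.Println(unallocatedCPU(1000, 100, allocs)) // prints 500
}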
Hi again @tgross. Sorry for not getting back sooner.
> It might be helpful to look at whether the overlapping amount is from allocations that are currently in the process of shutting down to be replaced by other allocations that are currently being placed (either from that job or another one run at the same time). I checked the code at client/client.go#L2986 where this measurement is being taken, and it looks like if you had allocations that were migrated during updates you could almost certainly see this during the migration process.
If this is the cause of our issue, wouldn't you think that the duration of the oversubscription would be shorter?
When we observe this behavior, the oversubscription persists until we manually stop jobs on the Nomad client. In the screenshot above you can see that one of our nodes was oversubscribed for around 7 hours.
Hi again,
I just wanted to follow up with some more information regarding this.
Nomad 1.3.1
We initially noticed the problem with Nomad v1.2.6 (a6c6b475db5073e33885377b4a5c733e1161020c), but we've since updated to Nomad v1.3.1 (2b054e38e91af964d1235faa98c286ca3f527e56), and we're still seeing this issue.

Job Creation
I also want to add that we're creating these jobs via the API, if that helps to narrow down what could be causing the issue.
> If this is the cause of our issue, wouldn't you think that the duration of the oversubscription would be shorter?
> When we observe this behavior, the oversubscription persists until we manually stop jobs on the Nomad client. In the screenshot above you can see that one of our nodes was oversubscribed for around 7 hours.
Ah yes, good point. I'm looking through the section of the client code (ref client.go#L2967-L2990) where this gets set and something that jumps out at me is that if we have allocations that are terminal on the server but not the client, they don't get counted towards the usage. Any chance you have any allocations that are currently stuck in pending on the client?
> Ah yes, good point. I'm looking through the section of the client code (ref client.go#L2967-L2990) where this gets set and something that jumps out at me is that if we have allocations that are terminal on the server but not the client, they don't get counted towards the usage. Any chance you have any allocations that are currently stuck in pending on the client?
As I mentioned, we sometimes start a large number of jobs (sometimes 10 or more) within 100 ms of each other. So I guess they start out pending? But they quite quickly move on to running, and stay running for several hours.
And just to check, do you think this is only a problem with the emitted metrics? Because we think it's more of a scheduling issue. I'm not really sure how Nomad is supposed to work in these conditions, but let me give you an example of how we think Nomad is supposed to work.
Let's say that we have a single node with 1000 CPU, and nothing reserved. We then start 15 jobs, each requiring 100 CPU, at the same time (let's say during 1 millisecond for this example).
What we expect to happen here is that Nomad will realize that scheduling all these on the one node would result in the node becoming overcommitted, and thereby only schedule 10 of the jobs on that node. Is this a correct assumption? Because what we observe makes it seem that Nomad doesn't correctly take this into account when multiple jobs are started within a very short timeframe.
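To spell out the arithmetic in that example: the 15 jobs request 15 × 100 = 1500 CPU against a node with only 1000 CPU, so we'd expect at most 10 of them (10 × 100 = 1000 CPU) to be placed on that node, and the remaining 5 to stay unplaced, since there is only the one node in this example.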
> What we expect to happen here is that Nomad will realize that scheduling all these on the one node would result in the node becoming overcommitted, and thereby only schedule 10 of the jobs on that node. Is this a correct assumption?
That's correct. Nomad has an optimistically concurrent scheduler where each Evaluation is scheduled concurrently across the servers, but then the evaluation results in a Plan that's submitted to the Plan Applier that runs on the leader. This Plan Applier serializes the evaluations so that only allocations that fit are placed, and it rejects plans that don't fit to send back to the scheduler worker. (See Scheduling for more details on this.)
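To illustrate the idea with a toy example: the sketch below shows only the serialization concept and is not Nomad's actual plan applier code; Plan, Placement, and applyPlans are made-up names.

package main

import "fmt"

// Placement and Plan are illustrative stand-ins. Schedulers build plans
// concurrently, but the plan applier evaluates them one at a time against
// each node's remaining capacity and rejects placements that no longer fit,
// sending those back to the scheduler worker to replan.
type Placement struct {
	NodeID string
	CPU    int // MHz requested
}

type Plan struct {
	Placements []Placement
}

func applyPlans(plans []Plan, freeCPU map[string]int) (applied, rejected []Placement) {
	for _, plan := range plans { // serialized: one plan at a time
		for _, p := range plan.Placements {
			if freeCPU[p.NodeID] >= p.CPU {
				freeCPU[p.NodeID] -= p.CPU
				applied = append(applied, p)
			} else {
				rejected = append(rejected, p) // doesn't fit any more; replan
			}
		}
	}
	return applied, rejected
}

func main() {
	// 15 concurrent jobs of 100 MHz against a 1000 MHz node: 10 fit, 5 don't.
	free := map[string]int{"node1": 1000}
	plans := make([]Plan, 15)
	for i := range plans {
		plans[i] = Plan{Placements: []Placement{{NodeID: "node1", CPU: 100}}}
	}
	applied, rejected := applyPlans(plans, free)
	fmt.Println(len(applied), len(rejected)) // prints 10 5
}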
> And just to check, do you think this is only a problem with the emitted metrics? Because we think it's more of a scheduling issue.
The reason I was focused on metrics was that's what you were presenting, so I guess I'm trying to figure out at this point whether you actually have overlapping allocations that you shouldn't. You should be able to determine this by looking at the nomad job status :job_id output and then finding the allocations on a given node via nomad node status -verbose (there's a list of allocs at the bottom).
I'm not saying it's impossible for there to be a bug in the optimistic scheduling, but the bits of code that do the serialization step are all pretty old and I'd have expected this kind of thing to pop up before. Any chance you're scheduling via cores and not cpu? (That's a newer fit check.) It might help to have a job spec here as well.
Thanks for taking the time to follow up @tgross.
> Any chance you're scheduling via cores and not cpu? (That's a newer fit check.) It might help to have a job spec here as well.
No, we're currently only using cpu.
> The reason I was focused on metrics was that's what you were presenting, so I guess I'm trying to figure out at this point whether you actually have overlapping allocations that you shouldn't. You should be able to determine this by looking at the nomad job status :job_id output and then finding the allocations on a given node via nomad node status -verbose (there's a list of allocs at the bottom).
On our end, we are pretty sure that we have overlapping allocations that we shouldn't, but I understand if you want to verify that :)
It's not just the metrics showing this oversubscription, but also the API. We have built a tool that uses the Nomad HTTP API to fetch all allocations for a node and sum up the allocated CPU. This tool also shows us the same data as the metrics, i.e. more CPU allocated than should be allowed.
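The gist of that tool is roughly the following (a trimmed-down sketch, not our actual code; the address and node ID are placeholders, and the real tool looks at more fields):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// alloc decodes only the fields we need from /v1/node/:node_id/allocations.
type alloc struct {
	ClientStatus string
	Resources    struct {
		CPU int
	}
}

func main() {
	// Placeholders: point these at your Nomad API address and the node in question.
	addr := "http://127.0.0.1:4646"
	nodeID := "your-node-id"

	resp, err := http.Get(fmt.Sprintf("%s/v1/node/%s/allocations", addr, nodeID))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var allocs []alloc
	if err := json.NewDecoder(resp.Body).Decode(&allocs); err != nil {
		panic(err)
	}

	// Sum the requested CPU (MHz) of allocations still running on the node.
	total := 0
	for _, a := range allocs {
		if a.ClientStatus == "running" {
			total += a.Resources.CPU
		}
	}
	fmt.Printf("allocated CPU: %d MHz\n", total)
}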
Furthermore (as I wrote in the initial issue body), we also see this in the Nomad GUI under the Topology view, which makes sense as it uses the same HTTP API, I guess?
I'll get back to you with the output of nomad node status -verbose the next time it happens.
Hi again @tgross,
We've now been able to collect output from nomad node status -verbose.
<redacted>@<redacted>:~$ nomad node status 65177d0c
ID = 65177d0c-ccde-6077-2a1f-0f3fbc07b3ec
Name = <redacted>
Class = worker
DC = <redacted>
Drain = false
Eligibility = eligible
Status = ready
CSI Controllers = <none>
CSI Drivers = <none>
Uptime = 2810h11m22s
Host Volumes = certs
Host Networks = default,public
CSI Volumes = <none>
Driver Status = docker,exec
Allocated Resources
CPU Memory Disk
218800/209500 MHz 27 GiB/62 GiB 8.8 GiB/413 GiB
Allocation Resource Utilization
CPU Memory
25705/209500 MHz 5.2 GiB/62 GiB
Host Resource Utilization
CPU Memory Disk
34507/210000 MHz 12 GiB/63 GiB (/dev/mapper/ubuntu--vg-ubuntu--lv)
Allocations
ID Node ID Task Group Version Desired Status Created Modified
948baeae 65177d0c <redacted> 0 run running 8m5s ago 7m46s ago
d15519d1 65177d0c <redacted> 0 run running 8m5s ago 7m43s ago
2dde4fd8 65177d0c <redacted> 0 run running 8m5s ago 7m44s ago
72630770 65177d0c <redacted> 0 run running 8m5s ago 7m46s ago
cfe25686 65177d0c <redacted> 0 run running 8m5s ago 7m43s ago
afd111fa 65177d0c <redacted> 26 run running 1h50m ago 1h50m ago
6e985152 65177d0c <redacted> 0 run running 23h8m ago 23h7m ago
bb4c9aa1 65177d0c <redacted> 0 run running 23h8m ago 23h7m ago
68013340 65177d0c <redacted> 0 run running 23h8m ago 23h7m ago
a79141bd 65177d0c <redacted> 0 run running 23h8m ago 23h8m ago
047d6cb7 65177d0c <redacted> 0 run running 7d4h ago 7d4h ago
4eb0b94a 65177d0c <redacted> 2 run running 8d22h ago 8d22h ago
505719a9 65177d0c <redacted> 2 run running 8d22h ago 8d22h ago
69a2da54 65177d0c <redacted> 2 run running 10d15h ago 10d15h ago
93660f82 65177d0c <redacted> 2 run running 10d15h ago 10d15h ago
b66b30b6 65177d0c <redacted> 2 run running 10d15h ago 10d15h ago
c5fd2494 65177d0c <redacted> 2 run running 10d15h ago 10d15h ago
bf1eb1ae 65177d0c <redacted> 2 run running 10d15h ago 10d15h ago
cce7c580 65177d0c <redacted> 2 run running 10d15h ago 10d15h ago
d4fb31b1 65177d0c <redacted> 1 run running 10d15h ago 10d15h ago
f5ef6f26 65177d0c <redacted> 2 run running 10d15h ago 10d15h ago
88165c85 65177d0c <redacted> 2 run running 10d15h ago 10d15h ago
9c42df80 65177d0c <redacted> 2 run running 10d15h ago 10d15h ago
b32401ae 65177d0c <redacted> 1 run running 14d4h ago 14d4h ago
eb902ca8 65177d0c <redacted> 4 run running 1mo5d ago 1mo5d ago
8c7c793f 65177d0c <redacted> 4 run running 1mo17d ago 1mo5d ago
6e08badb 65177d0c <redacted> 4 run running 1mo17d ago 1mo5d ago
613693f0 65177d0c node-exporter 4 run running 2mo8d ago 1mo5d ago
09d7de47 65177d0c cadvisor 4 run running 2mo8d ago 1mo5d ago
ec8c50e5 65177d0c <redacted> 1 run running 2mo8d ago 2mo8d ago
One interesting thing is that the allocated CPU seen in the output above (218800) doesn't match what we see in nomad_client_allocated_cpu (218700), i.e. it differs by 100 MHz.
We also went through and checked every alloc with nomad alloc status, fetched the task resources, and summed these up. The sum we got was consistent with the output from nomad node status (218800).
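To put numbers on the oversubscription in the output above: the node advertises 210 000 MHz of compute, of which 209 500 MHz is allocatable after our 500 MHz reservation, yet 218 800 MHz is allocated, i.e. roughly 9 300 MHz more than should ever fit.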
Is there anything else you want me to check (evaluations or something like that) now that we have this state in our cluster?
Hi @tgross,
Just checking back to see if you have any news about this issue? Since my last post we've experienced this issue two more times.
Hi @tgross, I'm experiencing a similar issue with memory oversubscription. Looking at my monitoring system, the problem was already present 2 years ago when I started using Nomad.
Is there any way I can help with this investigation?
Hi again,
Just wanted to give a brief update on our situation. We've updated Nomad several times since the issue was created, and we are now running 1.6.1. From the metrics we keep I can see that this issue hasn't happened once during the last 3 months (we don't keep the Nomad metrics for longer than that).
So from our end, I'd say that this looks fixed, although I'm not sure if it got fixed in 1.4, 1.5, 1.6, or somewhere in between.
Thanks for the update @ostkrok, and apologies for not getting back to this earlier. I was never quite able to reproduce the behavior you're seeing. I'm going to close this issue for now, but certainly if you encounter it again with a current version we'd be happy to take a look.