scheduler: fixed a bug where the bandwidth of reserved cores was not taken into account
Description
Configuring `cores` as part of the client's reserved resources only prevents scheduling on those cores; the reservation is not reflected in the available MHz bandwidth, which affects logic in places such as bin spread. As a result, a task that requests `cpu` (MHz) rather than `cores` can be placed against bandwidth that the reserved cores should have consumed.
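As a rough sketch (hypothetical types and names, not Nomad's actual code), the accounting gap looks like this: an explicit `reserved { cpu = ... }` value is charged against the node's schedulable MHz, but the bandwidth represented by `reserved { cores = ... }` is not. Using the numbers from the reproduction below (124 cores, 278380 MHz total, cores 0-122 reserved):

```go
package main

import "fmt"

// Illustrative only: these types do not match Nomad's real structs.
type reserved struct {
	cpuMHz        int // reserved { cpu = ... }
	reservedCores int // number of cores in reserved { cores = "..." }
}

// availableMHzBuggy mirrors the reported behavior: reserved cores do not
// reduce the schedulable bandwidth at all.
func availableMHzBuggy(totalMHz, totalCores int, r reserved) int {
	return totalMHz - r.cpuMHz
}

// availableMHzFixed also charges the reserved cores' share, assuming a
// uniform per-core frequency.
func availableMHzFixed(totalMHz, totalCores int, r reserved) int {
	perCore := totalMHz / totalCores
	return totalMHz - r.cpuMHz - perCore*r.reservedCores
}

func main() {
	r := reserved{reservedCores: 123}
	fmt.Println(availableMHzBuggy(278380, 124, r)) // 278380: a cpu = 3000 request fits
	fmt.Println(availableMHzFixed(278380, 124, r)) // 2245: a cpu = 3000 request would not fit
}
```

Under the fixed accounting a `cpu = 3000` request would be rejected, which is exactly what the cpu-based mimic later in the reproduction demonstrates.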
Testing & Reproduction steps
```
[sandbox@nomad-dev ~]$ curl -s localhost:4646/v1/metrics?format=prometheus | egrep '^nomad_client_(un)?allocated_cpu' | grep ready
nomad_client_allocated_cpu{datacenter="dc1",host="nomad-dev",node_class="none",node_id="4ed2046a-e208-0751-2d2d-2bf4d966c140",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 0
nomad_client_unallocated_cpu{datacenter="dc1",host="nomad-dev",node_class="none",node_id="4ed2046a-e208-0751-2d2d-2bf4d966c140",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 278380
[sandbox@nomad-dev nomad]$ cat config.hcl
client {
  reserved {
    # cpu   = 1000
    # cores = "0-122"
  }
}
```
versus
```
[sandbox@nomad-dev nomad]$ curl -s localhost:4646/v1/metrics?format=prometheus | egrep '^nomad_client_(un)?allocated_cpu' | grep ready
nomad_client_allocated_cpu{datacenter="dc1",host="nomad-dev",node_class="none",node_id="dc93b9fe-e3ab-8058-5006-3ee5696c3e1e",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 0
nomad_client_unallocated_cpu{datacenter="dc1",host="nomad-dev",node_class="none",node_id="dc93b9fe-e3ab-8058-5006-3ee5696c3e1e",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 2245
[sandbox@nomad-dev nomad]$ cat config.hcl
client {
  reserved {
    # cpu = 1000
    cores = "0-122"
  }
}
```
So on a VM with 124 cores, leaving only the last core available for scheduling, we supposedly have 1 core, or 2245 MHz of cpu, available. While testing:
```
[sandbox@nomad-dev nomad]$ cat job.hcl
job "redis-job" {
  type = "service"

  group "cache" {
    count = 1

    task "redis" {
      driver = "docker"

      config {
        image = "redis:latest"
      }

      resources {
        cpu = 3000
      }
    }
  }
}
[sandbox@nomad-dev nomad]$ nomad job plan job.hcl
+ Job: "redis-job"
+ Task Group: "cache" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 job.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```
```
[sandbox@nomad-dev nomad]$ cat job.hcl
job "redis-job" {
  type = "service"

  group "cache" {
    count = 1

    task "redis" {
      driver = "docker"

      config {
        image = "redis:latest"
      }

      resources {
        cores = 2
      }
    }
  }
}
[sandbox@nomad-dev nomad]$ nomad job plan job.hcl
+ Job: "redis-job"
+ Task Group: "cache" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache" (failed to place 1 allocation):
    * Resources exhausted on 1 nodes
    * Dimension "cores" exhausted on 1 nodes

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 job.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```
Mimicking the situation by reserving the same amount of bandwidth, but using cpu instead of cores, placement does fail:
```
[sandbox@nomad-dev nomad]$ cat config.hcl
client {
  reserved {
    cpu = 276135
  }
}
[sandbox@nomad-dev nomad]$ cat job.hcl
job "redis-job" {
  type = "service"

  group "cache" {
    count = 1

    task "redis" {
      driver = "docker"

      config {
        image = "redis:latest"
      }

      resources {
        cpu = 3000
      }
    }
  }
}
[sandbox@nomad-dev nomad]$ nomad job plan job.hcl
+ Job: "redis-job"
+ Task Group: "cache" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache" (failed to place 1 allocation):
    * Resources exhausted on 1 nodes
    * Dimension "cpu" exhausted on 1 nodes

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 job.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
```
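As a sanity check on the mimic value: with 278380 MHz across 124 cores, each core contributes 2245 MHz, so reserving cores 0-122 (123 cores) corresponds exactly to the 276135 MHz reserved above:

```go
package main

import "fmt"

func main() {
	totalMHz, totalCores := 278380, 124
	perCore := totalMHz / totalCores
	fmt.Println(perCore)       // 2245, the unallocated MHz reported with cores reserved
	fmt.Println(perCore * 123) // 276135, the cpu value used in the mimic config
}
```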
Links
Contributor Checklist
- [x] **Changelog Entry** If this PR changes user-facing behavior, please generate and add a changelog entry using the `make cl` command.
- [x] **Testing** Please add tests to cover any new functionality or to demonstrate bug fixes and ensure regressions will be caught.
- [ ] **Documentation** If the change impacts user-facing functionality such as the CLI, API, UI, and job configuration, please update the Nomad website documentation to reflect this. Refer to the website README for docs guidelines. Please also consider whether the change requires notes within the upgrade guide.
Reviewer Checklist
- [x] **Backport Labels** Please add the correct backport labels as described by the internal backporting document.
- [ ] **Commit Type** Ensure the correct merge method is selected, which should be "squash and merge" in the majority of situations. The main exceptions are long-lived feature branches or merges where history should be preserved.
- [ ] **Enterprise PRs** If this is an enterprise only PR, please add any required changelog entry within the public repository.
- [ ] If a change needs to be reverted, we will roll out an update to the code within 7 days.
Changes to Security Controls
Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.
Hey @tgross , I like the speed at which you pick up on draft PRs 😉 I have added a rudimentary description covering the basics of the problem at hand and a reproducer for latest main (plus some initial tests). I'm changing work laptops by the end of the week, hence the premature push, so I probably won't be able to continue working on it for a while.
Hey @tgross , before I pick this up again, do you have a preference for first creating a GitHub issue, or perhaps already some feedback on the direction of this solution?
Sorry, I got sidetracked and didn't get a chance to re-review your updated description here. No need to open a new GitHub issue for it, we can discuss here.
The overall problem you're describing makes sense. It looks like the node is fingerprinting the usable compute as I'd expect. For example, this instance has 22 cores and a total of 25400 MHz (a mix of pCores and eCores). If I reserve one core:
```
client {
  reserved {
    cores = "0"
  }
}
```
And as your AvailableResources method reflects, the node fingerprint reports the correct values:
```
$ nomad node status -self -verbose | grep compute
cpu.totalcompute  = 25400
cpu.usablecompute = 24000
```
It looks like your approach is to tweak the subtraction of comparable resources after the fact. And it looks like this has broken some tests. Wouldn't it be better to make sure that the comparable resources actually include the usable CPU (less the reserved cores) to begin with, rather than patching that up?
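For illustration, a sketch of that direction (names are hypothetical, not Nomad's real types): fold the reserved cores' bandwidth into the usable compute when the node is fingerprinted, so the downstream subtraction of comparable resources needs no special-casing:

```go
package main

import "fmt"

// Hypothetical sketch of the suggested approach; these names do not
// match Nomad's actual API.
type nodeCompute struct {
	totalMHz   int
	totalCores int
}

// usableCompute returns the MHz left for scheduling once reserved cores
// are charged, assuming a uniform per-core frequency.
func usableCompute(n nodeCompute, reservedCores int) int {
	perCore := n.totalMHz / n.totalCores
	return n.totalMHz - perCore*reservedCores
}

func main() {
	// The instance above: 22 cores, 25400 MHz total, one core reserved.
	// With mixed pCores/eCores the per-core share is not uniform, so this
	// uniform-division sketch will not reproduce the fingerprinted 24000
	// exactly; real code would sum the reserved cores' actual frequencies.
	n := nodeCompute{totalMHz: 25400, totalCores: 22}
	fmt.Println(usableCompute(n, 1))
}
```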
@mvegter I saw that you pushed an update. I'm going to try to get this reviewed this week.
I've verified this impacts back to 1.8.x+ent, so I've added the appropriate backport labels.
@mvegter are you ready for a re-review on this? I didn't want to jump the gun if you were still in progress.
Hey @tgross , apologies for the late reply. It's been very busy the last couple of days, so I'm not able to contribute much at the moment. I messed up a rebase earlier, hence all the force pushes to restore the branch to the state of your previous review. Feel free to leave any comments, but I'm not sure when I can come back to work on it.