
System job with constraints fails to plan

chilloutman opened this issue 3 years ago • 16 comments

Nomad version

v1.2.6

(Nomad v1.2.6 exhibits the problem described below, while Nomad v1.1.5 works as expected.)

Operating system and Environment details

Nomad nodes are running Ubuntu. Docker driver is used for all tasks.

A set of nodes has node.class set to worker, and there are a few other nodes in the cluster.

Issue

A system job with constraints fails to plan.

Reproduction steps

A job with type = "system" is used to schedule tasks on the worker nodes, so the following constraint is added to the worker group:

constraint {
  attribute = "${node.class}"
  operator  = "="
  value     = "worker"
}
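For context, a minimal complete job file exercising this setup might look like the following sketch (the job name, image, and other values are illustrative placeholders, not taken from the original report):

```hcl
job "worker" {
  type = "system" # one allocation per eligible node

  group "worker" {
    # Restrict placement to nodes whose node_class is "worker".
    constraint {
      attribute = "${node.class}"
      operator  = "="
      value     = "worker"
    }

    task "worker" {
      driver = "docker"

      config {
        image = "example/worker:latest" # placeholder image
      }
    }
  }
}
```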

Expected Result

All the worker nodes should run the worker task, all other nodes should not.

Actual Result

This works sometimes, in particular when there are no allocations on the cluster. But running nomad job plan after allocations are running displays the following warning:

Scheduler dry-run:
- WARNING: Failed to place allocations on all nodes.
  Task Group "worker" (failed to place 1 allocation):
    * Class "entry": 1 nodes excluded by filter
    * Constraint "${node.class} = worker": 1 nodes excluded by filter

This should not be a warning, as the allocations match the job definition once the constraints are taken into account. nomad job run produces the desired state, and the job is displayed as "not scheduled" on all non-worker nodes.

Removing the constraints shows no warning, but obviously schedules the worker task on non-worker nodes, which is unwanted.

The only workaround seems to be either ignoring warnings, which defeats the purpose of nomad job plan, or creating an entirely separate cluster for the workers.

Possibly related:

  • #12016
  • https://discuss.hashicorp.com/t/system-job-with-constrains-fails-to-plan/37816

chilloutman avatar Apr 22 '22 07:04 chilloutman

I'm facing the same problem (1.2.6):

Job: "stage-cron"
Task Group: "cron" (1 ignore)
  Task: "cron"

Scheduler dry-run:

  • WARNING: Failed to place allocations on all nodes. Task Group "cron" (failed to place 1 allocation):
    • Constraint "${meta.env} = stage": 5 nodes excluded by filter

But if I stop the job before submitting a new one, it works as expected:

$ nomad job stop stage-cron
==> 2022-04-25T18:45:07+03:00: Monitoring evaluation "86e8c675"
    2022-04-25T18:45:07+03:00: Evaluation triggered by job "stage-cron"
==> 2022-04-25T18:45:08+03:00: Monitoring evaluation "86e8c675"
    2022-04-25T18:45:08+03:00: Evaluation status changed: "pending" -> "complete"
==> 2022-04-25T18:45:08+03:00: Evaluation "86e8c675" finished with status "complete"

$ nomad job plan ...

+/- Job: "stage-cron"
+/- Stop: "true" => "false"
    Task Group: "cron" (1 create)
      Task: "cron"

Scheduler dry-run:

  • All tasks successfully allocated.

cr0c0dylus avatar Apr 25 '22 15:04 cr0c0dylus

I have found a temporary workaround. You need to add a 1.1.x server to the cluster and stop/start the 1.2.6 leaders until a 1.1.x server becomes the leader.

cr0c0dylus avatar Apr 25 '22 18:04 cr0c0dylus

Hi @chilloutman! This definitely seems like it could be related to #12016. I'm not going to mark it as a duplicate, just in case it's not, but I'll cross-reference it here so that whoever tackles that issue will see this one as well. I don't have a good workaround for you other than to ignore the warnings (they're warnings, not errors), but I realize that isn't ideal.

Just FYI @cr0c0dylus:

I have found a temporary workaround. You need to add 1.1.x server to the cluster and stop-start 1.2.6 leaders until 1.1.x becomes a leader.

This is effectively downgrading Nomad into a mixed-version cluster, which is not supported and highly likely to result in state store corruption. Doing so in order to suppress something that's only a warning is not advised.

tgross avatar May 02 '22 15:05 tgross

Doing so in order to suppress something that's only a warning is not advised.

Unfortunately, it is not only a warning: it cannot allocate the job at all. Another trick is to change one of the limits in the resources stanza, for example adding +1 to the CPU limit. But that doesn't work with some of my jobs.

cr0c0dylus avatar May 02 '22 19:05 cr0c0dylus

I wonder if this is related: https://github.com/hashicorp/nomad/issues/11778#issuecomment-1135582278. It really looks like a bug in the scheduler that incorrectly fails placement during the node feasibility check. It is almost as if it's not iterating through all nodes, but for some reason returns a placement failure before it has exhausted the full list.

ygersie avatar May 31 '22 08:05 ygersie

I am also facing this issue and I had to downgrade nomad.

lssilva avatar Jun 07 '22 13:06 lssilva

I'm wondering if this could be the cause: https://github.com/hashicorp/nomad/pull/11111/files#diff-c4e3135b7aa83ba07d59d003a8ab006915207425b8728c4cf070eee20ab9157a

"// track node filtering, to only report an error if all nodes have been filtered" might not be working as intended. Or maybe, instead of producing only warnings, #11111 ended up causing errors.
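To make that suspicion concrete, here is a toy Go sketch of how this failure mode could arise. This is my own simplified model, not Nomad's actual scheduler code: it counts constraint-filtered nodes and raises a "failed to place" warning whenever nothing new was placed, even though every feasible node already has an allocation (an "ignore").

```go
package main

import "fmt"

// Node is a toy stand-in for a Nomad client node.
type Node struct {
	Name     string
	Class    string
	HasAlloc bool // an allocation for this system job already runs here
}

// Result mimics the scheduler's per-evaluation bookkeeping.
type Result struct {
	Placed, Ignored, Excluded int
	SpuriousWarning           bool
}

// planSystemJob loosely models the behavior reported in this issue:
// nodes failing the class constraint bump an exclusion counter, and a
// warning fires when nodes were filtered and nothing new was placed,
// rather than only when a *feasible* node could not be placed.
func planSystemJob(nodes []Node, class string) Result {
	var r Result
	for _, n := range nodes {
		if n.Class != class {
			r.Excluded++ // filtered by the constraint
			continue
		}
		if n.HasAlloc {
			r.Ignored++ // in-place update: nothing to do
		} else {
			r.Placed++
		}
	}
	// Buggy check: treats any filtered node as a failed placement once
	// all feasible nodes are already covered by existing allocations.
	r.SpuriousWarning = r.Excluded > 0 && r.Placed == 0
	return r
}

func main() {
	nodes := []Node{
		{"w1", "worker", true}, // already running the system job
		{"e1", "entry", false}, // excluded by the class constraint
	}
	r := planSystemJob(nodes, "worker")
	fmt.Printf("placed=%d ignored=%d excluded=%d warning=%v\n",
		r.Placed, r.Ignored, r.Excluded, r.SpuriousWarning)
}
```

This would reproduce the reported symptom: a fresh cluster (no existing allocations) yields at least one placement and no warning, while re-planning the identical job turns every feasible node into an "ignore" and the filtered nodes alone trigger the warning.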

chilloutman avatar Jun 20 '22 12:06 chilloutman

Verified we hit this with constraints on 1.2.6 as well.

The mitigation was reverting to 1.1.5.

I do not know how bugs are prioritized, but this should probably be pretty high.

jmwilkinson avatar Jun 29 '22 23:06 jmwilkinson

BTW, it would be great if those warnings could be completely disabled in the config. If I have 50 nodes in the cluster and a constraint that matches 3 of them, what is the sense of seeing "47 Not Scheduled"? System jobs are very useful for scaling in an HA configuration: I don't need to modify the job stanza, just add or remove nodes with a special meta variable.

cr0c0dylus avatar Jun 30 '22 07:06 cr0c0dylus

I'm wondering if this could be the cause: https://github.com/hashicorp/nomad/pull/11111/files#diff-c4e3135b7aa83ba07d59d003a8ab006915207425b8728c4cf070eee20ab9157a

"// track node filtering, to only report an error if all nodes have been filtered" might not be working as intended. Or maybe, instead of producing only warnings, #11111 ended up causing errors.

It is indeed the cause. Reverting that pull request fixed the issue for me on 1.3.1.

dext0r avatar Jun 30 '22 07:06 dext0r

Nomad v1.2.9 (86192e40b1141237c29fe17f31a5734efc35ef8a)

The problem persists. I still need to stop the 1.2.9 masters in sequence until 1.0.18 becomes the leader and allows the deployment.

cr0c0dylus avatar Jul 14 '22 14:07 cr0c0dylus

There may be a fix in 1.3.2; at least it looks that way: https://github.com/hashicorp/nomad/blob/v1.3.2/scheduler/scheduler_system.go#L298

jmwilkinson avatar Jul 21 '22 15:07 jmwilkinson

The issue still exists in v1.5.3; we frequently run into it when upgrading system jobs.

While the Nomad CLI reports this error, the rollout does still actually happen in Nomad.

seanamos avatar Jun 14 '23 11:06 seanamos

I am seeing the same behavior as @seanamos in v1.6.3

nCrazed avatar Dec 20 '23 20:12 nCrazed

The problem continues to occur in v1.7.3

cr0c0dylus avatar Jan 22 '24 14:01 cr0c0dylus

Can confirm still present in Nomad v1.7.7.

elgatopanzon avatar Jun 26 '24 21:06 elgatopanzon

I'm seeing this with Nomad 1.8.3, but additionally it's failing not only on plan but also on run.

I have 12 nodes running:

$ nomad node status -quiet
b15f6629-da08-0f17-8058-0a3032a769e1
31090485-dbe1-4b72-00bb-0e1282d82210
dc5177d0-7e07-28c2-8ddc-584be7c66c75
22a71c04-f531-7680-0019-b0e51bf83ba1
be37c669-d199-a716-2866-e4642aec3665
dcf03550-47ef-fe32-cbe1-67b711744608
d0e7cbfc-b934-ba23-54ff-cf38531c355a
3a7e8085-44bc-a150-6a1d-0040353a8528
2e01087a-f3a6-f86d-fd36-a18d86b92da2
8d8ba3e3-c180-9f1e-b2a2-08d42aad4e4d
ffbbb77f-744a-26f2-6a21-bf1d19316865
2f0b6be8-6564-6bbe-d85c-2457e532243f

I have a single system job with the following constraint (and no others):

  constraint {
    attribute = "${node.class}"
    value     = "private-t38"
  }

This matches node ffbbb77f:

$ nomad node status ffbbb77f
ID              = ffbbb77f-744a-26f2-6a21-bf1d19316865
Name            = prod-ap-northeast-1-private-t38-i-SOME_INSTANCE_ID
Node Pool       = default
Class           = private-t38
...

When I try to place the job, it fails:

$ nomad job run job.nomad
==> 2024-09-16T19:04:26-04:00: Monitoring evaluation "6c4f7100"
    2024-09-16T19:04:26-04:00: Evaluation triggered by job "metadataproxy"
    2024-09-16T19:04:28-04:00: Evaluation status changed: "pending" -> "complete"
==> 2024-09-16T19:04:28-04:00: Evaluation "6c4f7100" finished with status "complete" but failed to place all allocations:
    2024-09-16T19:04:28-04:00: Task Group "app" (failed to place 1 allocation):
      * Class "private-nomad": 3 nodes excluded by filter
      * Class "private-common": 1 nodes excluded by filter
      * Class "public-dt316": 6 nodes excluded by filter
      * Constraint "${node.class} = private-t38": 11 nodes excluded by filter

The job itself is already running on the 1 node that matches that constraint. Once I stop the job, it can get placed.

One potentially interesting detail is that the job itself hasn't changed between invocations. If I change the job contents somehow, it places properly. This only happens when re-applying a job exactly as it already exists; my guess is that the scheduler detects that the job hasn't changed and therefore marks that node as a conflict for some reason, as opposed to "already placed" (I don't know if there is a word for that).

Not sure if this is exactly the above issue, but happy to dive in further if folks think it's related :)

josegonzalez avatar Sep 16 '24 23:09 josegonzalez

This is fixed by https://github.com/hashicorp/nomad/pull/25850 which will ship in the next release of Nomad (with backports to Nomad Enterprise).

Thanks!

allisonlarson avatar May 15 '25 22:05 allisonlarson

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar Sep 13 '25 02:09 github-actions[bot]