Dead task was not stopped and is not planned to run
Nomad version
$ nomad --version
Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c7690cdc72278018dc5dc58aca41204ccc
Operating system and Environment details
# uname -a
Linux ip-172-31-12-145 6.2.0-1010-aws #10~22.04.1-Ubuntu SMP Wed Aug 16 18:00:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Issue
My task is dead. I decided to stop the job and run it again to check for errors, but the dead task is not scheduled to start; Nomad just ignores it. So I cannot figure out how to start the cluster from scratch.
Reproduction steps
- Create a job with three groups, one of which fails.
- Stop the job.
- Start it again.
Expected Result
I expect all groups to start from the beginning.
Actual Result
The dead task was neither planned to stop nor to start again.
nomad output
$ nomad job stop platform-exp
==> 2023-09-12T17:24:32-04:00: Monitoring evaluation "96924055"
2023-09-12T17:24:32-04:00: Evaluation triggered by job "platform-exp"
2023-09-12T17:24:32-04:00: Evaluation within deployment: "a90bd5c8"
2023-09-12T17:24:32-04:00: Evaluation status changed: "pending" -> "complete"
==> 2023-09-12T17:24:32-04:00: Evaluation "96924055" finished with status "complete"
==> 2023-09-12T17:24:32-04:00: Monitoring deployment "a90bd5c8"
! Deployment "a90bd5c8" cancelled
2023-09-12T17:24:32-04:00
ID = a90bd5c8
Job ID = platform-exp
Job Version = 2
Status = cancelled
Description = Cancelled because job is stopped
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
backend 1 1 0 1 2023-09-12T21:33:20Z
redis-db 1 1 0 1 2023-09-12T21:33:20Z
$ nomad job plan cluster.hcl
+/- Job: "platform-exp"
+/- Stop: "true" => "false"
Task Group: "backend" (1 create)
Task: "api-task"
Task: "draft-task"
Task: "knecht-task"
Task: "nginx-task"
Task: "webshot-task"
Task Group: "postgres-db" (1 ignore)
Task: "postgres-task"
Task Group: "redis-db" (1 create)
Task: "redis"
Scheduler dry-run:
- All tasks successfully allocated.
Job Modify Index: 63243
To submit the job with version verification run:
nomad job run -check-index 63243 cluster.hcl
When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
$ nomad job run -check-index 63243 cluster.hcl
==> 2023-09-12T17:23:18-04:00: Monitoring evaluation "efb18639"
2023-09-12T17:23:18-04:00: Evaluation triggered by job "platform-exp"
2023-09-12T17:23:18-04:00: Evaluation within deployment: "a90bd5c8"
2023-09-12T17:23:18-04:00: Allocation "82084c5b" created: node "6fc9184f", group "backend"
2023-09-12T17:23:18-04:00: Allocation "b85dd05d" created: node "6fc9184f", group "redis-db"
2023-09-12T17:23:18-04:00: Evaluation status changed: "pending" -> "complete"
==> 2023-09-12T17:23:18-04:00: Evaluation "efb18639" finished with status "complete"
==> 2023-09-12T17:23:18-04:00: Monitoring deployment "a90bd5c8"
⠇ Deployment "a90bd5c8" in progress...
2023-09-12T17:23:18-04:00
ID = a90bd5c8
Job ID = platform-exp
Job Version = 2
Status = running
Description = Deployment is running
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
backend 1 1 0 0 2023-09-12T21:33:20Z
redis-db 1 1 0 0 2023-09-12T21:33:20Z
After some timeout, all tasks do start after another plan/run.
Probably I am starting tasks too soon after the 'stop' command. It could be useful if Nomad checked for tasks still in a 'stopping' state before starting new ones.
Hi @EugenKon!
I wasn't able to reproduce this kind of behavior, but it definitely looks strange to me. The output you've shown for nomad job stop doesn't show the task group postgres-db but shows the two other task groups as unhealthy. But then the nomad job plan shows the task group as "ignored" which suggests that the postgres-db task group was already successfully running.
Was the postgres-db allocation and its tasks actually running before you stopped the job?
@tgross The postgres-db task had been running, but at the moment of the stop it was in the dead state because it had exceeded the allowed attempts. I think the issue exists because of that exceeded limit.
Before I stopped it, I ran plan/run on the job again, expecting the exceeded-attempts counter to be flushed. That did not help. Then I tried stop/plan/run and ran into the issue I showed here.
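For reference, the "allowed attempts" being discussed live in two different stanzas: `restart` (in-place restarts on the same client) and `reschedule` (placing a replacement allocation). A sketch with illustrative values, not the reporter's actual configuration:

```hcl
group "postgres-db" {
  # In-place restarts on the same node before the task is marked failed.
  restart {
    attempts = 2
    interval = "30m"
    delay    = "15s"
    mode     = "fail"
  }

  # Replacement allocations after the task group fails.
  reschedule {
    attempts  = 1
    interval  = "1h"
    unlimited = false
  }
}
```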
Here is a fresh output:
$ nomad job plan cluster.hcl
Job: "platform-exp"
Task Group: "postgres-db" (1 ignore)
Task: "postgres-task"
Task Group: "redis-db" (1 ignore)
Task: "redis"
Task Group: "services" (1 ignore)
Task: "api-task"
Task: "draft-task"
Task: "knecht-task"
Task: "nginx-task"
Task: "webshot-task"
Scheduler dry-run:
- All tasks successfully allocated.
Job Modify Index: 63933
To submit the job with version verification run:
nomad job run -check-index 63933 cluster.hcl
When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
$ nomad job run -check-index 63933 cluster.hcl
==> 2023-09-13T10:44:27-04:00: Monitoring evaluation "90c6e51e"
2023-09-13T10:44:27-04:00: Evaluation triggered by job "platform-exp"
2023-09-13T10:44:27-04:00: Evaluation within deployment: "8fd5730e"
2023-09-13T10:44:27-04:00: Evaluation status changed: "pending" -> "complete"
==> 2023-09-13T10:44:27-04:00: Evaluation "90c6e51e" finished with status "complete"
==> 2023-09-13T10:44:27-04:00: Monitoring deployment "8fd5730e"
! Deployment "8fd5730e" failed
2023-09-13T10:44:27-04:00
ID = 8fd5730e
Job ID = platform-exp
Job Version = 14
Status = failed
Description = Failed due to progress deadline
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
postgres-db 1 1 0 1 2023-09-12T22:04:49Z
Thanks for that extra detail @EugenKon!
Before I stopped it, I ran plan/run on the job again, expecting the exceeded-attempts counter to be flushed. That did not help.
Ok, that part is expected behavior if the job hadn't been stopped yet. The postgres-db task group didn't change, so the Nomad scheduler ignores any request to run it again. And I think this behavior isn't getting reset properly when you stop the job.
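As an aside: when the jobspec is unchanged but failed allocations have exhausted their reschedule attempts, forcing a new evaluation is the usual way to ask the scheduler to try again without editing or stopping the job (this is a general tip, and does not address the stop/start behavior itself):

```shell
# Force a new evaluation that ignores prior reschedule attempts
# for the job's failed allocations.
nomad job eval -force-reschedule platform-exp
```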
Then I tried to stop/plan/run and faced into issue I showed here.
Right. Once you've stopped the job, Nomad should be treating that as a change that allows scheduling again, because it obviously did already for the other task groups. From my reproduction (details below), I have a suspicion that the problem is that the desired status is not getting set to stop for the terminal allocation, and that's dictating the behavior of the scheduler. Let me check in with some of my colleagues on this to see if we can figure out where the decision making is going wrong.
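To check that suspicion, the relevant fields can be pulled from the allocation's JSON. A sketch, assuming `jq` is installed; the alloc ID is a placeholder for the terminal allocation's ID:

```shell
# Show the status fields the scheduler keys off of for a terminal
# allocation. If the suspicion is right, DesiredStatus stays "run"
# even after the job has been stopped.
nomad alloc status -json 72ccb0df | jq '{ID, DesiredStatus, ClientStatus}'
```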
Reproduction (Very Long)
I've been able to reproduce the behavior you saw here with the following jobspec:
jobspec
job "example" {
  group "web" {
    task "http" {
      driver = "docker"
      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
      }
      resources {
        cpu    = 100
        memory = 100
      }
    }
  }

  group "db" {
    # No rescheduling, but allow one in-place restart.
    reschedule {
      attempts  = 0
      unlimited = false
    }
    restart {
      attempts = 1
      delay    = "1s"
    }
    task "postgres" {
      driver = "docker"
      config {
        image   = "busybox:x" # invalid tag, so this task fails as intended
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
      }
      resources {
        cpu    = 100
        memory = 100
      }
    }
  }
}
Note that I've disabled rescheduling but allowed 1 restart. If we run the job, the deployment fails as expected:
$ nomad job run ./example.nomad.hcl
==> 2023-09-13T16:27:29-04:00: Monitoring evaluation "6a6b8263"
2023-09-13T16:27:29-04:00: Evaluation triggered by job "example"
2023-09-13T16:27:30-04:00: Evaluation within deployment: "e8998267"
2023-09-13T16:27:30-04:00: Allocation "04c2d55b" created: node "5273c01f", group "web"
2023-09-13T16:27:30-04:00: Allocation "72ccb0df" created: node "5273c01f", group "db"
2023-09-13T16:27:30-04:00: Evaluation status changed: "pending" -> "complete"
==> 2023-09-13T16:27:30-04:00: Evaluation "6a6b8263" finished with status "complete"
==> 2023-09-13T16:27:30-04:00: Monitoring deployment "e8998267"
! Deployment "e8998267" failed
2023-09-13T16:27:35-04:00
ID = e8998267
Job ID = example
Job Version = 0
Status = failed
Description = Failed due to unhealthy allocations
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
db 1 1 0 1 2023-09-13T16:37:29-04:00
web 1 1 0 0 2023-09-13T16:37:29-04:00
If we try to plan this again, it ignores both allocations as expected, because the task group hasn't changed:
$ nomad job plan example.nomad.hcl
Job: "example"
Task Group: "db" (1 ignore)
Task: "postgres"
Task Group: "web" (1 ignore)
Task: "http"
Scheduler dry-run:
- All tasks successfully allocated.
Job Modify Index: 11
To submit the job with version verification run:
nomad job run -check-index 11 example.nomad.hcl
When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
Note that at this point the job status is "running", the allocation status is "failed", and the task is "dead":
nomad job status
$ nomad job status example
ID = example
Name = example
Submit Date = 2023-09-13T16:27:29-04:00
Type = service
Priority = 50
Datacenters = *
Namespace = default
Node Pool = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost Unknown
db 0 0 0 1 0 0 0
web 0 0 1 0 0 0 0
Latest Deployment
ID = e8998267
Status = failed
Description = Failed due to unhealthy allocations
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
db 1 1 0 1 2023-09-13T16:37:29-04:00
web 1 1 1 0 2023-09-13T16:37:40-04:00
Allocations
ID Node ID Task Group Version Desired Status Created Modified
04c2d55b 5273c01f web 0 run running 1m17s ago 1m5s ago
72ccb0df 5273c01f db 0 run failed 1m17s ago 1m11s ago
$ nomad alloc status 72cc
ID = 72ccb0df-1c8c-4c69-e671-f061f68745b0
Eval ID = 6a6b8263
Name = example.db[0]
Node ID = 5273c01f
Node Name = continuity
Job ID = example
Job Version = 0
Client Status = failed
Client Description = Failed tasks
Desired Status = run
Desired Description = <none>
Created = 1m57s ago
Modified = 1m51s ago
Deployment ID = e8998267
Deployment Health = unhealthy
Task "postgres" is "dead"
...
When we stop the job, both task groups show up in the deployment listing:
$ nomad job stop example
==> 2023-09-13T16:33:06-04:00: Monitoring evaluation "4cb0872c"
2023-09-13T16:33:06-04:00: Evaluation triggered by job "example"
2023-09-13T16:33:07-04:00: Evaluation within deployment: "e8998267"
2023-09-13T16:33:07-04:00: Evaluation status changed: "pending" -> "complete"
==> 2023-09-13T16:33:07-04:00: Evaluation "4cb0872c" finished with status "complete"
==> 2023-09-13T16:33:07-04:00: Monitoring deployment "e8998267"
! Deployment "e8998267" failed
2023-09-13T16:33:07-04:00
ID = e8998267
Job ID = example
Job Version = 0
Status = failed
Description = Failed due to unhealthy allocations
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
db 1 1 0 1 2023-09-13T16:37:29-04:00
web 1 1 1 0 2023-09-13T16:37:40-04:00
And if we look at the job status after we've stopped, we see that the web task is Desired=stop and Status=complete, whereas the db task is Desired=run and Status=failed. I suspect this is where the problem is happening!
$ nomad job status example
...
Allocations
ID Node ID Task Group Version Desired Status Created Modified
04c2d55b 5273c01f web 0 stop complete 3m17s ago 1m5s ago
72ccb0df 5273c01f db 0 run failed 3m17s ago 1m11s ago
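The same Desired/Status split can be confirmed over the HTTP API, a sketch assuming a local agent on the default address and `jq` installed:

```shell
# List the job's allocations and project the status fields.
curl -s "${NOMAD_ADDR:-http://127.0.0.1:4646}/v1/job/example/allocations" \
  | jq '.[] | {ID, TaskGroup, DesiredStatus, ClientStatus}'
```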
And if we look at the scheduler's debug logs, the scheduler didn't think anything needed to be done to the db group:
results=
Total changes: (place 0) (destructive 0) (inplace 0) (stop 1) (disconnect 0) (reconnect 0)
Desired Changes for "web": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "db": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)
If we plan the job after stopping it, we see only the web task group is created:
$ nomad job plan example.nomad.hcl
+/- Job: "example"
+/- Stop: "true" => "false"
Task Group: "db" (1 ignore)
Task: "postgres"
Task Group: "web" (1 create)
Task: "http"
Scheduler dry-run:
- All tasks successfully allocated.
...
The scheduler's debug logs tell the same story:
results=
Total changes: (place 1) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0)
Created Deployment: "df57fa02-eeb1-1e59-5c1c-930d0b0e4207"
Desired Changes for "web": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "db": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)
And then if we run the job, we can see exactly what you reported. Only the web task group gets started:
$ nomad job run ./example.nomad.hcl
==> 2023-09-13T16:33:33-04:00: Monitoring evaluation "5b8910cd"
2023-09-13T16:33:33-04:00: Evaluation triggered by job "example"
2023-09-13T16:33:33-04:00: Allocation "395e843d" created: node "5273c01f", group "web"
2023-09-13T16:33:34-04:00: Evaluation within deployment: "8913a73b"
2023-09-13T16:33:34-04:00: Allocation "395e843d" status changed: "pending" -> "running" (Tasks are running)
2023-09-13T16:33:34-04:00: Evaluation status changed: "pending" -> "complete"
==> 2023-09-13T16:33:34-04:00: Evaluation "5b8910cd" finished with status "complete"
==> 2023-09-13T16:33:34-04:00: Monitoring deployment "8913a73b"
✓ Deployment "8913a73b" successful
2023-09-13T16:33:46-04:00
ID = 8913a73b
Job ID = example
Job Version = 2
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
web 1 1 1 0 2023-09-13T16:43:44-04:00
And then we'll verify that by looking at the job status:
$ nomad job status example
...
Allocations
ID Node ID Task Group Version Desired Status Created Modified
395e843d 5273c01f web 2 run running 31s ago 20s ago
04c2d55b 5273c01f web 0 stop complete 6m36s ago 53s ago
72ccb0df 5273c01f db 0 run failed 6m36s ago 6m30s ago
If I change the reschedule block to the following I get the same behavior, except of course that there's an extra allocation when we're done:
reschedule {
  attempts  = 1
  interval  = "15m"
  unlimited = false
}
And if we reproduce this whole thing with only the "db" group, we get similar behavior.
Hi! Is there any news? This seems to be a huge blocker, at least for using Nomad to orchestrate system jobs that are consumed by service jobs. As designed today, I need to stop and purge the entire job (which may be a proxy, DNS cache, logging, or monitoring container for the cluster). All the time it takes to redeploy the system job may mean downtime for the service job's load :(
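For what it's worth, the purge workaround being described looks like the following; it removes the job and its terminal allocation state from the server entirely, which is exactly why it forces the full (and potentially downtime-causing) redeploy:

```shell
# Remove the job and all of its state from the server, then resubmit
# from scratch.
nomad job stop -purge platform-exp
nomad job run cluster.hcl
```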