
Dead task was not stopped and is not planned to run

Open EugenKon opened this issue 2 years ago • 4 comments

Nomad version

$ nomad --version
Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c7690cdc72278018dc5dc58aca41204ccc

Operating system and Environment details

# uname -a
Linux ip-172-31-12-145 6.2.0-1010-aws #10~22.04.1-Ubuntu SMP Wed Aug 16 18:00:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Issue

My task is dead. I decided to stop the job and run it again to check for errors, but Nomad does not schedule the dead task to start again; it just ignores it. So I cannot figure out how to start the cluster from scratch.

Reproduction steps

  1. Create a job with three groups. One of them should fail.
  2. Stop job
  3. Start it again

Expected Result

I expect all groups to start from the beginning.

Actual Result

The dead task was not planned to be stopped and started again.

nomad output

$ nomad job stop platform-exp
==> 2023-09-12T17:24:32-04:00: Monitoring evaluation "96924055"
    2023-09-12T17:24:32-04:00: Evaluation triggered by job "platform-exp"
    2023-09-12T17:24:32-04:00: Evaluation within deployment: "a90bd5c8"
    2023-09-12T17:24:32-04:00: Evaluation status changed: "pending" -> "complete"
==> 2023-09-12T17:24:32-04:00: Evaluation "96924055" finished with status "complete"
==> 2023-09-12T17:24:32-04:00: Monitoring deployment "a90bd5c8"
  ! Deployment "a90bd5c8" cancelled

    2023-09-12T17:24:32-04:00
    ID          = a90bd5c8
    Job ID      = platform-exp
    Job Version = 2
    Status      = cancelled
    Description = Cancelled because job is stopped

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    backend     1        1       0        1          2023-09-12T21:33:20Z
    redis-db    1        1       0        1          2023-09-12T21:33:20Z

$ nomad job plan cluster.hcl
+/- Job: "platform-exp"
+/- Stop: "true" => "false"
    Task Group: "backend" (1 create)
      Task: "api-task"
      Task: "draft-task"
      Task: "knecht-task"
      Task: "nginx-task"
      Task: "webshot-task"

    Task Group: "postgres-db" (1 ignore)
      Task: "postgres-task"

    Task Group: "redis-db" (1 create)
      Task: "redis"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 63243
To submit the job with version verification run:

nomad job run -check-index 63243 cluster.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

 $ nomad job run -check-index 63243 cluster.hcl
==> 2023-09-12T17:23:18-04:00: Monitoring evaluation "efb18639"
    2023-09-12T17:23:18-04:00: Evaluation triggered by job "platform-exp"
    2023-09-12T17:23:18-04:00: Evaluation within deployment: "a90bd5c8"
    2023-09-12T17:23:18-04:00: Allocation "82084c5b" created: node "6fc9184f", group "backend"
    2023-09-12T17:23:18-04:00: Allocation "b85dd05d" created: node "6fc9184f", group "redis-db"
    2023-09-12T17:23:18-04:00: Evaluation status changed: "pending" -> "complete"
==> 2023-09-12T17:23:18-04:00: Evaluation "efb18639" finished with status "complete"
==> 2023-09-12T17:23:18-04:00: Monitoring deployment "a90bd5c8"
  ⠇ Deployment "a90bd5c8" in progress...

    2023-09-12T17:23:18-04:00
    ID          = a90bd5c8
    Job ID      = platform-exp
    Job Version = 2
    Status      = running
    Description = Deployment is running

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    backend     1        1       0        0          2023-09-12T21:33:20Z
    redis-db    1        1       0        0          2023-09-12T21:33:20Z

After some timeout, all tasks are started by another plan/run.

Probably I am starting tasks too soon after the 'stop' command. It would be useful if Nomad checked for tasks still in the 'stopping' state before starting new ones.
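A sketch of that idea in shell, assuming the `nomad` CLI is on PATH. The `all_terminal` helper and the wait loop are hypothetical, not an existing Nomad feature — this just polls until every allocation's client status is terminal before re-running:

```shell
# Hypothetical helper: succeeds only when every client status on stdin
# is a terminal one (complete, failed, or lost).
all_terminal() {
  ! grep -qvE '^(complete|failed|lost)$'
}

# Sketch of a wait loop (assumes a job named "platform-exp" and that the
# CLI can emit one client status per line, e.g. via a -t Go template):
#   until nomad job allocs -t '{{range .}}{{println .ClientStatus}}{{end}}' platform-exp | all_terminal; do
#     sleep 2
#   done
#   nomad job run cluster.hcl
```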

EugenKon avatar Sep 12 '23 21:09 EugenKon

Hi @EugenKon!

I wasn't able to reproduce this kind of behavior, but it definitely looks strange to me. The output you've shown for nomad job stop doesn't show the task group postgres-db but shows the two other task groups as unhealthy. But then the nomad job plan shows the task group as "ignored" which suggests that the postgres-db task group was already successfully running.

Were the postgres-db allocation and its tasks actually running before you stopped the job?

tgross avatar Sep 13 '23 14:09 tgross

@tgross The postgres-db task had been running, and at the moment of the stop it was in the dead state because it had exceeded the allowed restart attempts. I think the issue exists because of that.

Before I stopped it, I ran plan/run on the job again and expected the exceeded-attempts counter to be reset. That did not help. Then I tried stop/plan/run and ran into the issue I showed here.

Here is a fresh output:

$ nomad job plan cluster.hcl
Job: "platform-exp"
Task Group: "postgres-db" (1 ignore)
  Task: "postgres-task"

Task Group: "redis-db" (1 ignore)
  Task: "redis"

Task Group: "services" (1 ignore)
  Task: "api-task"
  Task: "draft-task"
  Task: "knecht-task"
  Task: "nginx-task"
  Task: "webshot-task"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 63933
To submit the job with version verification run:

nomad job run -check-index 63933 cluster.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

$ nomad job run -check-index 63933 cluster.hcl
==> 2023-09-13T10:44:27-04:00: Monitoring evaluation "90c6e51e"
    2023-09-13T10:44:27-04:00: Evaluation triggered by job "platform-exp"
    2023-09-13T10:44:27-04:00: Evaluation within deployment: "8fd5730e"
    2023-09-13T10:44:27-04:00: Evaluation status changed: "pending" -> "complete"
==> 2023-09-13T10:44:27-04:00: Evaluation "90c6e51e" finished with status "complete"
==> 2023-09-13T10:44:27-04:00: Monitoring deployment "8fd5730e"
  ! Deployment "8fd5730e" failed

    2023-09-13T10:44:27-04:00
    ID          = 8fd5730e
    Job ID      = platform-exp
    Job Version = 14
    Status      = failed
    Description = Failed due to progress deadline

    Deployed
    Task Group   Desired  Placed  Healthy  Unhealthy  Progress Deadline
    postgres-db  1        1       0        1          2023-09-12T22:04:49Z

(screenshot omitted)

EugenKon avatar Sep 13 '23 14:09 EugenKon

Thanks for that extra detail @EugenKon!

Before I stopped it, I ran plan/run on the job again and expected the exceeded-attempts counter to be reset. That did not help.

Ok, that part is expected behavior if the job hadn't been stopped yet. The postgres-db task group didn't change, so the Nomad scheduler ignores any request to run it again. And I think this behavior isn't getting reset properly when you stop the job.

Then I tried stop/plan/run and ran into the issue I showed here.

Right. Once you've stopped the job, Nomad should be treating that as a change that allows scheduling again, because it obviously did already for the other task groups. From my reproduction (details below), I have a suspicion that the problem is that the desired status is not getting set to stop for the terminal allocation, and that's dictating the behavior of the scheduler. Let me check in with some of my colleagues on this to see if we can figure out where the decision making is going wrong.
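One way to watch for this from the CLI is to pull the Desired Status line out of `nomad alloc status` output. This is a sketch: the `desired_status` helper name is hypothetical, and it assumes the plain-text field layout (`Desired Status      = run`) that `nomad alloc status` prints:

```shell
# Hypothetical helper: extract the value of the "Desired Status" line
# from `nomad alloc status <id>` output on stdin.
desired_status() {
  awk '/^Desired Status/ { print $NF }'
}

# Sketch of use (against a live cluster):
#   nomad alloc status <alloc-id> | desired_status
# If this still prints "run" for a terminal allocation after the job has
# been stopped, that matches the suspicion described above.
```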


Reproduction (Very Long)

I've been able to reproduce the behavior you saw here with the following jobspec:

jobspec
job "example" {

  group "web" {
    task "http" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
      }

      resources {
        cpu    = 100
        memory = 100
      }
    }
  }


  group "db" {

    reschedule {
      attempts  = 0
      unlimited = false
    }

    restart {
      attempts = 1
      delay    = "1s"
    }

    task "postgres" {

      driver = "docker"

      config {
        image   = "busybox:x"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
      }

      resources {
        cpu    = 100
        memory = 100
      }
    }
  }


}

Note that I've disabled rescheduling but allowed 1 restart. If we run the job, the deployment fails as expected:

$ nomad job run ./example.nomad.hcl
==> 2023-09-13T16:27:29-04:00: Monitoring evaluation "6a6b8263"
    2023-09-13T16:27:29-04:00: Evaluation triggered by job "example"
    2023-09-13T16:27:30-04:00: Evaluation within deployment: "e8998267"
    2023-09-13T16:27:30-04:00: Allocation "04c2d55b" created: node "5273c01f", group "web"
    2023-09-13T16:27:30-04:00: Allocation "72ccb0df" created: node "5273c01f", group "db"
    2023-09-13T16:27:30-04:00: Evaluation status changed: "pending" -> "complete"
==> 2023-09-13T16:27:30-04:00: Evaluation "6a6b8263" finished with status "complete"
==> 2023-09-13T16:27:30-04:00: Monitoring deployment "e8998267"
  ! Deployment "e8998267" failed

    2023-09-13T16:27:35-04:00
    ID          = e8998267
    Job ID      = example
    Job Version = 0
    Status      = failed
    Description = Failed due to unhealthy allocations

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    db          1        1       0        1          2023-09-13T16:37:29-04:00
    web         1        1       0        0          2023-09-13T16:37:29-04:00

If we try to plan this again, it ignores both allocations as expected, because the task group hasn't changed:

$ nomad job plan example.nomad.hcl
Job: "example"
Task Group: "db" (1 ignore)
  Task: "postgres"

Task Group: "web" (1 ignore)
  Task: "http"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 11
To submit the job with version verification run:

nomad job run -check-index 11 example.nomad.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

Note that at this point the job status is "running", the allocation status is "failed", and the task is "dead":

nomad job status
$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2023-09-13T16:27:29-04:00
Type          = service
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
db          0       0         0        1       0         0     0
web         0       0         1        0       0         0     0

Latest Deployment
ID          = e8998267
Status      = failed
Description = Failed due to unhealthy allocations

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
db          1        1       0        1          2023-09-13T16:37:29-04:00
web         1        1       1        0          2023-09-13T16:37:40-04:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
04c2d55b  5273c01f  web         0        run      running  1m17s ago  1m5s ago
72ccb0df  5273c01f  db          0        run      failed   1m17s ago  1m11s ago

$ nomad alloc status 72cc
ID                  = 72ccb0df-1c8c-4c69-e671-f061f68745b0
Eval ID             = 6a6b8263
Name                = example.db[0]
Node ID             = 5273c01f
Node Name           = continuity
Job ID              = example
Job Version         = 0
Client Status       = failed
Client Description  = Failed tasks
Desired Status      = run
Desired Description = <none>
Created             = 1m57s ago
Modified            = 1m51s ago
Deployment ID       = e8998267
Deployment Health   = unhealthy

Task "postgres" is "dead"
...

When we stop the job, both task groups show up in the deployment listing:

$ nomad job stop example
==> 2023-09-13T16:33:06-04:00: Monitoring evaluation "4cb0872c"
    2023-09-13T16:33:06-04:00: Evaluation triggered by job "example"
    2023-09-13T16:33:07-04:00: Evaluation within deployment: "e8998267"
    2023-09-13T16:33:07-04:00: Evaluation status changed: "pending" -> "complete"
==> 2023-09-13T16:33:07-04:00: Evaluation "4cb0872c" finished with status "complete"
==> 2023-09-13T16:33:07-04:00: Monitoring deployment "e8998267"
  ! Deployment "e8998267" failed

    2023-09-13T16:33:07-04:00
    ID          = e8998267
    Job ID      = example
    Job Version = 0
    Status      = failed
    Description = Failed due to unhealthy allocations

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    db          1        1       0        1          2023-09-13T16:37:29-04:00
    web         1        1       1        0          2023-09-13T16:37:40-04:00

And if we look at the job status after we've stopped, we see that the web task is Desired=stop and Status=complete, whereas the db task is Desired=run and Status=failed. I suspect this is where the problem is happening!

$ nomad job status example
...

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
04c2d55b  5273c01f  web         0        stop     complete 3m17s ago  1m5s ago
72ccb0df  5273c01f  db          0        run      failed   3m17s ago  1m11s ago
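A quick filter over the `nomad job status` Allocations table makes allocations stuck in this combination easy to spot. This is a sketch: `stuck_allocs` is a hypothetical helper, and the column positions assume the table layout shown above (column 5 = Desired, column 6 = Status):

```shell
# Hypothetical helper: read the Allocations table on stdin and print the
# IDs of allocations whose Desired is still "run" but whose client
# Status is "failed".
stuck_allocs() {
  awk 'NR > 1 && $5 == "run" && $6 == "failed" { print $1 }'
}

# Sketch of use:
#   nomad job status example | stuck_allocs
```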

And if we look at the scheduler's debug logs, the scheduler didn't think anything needed to be done to the db group:

results= | Total changes: (place 0) (destructive 0) (inplace 0) (stop 1) (disconnect 0) (reconnect 0) | Desired Changes for "web": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0) | Desired Changes for "db": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)

If we plan the job after stopping it, we see only the web task group is created:

$ nomad job plan example.nomad.hcl
+/- Job: "example"
+/- Stop: "true" => "false"
    Task Group: "db" (1 ignore)
      Task: "postgres"

    Task Group: "web" (1 create)
      Task: "http"

Scheduler dry-run:
- All tasks successfully allocated.
...

The scheduler's debug logs tell the same story:

results= | Total changes: (place 1) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0) | Created Deployment: "df57fa02-eeb1-1e59-5c1c-930d0b0e4207" | Desired Changes for "web": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0) | Desired Changes for "db": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)

And then if we run the job, we can see exactly what you reported. Only the web task group gets started:

$ nomad job run ./example.nomad.hcl
==> 2023-09-13T16:33:33-04:00: Monitoring evaluation "5b8910cd"
    2023-09-13T16:33:33-04:00: Evaluation triggered by job "example"
    2023-09-13T16:33:33-04:00: Allocation "395e843d" created: node "5273c01f", group "web"
    2023-09-13T16:33:34-04:00: Evaluation within deployment: "8913a73b"
    2023-09-13T16:33:34-04:00: Allocation "395e843d" status changed: "pending" -> "running" (Tasks are running)
    2023-09-13T16:33:34-04:00: Evaluation status changed: "pending" -> "complete"
==> 2023-09-13T16:33:34-04:00: Evaluation "5b8910cd" finished with status "complete"
==> 2023-09-13T16:33:34-04:00: Monitoring deployment "8913a73b"
  ✓ Deployment "8913a73b" successful

    2023-09-13T16:33:46-04:00
    ID          = 8913a73b
    Job ID      = example
    Job Version = 2
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    web         1        1       1        0          2023-09-13T16:43:44-04:00

And then we'll verify that by looking at the job status:

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
395e843d  5273c01f  web         2        run      running   31s ago    20s ago
04c2d55b  5273c01f  web         0        stop     complete  6m36s ago  53s ago
72ccb0df  5273c01f  db          0        run      failed    6m36s ago  6m30s ago

If I change the reschedule block to the following I get the same behavior, except of course that there's an extra allocation when we're done:

reschedule {
  attempts  = 1
  interval = "15m"
  unlimited = false
}

And if we reproduce this whole thing with only the "db" group, we get similar behavior.

tgross avatar Sep 13 '23 21:09 tgross

Hi! Is there any news on this? It seems to be a huge blocker, at least for using Nomad to orchestrate system jobs that are consumed by a service job. As things stand, I need to stop & purge the entire job (which may be a proxy, DNS cache, logging, or monitoring container for the cluster). All the time spent re-deploying the system job can mean downtime for the service job's load :(

heycarl avatar Nov 29 '25 00:11 heycarl