
Forcing a placement with failed deployment

Open hynek opened this issue 8 years ago • 14 comments

Nomad version

Nomad v0.6.0

Operating system and Environment details

Ubuntu Xenial running in LXD

Issue

So this is a bit more obscure than I initially thought.

We had a bit of a rough time (hilariously, Consul running in an LXD container hung up a whole metal server) and I had to kill off a node in the middle of a deployment because it just hung while supposedly downloading a Docker container.

At this point it had already placed an alloc on another node, but I can't make it retry placing the second one. Both nomad run and nomad plan just pretend everything is fine.

I was able to fill up clients that we lost during an outage tonight, so this seems to be specific to the failed deployment?

I could only fix it by forcing a change in the plan (I just rebuilt the container).

Nomad Status

$ nomad status enduser
ID            = enduser
Name          = enduser
Submit Date   = 08/23/17 21:28:51 CEST
Type          = service
Priority      = 50
Datacenters   = scaleup
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
enduser     0       0         1        0       2         1

Latest Deployment
ID          = 6f7b3785
Status      = failed
Description = Failed due to unhealthy allocations

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
enduser     2        2       1        1

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
6c01a752  b77a68ed  enduser     3        stop     lost      08/24/17 10:11:16 CEST
e2693daa  60101861  enduser     3        run      running   08/24/17 10:11:16 CEST
db58bf99  c7caeb6b  enduser     3        run      complete  08/24/17 05:14:15 CEST
2f2c1d45  c7caeb6b  enduser     3        run      complete  08/23/17 21:28:52 CEST

Nomad plan

Job: "enduser"
Task Group: "enduser" (1 ignore)
  Task: "enduser"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 858133
To submit the job with version verification run:

nomad run -check-index 858133 _enduser.hcl

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

Nomad run

$ nomad run _enduser.hcl
==> Monitoring evaluation "788e8161"
    Evaluation triggered by job "enduser"
    Evaluation within deployment: "6f7b3785"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "788e8161" finished with status "complete"

Nomad alloc-status for the broken alloc

$  nomad alloc-status 6c01a752
ID                  = 6c01a752
Eval ID             = bf04b0f5
Name                = enduser.enduser[1]
Node ID             = b77a68ed
Job ID              = enduser
Job Version         = 3
Client Status       = failed
Client Description  = <none>
Desired Status      = stop
Desired Description = alloc is lost since its node is down
Created At          = 08/24/17 10:11:16 CEST
Deployment ID       = 6f7b3785
Deployment Health   = unhealthy

Task "enduser" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
500 MHz  128 MiB  300 MiB  0     http: 10.6.32.3:29213

Task Events:
Started At     = N/A
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                    Type        Description
08/24/17 10:56:55 CEST  Killing     Killing task: vault: failed to derive token: Can't request Vault token for terminal allocation
08/24/17 10:16:17 CEST  Driver      Downloading image docker.XXX/enduser:1136
08/24/17 10:11:17 CEST  Task Setup  Building Task Directory
08/24/17 10:11:17 CEST  Received    Task received by client

Nomad eval-status

nomad eval-status 788e8161
ID                 = 788e8161
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = job-register
Job ID             = enduser
Priority           = 50
Placement Failures = false

Job file (if appropriate)

job "enduser" {
  datacenters = ["scaleup"]

  update {
    max_parallel = 1
  }

  group "enduser" {
    count = 2

    task "enduser" {
      driver = "docker"

      config {
        image = "https://docker.XXX/enduser:1136"
        port_map = {
          http = 8000
        }
      }

      env {
        APP_ENV = "prod"
      }

      service {
        name = "enduser"
        port = "http"
        tags = [
          "env-prod",
        ]

        check {
          type     = "http"
          protocol = "https"
          path     = "/mail/"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 500
        memory = 128

        network {
          mbits = 10

          port "http" {}
        }
      }

      vault {
        policies = ["enduser-prod"]
      }
    }
  }
}

hynek avatar Aug 24 '17 11:08 hynek

@hynek Yeah, this is an interesting one. Since the deployment has failed, the scheduler avoids placing new instances of the job because it assumes they will fail as well. It's trying to protect you from potentially taking out your job, but in this case you really want to tell the scheduler to do it anyway.

Potentially need a nomad run -force command to override.

dadgar avatar Aug 25 '17 00:08 dadgar

@dadgar we need this too! This is what happened:

  • our task count is something like 80 and we run docker containers on a bunch of underlying EC2 servers.
  • a server was reaped and a new one was added. A bug in the init script prevented it from joining the nomad worker pool
  • Nomad then refused to deploy newer versions of our tasks (new docker images) because it had insufficient resources (CPU)
  • This caused our hybrid cluster (Nomad + non-nomad) to have different versions deployed to production

urjitbhatia avatar May 18 '18 00:05 urjitbhatia

Our task failed due to a broken connection to the underlying database, which left the allocation in a failed state. A nomad job run wouldn't bring it back (even after fixing the underlying database issue). I had to stop the job, wait for it to be stopped, and rerun it.

I'd like to be able to restart the job without killing all tasks.

xeroc avatar Sep 20 '21 06:09 xeroc
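The stop-and-rerun workaround described above can be sketched as follows. This is a hypothetical sequence, assuming a job named enduser defined in enduser.hcl; it causes exactly the downtime the commenters are trying to avoid.

```shell
#!/bin/sh
# Workaround sketch (assumed job name "enduser"): stop the whole job,
# wait for it to wind down, then re-submit the same job file. This
# incurs downtime for all allocations in the job.
nomad job stop enduser

# Poll the job's short status until Nomad reports it as dead.
until nomad job status -short enduser | grep -q 'dead'; do
  sleep 2
done

# Re-register the unchanged job file; a fresh deployment is created.
nomad job run enduser.hcl
```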

When Docker's storage is on a NAS that happens to freeze during a deployment, the deployment will fail (I wouldn't expect otherwise). After fixing the NAS I'd like to re-deploy without having to alter the job file, which is currently not possible.

In short: being able to restart job allocation without killing all tasks would prevent downtime when issues originate from other sources.

finwo avatar Nov 22 '21 13:11 finwo

Is there a way to even do this currently? I tried all the nomad deployment commands and they all complained about the terminal deployment ("can't resume terminal deployment", etc.).

gregory112 avatar Jan 13 '22 07:01 gregory112

@gregory112 deployments that are complete won't ever get run again. Depending on your specific circumstance the nomad alloc stop command may be able to help you out here by forcing a reschedule of a broken allocation.

tgross avatar Jan 13 '22 14:01 tgross
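The nomad alloc stop suggestion above can be sketched like this, reusing the failed allocation ID from the original report as an illustrative placeholder. Unlike stopping the whole job, this targets a single broken allocation and leaves the healthy ones running.

```shell
#!/bin/sh
# Sketch: force a reschedule of one broken allocation without touching
# the rest of the job. The alloc ID below is the one from the report
# above; substitute your own.
nomad alloc stop -detach 6c01a752

# The scheduler creates a replacement allocation; watch its progress
# via the job status output.
nomad job status enduser
```

Note that, per the rest of this thread, this helps with a broken allocation but does not resurrect a deployment that is already in a terminal failed state.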

+1 for the proposed nomad job run -force, then, as it would really help when numerous allocations fail, especially in groups with more than one instance. We use a CI server to deploy most of our jobs, so manually interacting with allocations and stopping them is quite a chore.

gregory112 avatar Jan 16 '22 12:01 gregory112

If my understanding is correct, a deployment with auto_revert disabled, on a job spec that only reschedules (and doesn't restart), on a long enough timeline will result in the number of running tasks in that deployment becoming 0.

@dadgar -

Since the deployment is failed, the scheduler is avoiding placing new instances of the job because it assumes they will fail as well. Trying to protect you from potentially taking out your job. But in this case you really want to tell the scheduler, do it anyways.

Is this called out in the docs anywhere? I just found out this behaviour is the source of some long-running problems I'm experiencing, and I don't want anyone else to have the same issues.

lattwood avatar Sep 11 '23 17:09 lattwood

@lattwood as it turns out, we were just talking about this internally, and we definitely want to put together a doc that ties together the deployments, reschedule, restart, and update blocks.

tgross avatar Sep 11 '23 17:09 tgross

Did this ever get resolved or mitigated?

  • A node was down and thus the deployment failed due to being unable to place the last alloc.
  • A plan shows:

    Task Group: "dns" (1 create, 4 in-place update)

  • But, a run immediately says:

    Deployment "40a3aa12" failed

  • Which, while technically correct, isn't very useful, since that failure was over a day ago. No amount of system gc or other trickery will lift this complete ban on the job version and allow doing what plan (correctly) promises me is the right action.

Like others above, I find it quite "odd" that the proposed solution is to take down the entire job across the whole cluster just to bring it up again on the relevant nodes. I could also do the usual trickery and add to the job:

meta {
  just = "doit"
}

... But that's then a full redeployment interrupting the perfectly fine allocs already doing their job.

jinnatar avatar Aug 21 '25 18:08 jinnatar

I'd also like to see what's going on with this issue. I've just run into it: a node was down during a deployment, and bringing the node back up hasn't recovered things to a healthy state, because the deployment failed. Is this on the roadmap?

ocharles avatar Nov 24 '25 07:11 ocharles

Hi @ocharles, there is no further update at this time from the team. The issue is on our backlog, and when it gets prioritised a member of the team will assign this issue to themselves or add further updates.

jrasell avatar Nov 24 '25 07:11 jrasell

Thanks for the speedy response @jrasell! I appreciate "status update" comments are not helpful, but this is the number one issue that causes the rest of the team to lose confidence in Nomad (as it's not very obvious what's going on), so it's quite important to me. Sorry for the noise!

ocharles avatar Nov 24 '25 07:11 ocharles

Hi @jrasell, thanks for your response here! Maybe there is a logic issue here: https://github.com/hashicorp/nomad/blob/9a288ef493fc1ac5a621a79a35c3d1d4ed165df2/nomad/deploymentwatcher/deployment_watcher.go#L561-L581

A new job version is submitted as a step of the revert procedure. The older deployment's status is changed to DeploymentStatusFailed, but the new deployment carrying the revert "commit" is not created yet, so we get stuck on the previous failed deployment.

heycarl avatar Nov 29 '25 00:11 heycarl