Forcing a placement with failed deployment
Nomad version
Nomad v0.6.0
Operating system and Environment details
Ubuntu Xenial running in LXD
Issue
So this is a bit more obscure than I initially thought.
We had a bit of a rough time (hilariously, consul running in an LXD container hung up a whole metal server) and I had to kill off a node in the middle of a deployment because it just hung while supposedly downloading a docker image.
At that point the deployment had already placed an alloc on another node, but I can't get it to retry placing the second one. Both nomad run and nomad plan just pretend everything is fine.
I was able to re-fill clients that we lost during an outage tonight, so it seems to be specific to the failed deployment?
I could only fix it by forcing a change in the plan (I just rebuilt the container).
Nomad Status
$ nomad status enduser
ID = enduser
Name = enduser
Submit Date = 08/23/17 21:28:51 CEST
Type = service
Priority = 50
Datacenters = scaleup
Status = running
Periodic = false
Parameterized = false
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
enduser     0       0         1        0       2         1
Latest Deployment
ID = 6f7b3785
Status = failed
Description = Failed due to unhealthy allocations
Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
enduser     2        2       1        1
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
6c01a752  b77a68ed  enduser     3        stop     lost      08/24/17 10:11:16 CEST
e2693daa  60101861  enduser     3        run      running   08/24/17 10:11:16 CEST
db58bf99  c7caeb6b  enduser     3        run      complete  08/24/17 05:14:15 CEST
2f2c1d45  c7caeb6b  enduser     3        run      complete  08/23/17 21:28:52 CEST
Nomad plan
Job: "enduser"
Task Group: "enduser" (1 ignore)
Task: "enduser"
Scheduler dry-run:
- All tasks successfully allocated.
Job Modify Index: 858133
To submit the job with version verification run:
nomad run -check-index 858133 _enduser.hcl
When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
Nomad run
$ nomad run _enduser.hcl
==> Monitoring evaluation "788e8161"
Evaluation triggered by job "enduser"
Evaluation within deployment: "6f7b3785"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "788e8161" finished with status "complete"
Nomad alloc-status for the broken alloc
$ nomad alloc-status 6c01a752
ID = 6c01a752
Eval ID = bf04b0f5
Name = enduser.enduser[1]
Node ID = b77a68ed
Job ID = enduser
Job Version = 3
Client Status = failed
Client Description = <none>
Desired Status = stop
Desired Description = alloc is lost since its node is down
Created At = 08/24/17 10:11:16 CEST
Deployment ID = 6f7b3785
Deployment Health = unhealthy
Task "enduser" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
500 MHz  128 MiB  300 MiB  0     http: 10.6.32.3:29213
Task Events:
Started At = N/A
Finished At = N/A
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time                    Type        Description
08/24/17 10:56:55 CEST  Killing     Killing task: vault: failed to derive token: Can't request Vault token for terminal allocation
08/24/17 10:16:17 CEST  Driver      Downloading image docker.XXX/enduser:1136
08/24/17 10:11:17 CEST  Task Setup  Building Task Directory
08/24/17 10:11:17 CEST  Received    Task received by client
Nomad eval-status
$ nomad eval-status 788e8161
ID = 788e8161
Status = complete
Status Description = complete
Type = service
TriggeredBy = job-register
Job ID = enduser
Priority = 50
Placement Failures = false
Job file (if appropriate)
job "enduser" {
datacenters = ["scaleup"]
update {
max_parallel = 1
}
group "enduser" {
count = 2
task "enduser" {
driver = "docker"
config {
image = "https://docker.XXX/enduser:1136"
port_map = {
http = 8000
}
}
env {
APP_ENV = "prod"
}
service {
name = "enduser"
port = "http"
tags = [
"env-prod",
]
check {
type = "http"
protocol = "https"
path = "/mail/"
interval = "10s"
timeout = "2s"
}
}
resources {
cpu = 500
memory = 128
network {
mbits = 10
port "http" {}
}
}
vault {
policies = ["enduser-prod"]
}
}
}
}
@hynek Yeah this is an interesting one. Since the deployment is failed, the scheduler is avoiding placing new instances of the job because it assumes they will fail as well. Trying to protect you from potentially taking out your job. But in this case you really want to tell the scheduler, do it anyways.
Potentially need a nomad run -force command to override.
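In the meantime, the terminal state can be confirmed (but not escaped) with the existing deployment commands. A quick sketch, using the deployment ID from this report:

$ nomad deployment status 6f7b3785   # Status = failed, Description = Failed due to unhealthy allocations
$ nomad deployment resume 6f7b3785   # refused: a terminal deployment can't be resumed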
@dadgar we need this too! This is what happened:
- our task count is something like 80 and we run docker containers on a bunch of underlying EC2 servers.
- a server was reaped and a new one was added. A bug in the init script prevented it from joining the nomad worker pool
- Nomad then refused to deploy newer versions of our tasks (new docker images) because it had insufficient resources (CPU)
- This caused our hybrid cluster (Nomad + non-nomad) to have different versions deployed to production
Our task fails due to a broken connection to the underlying database, which leaves the allocation in a failed state. A nomad job run wouldn't let me bring it back (even after fixing the underlying database issue). I have to stop the job, wait for it to be stopped, and re-run it, as sketched below.
I'd like to be able to restart the job without killing all tasks.
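For reference, the stop-and-rerun workaround looks roughly like this (a sketch, using the job and file name from this issue):

$ nomad stop enduser       # kills all running allocs of the job
$ nomad status enduser     # wait until the job reports status dead
$ nomad run _enduser.hcl   # re-register the job, starting a fresh deployment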
When docker's storage is on a NAS that happens to freeze during a deployment, the deployment will fail (I wouldn't expect otherwise). After fixing the NAS I'd like to re-deploy without having to alter the job file, which is currently not possible.
In short: being able to restart a job's allocations without killing all tasks would prevent downtime when the issue originates elsewhere.
Is there even a way to do this currently? I tried all the nomad deployment commands and they refused to operate on the terminal deployment ("can't resume terminal deployment", etc.).
@gregory112 deployments that are complete won't ever get run again. Depending on your specific circumstance the nomad alloc stop command may be able to help you out here by forcing a reschedule of a broken allocation.
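For example, with the broken alloc from the original report (a sketch; nomad alloc stop ships in Nomad 0.9.2 and later):

$ nomad alloc stop 6c01a752     # stop the alloc; the scheduler creates a reschedule eval
$ nomad alloc status 6c01a752   # confirm it stopped; the replacement alloc shows up under the job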
+1 for nomad job run -force then, as it would really help when numerous allocations fail, especially for task groups with more than one instance. We use a CI server to deploy most of our jobs, so manually interacting with allocations and stopping them is quite a chore.
If my understanding is correct, a deployment with auto_revert disabled, on a job spec that only reschedules (and doesn't restart), will on a long enough timeline end up with zero running tasks in that deployment.
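A minimal sketch of that combination (an illustrative job, not the one from this issue):

job "example" {
  update {
    max_parallel = 1
    auto_revert  = false   # a failed deployment is never rolled back
  }

  group "app" {
    count = 2

    restart {
      attempts = 0       # never restart a failed task in place...
      mode     = "fail"
    }

    reschedule {
      unlimited = true   # ...only reschedule it onto another node
    }

    # task stanza omitted for brevity
  }
}

If, as described above, the scheduler then refuses replacements under the failed deployment, the running count can only ratchet down.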
@dadgar -
Since the deployment is failed, the scheduler is avoiding placing new instances of the job because it assumes they will fail as well. Trying to protect you from potentially taking out your job. But in this case you really want to tell the scheduler, do it anyways.
Is this called out in the docs anywhere? I just found out this behaviour is the source of some long-running problems I'm experiencing, and I don't want anyone else to run into the same issues.
@lattwood as it turns out, we were just talking about that internally, and we definitely want to put together a doc that ties together deployments and the reschedule, restart, and update blocks.
Did this ever get resolved or mitigated?
- A node was down and thus the deployment failed due to being unable to place the last alloc.
- A plan shows: Task Group: "dns" (1 create, 4 in-place update)
- But a run immediately says: Deployment "40a3aa12" failed
- Which, while technically correct, isn't very useful since that was over a day ago. No amount of system gc or any other trickery will get this complete ban on this job version lifted and allow doing what plan (correctly) promises me is the correct action.
Like others above, I find it quite "odd" that the proposed solution is to take the entire job down across the whole cluster just to be able to bring it up again on all relevant nodes. I could also do the usual trickery and add to the job:

meta {
  just = "doit"
}

... but that's then a full redeployment, interrupting the perfectly fine allocs already doing their job.
I'd also like to see what's going on with this issue. I've just run into it: a node was down during a deployment, and bringing the node back up hasn't recovered things to a healthy state, because the deployment failed. Is this on a roadmap?
Hi @ocharles, there is no further update at this time from the team. The issue is on our backlog, and when it gets prioritised a member of the team will assign this issue to themselves or add further updates.
Thanks for the speedy response @jrasell! I appreciate "status update" comments are not helpful, but this is the number one issue that causes the rest of the team to lose confidence in Nomad (as it's not very obvious what's going on), so it's quite important to me. Sorry for the noise!
Hi @jrasell, thanks for your response here! Maybe there is a logic issue here: https://github.com/hashicorp/nomad/blob/9a288ef493fc1ac5a621a79a35c3d1d4ed165df2/nomad/deploymentwatcher/deployment_watcher.go#L561-L581
A new job version is submitted as a step of the revert procedure. The older deployment's status is changed to DeploymentStatusFailed, but the new deployment for the reverted "commit" is not created yet. So we're stuck on the previous failed deployment.