Allocation stopped immediately after start
Nomad version
1.3.6
Issue
Sometimes during a deployment I see that a freshly started allocation is immediately stopped by Nomad. During the same deployment, multiple allocations with an identical alloc index (name) are created. In the example below, all allocs share the same Eval ID and Job Version (full alloc status here: https://gist.github.com/valodzka/231f3942203a5f39528b52241905a06e):
Alloc 1:
ID = d1421efd-6b25-295a-4832-348c09955a8f
Name = bk.client[35]
Client Status = complete
Desired Description = alloc not needed due to job update
....
Recent Events:
Time Type Description
2022-10-07T15:09:45Z Killing Sent interrupt. Waiting 2m0s before force killing
2022-10-07T15:09:45Z Received Task received by client
Alloc 2:
ID = f278d901-b4d0-8b7b-8f20-4f1c46a36a10
Name = bk.client[24]
Client Status = running
Recent Events:
Time Type Description
2022-10-07T15:10:00Z Started Task started by client
2022-10-07T15:09:57Z Task Setup Building Task Directory
2022-10-07T15:09:45Z Received Task received by client
Placement Metrics
* Constraint "${meta.only_low_cpu_tasks} = false": 1 nodes excluded by filter
* Resources exhausted on 31 nodes
* Dimension "network: port collision" exhausted on 1 nodes
* Dimension "cpu" exhausted on 13 nodes
* Dimension "memory" exhausted on 17 nodes
Alloc 3:
ID = 246dec93-d65e-9498-1d69-816d727083d7
Name = bk.client[24]
Client Status = running
Recent Events:
Time Type Description
2022-10-07T15:09:57Z Started Task started by client
2022-10-07T15:09:57Z Task Setup Building Task Directory
2022-10-07T15:09:45Z Received Task received by client
What is strange here:
- Allocs 2 and 3 have an identical alloc index/name (with the same job version and no canaries used)
- Alloc 1 was killed the same second it was started
- Alloc 1's "Desired Description" is "alloc not needed due to job update", but I'm sure the job definition wasn't updated
- Alloc 2 has `Dimension "network: port collision" exhausted on 1 nodes` in its placement metrics. Not sure if it's relevant here, but I guess it might be related.
Issue #12797 might be relevant.
Expected Result
- An alloc shouldn't be killed without a reason
- No allocs with the same job version should share the same alloc index
Nomad Server/Client logs
I checked the logs for both the client and the server and didn't find anything relevant.
Hi @valodzka and thanks for raising this issue.
Allocs 2 and 3 have an identical alloc index/name
This certainly looks identical to #12797 as you pointed out, so cross-linking it from this issue is useful to provide additional context when that item of work gets roadmapped.
alloc not needed due to job update
This message is used during the allocation reconciliation phase, which is triggered when the scheduler receives an updated job specification. It would therefore, on the surface, seem that something has triggered an update. Could you share more information about the deployment process you use for Nomad jobs?
I checked the logs for both the client and the server and didn't find anything relevant.
If you have these available from both the leader and the client where the allocation was placed and immediately stopped, it would provide some useful additional context. There may also be some non-obvious messages that give an idea of what is happening.
It would therefore, on the surface, seem that something has triggered an update. Could you share more information about the deployment process you use for Nomad jobs?
We use a simple automated system that calls `flock <lock-file> nomad job run <job-file>` for deployment. `flock` is used to block concurrent deployments, so I'm pretty sure that there weren't any updates. Also, as you can see in the full alloc status, the `Job Version` remained the same (1654); if something had modified the job, the version would have been increased.
If you have these available from both the leader and the client where the allocation was placed and immediately stopped, it would provide some useful additional context.
Leader:
2022-10-07T15:09:44.828Z [INFO] nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=608e656f-e3b9-6ea0-ea99-c4dfc6bc02e6 reason="reserved alloc port collision: collision when reserving port for alloc 2ac62f30-a072-d797-8023-b6bf089f2ac0: [port 30261 already in use]" eval_id=6cdc78f2-8fa1-cc00-193f-45e923472f6a namespace=default
Client:
2022-10-07T15:09:55.478Z [INFO] client.driver_mgr.docker: stopped container: container_id=9e6d72ce5f5eced40e72a8f8fd88ccd6b8d54007394901680591407fd8089ab8 driver=docker
2022-10-07T15:09:55.814Z [INFO] agent: (runner) stopping
2022-10-07T15:09:55.814Z [INFO] client.gc: marking allocation for GC: alloc_id=75f2d3c4-946e-3977-83d3-d074a12102b4
2022-10-07T15:09:55.816Z [INFO] agent: (runner) received finish
2022-10-07T15:09:55.906Z [INFO] client.gc: marking allocation for GC: alloc_id=d1421efd-6b25-295a-4832-348c09955a8f
2022-10-07T15:10:03.655Z [INFO] client.driver_mgr.docker: stopped container: container_id=7ff101994a24366c00b9184b6555426efb33c56f0186c22152dda0fb9c754df1 driver=docker
2022-10-07T15:10:03.684Z [INFO] agent: (runner) stopping
2022-10-07T15:10:03.684Z [INFO] client.gc: marking allocation for GC: alloc_id=6256dfb2-a253-d644-70fe-45cfb4867171
2022-10-07T15:10:03.684Z [INFO] agent: (runner) received finish
2022-10-07T15:10:03.697Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=5175e773-5813-7577-1031-25154e11b043 task=java @module=logmon path=/opt/nomad/data/alloc/5175e773-5813-7577-1031-25154e11>
2022-10-07T15:10:03.698Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=5175e773-5813-7577-1031-25154e11b043 task=java @module=logmon path=/opt/nomad/data/alloc/5175e773-5813-7577-1031-25154e11>
2022-10-07T15:10:03.711Z [INFO] agent: (runner) creating new runner (dry: false, once: false)
2022-10-07T15:10:03.711Z [INFO] agent: (runner) creating watcher
2022-10-07T15:10:03.712Z [INFO] agent: (runner) starting
2022-10-07T15:10:03.893Z [INFO] client.driver_mgr.docker: created container: driver=docker container_id=84127937ca67958655da48a14e8c8204d4ab843a22c33d46e6e04617732cb5f2
2022-10-07T15:10:04.015Z [INFO] client.driver_mgr.docker: started container: driver=docker container_id=84127937ca67958655da48a14e8c8204d4ab843a22c33d46e6e04617732cb5f2
The issue appeared again. Unfortunately the allocations were garbage collected before I could inspect them, but in the leader log I see the same message (the first time it has appeared in the log since 2022-10-07T15:09:44.828Z), so it might be related:
2022-10-10T18:39:31.503Z [INFO] nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=e725aa89-2bad-bf30-9b7a-3c704722ff40 reason="reserved alloc port collision: collision when reserving port for alloc 8a8c641a-e7b8-dc95-ba80-d3bb5be517f5: [port 26558 already in use]" eval_id=5055fde0-6985-da02-e7db-bb10ac57e819 namespace=default
Hi @jrasell, I tried to create a reproducer for this issue with Vagrant, and it turns out it's relatively easy to reproduce. Steps:
git clone -b issue_14850 git@github.com:valodzka/nomad-vagrant-lab.git
cd nomad-vagrant-lab
vagrant up
vagrant ssh nomad-a-1
bash /vagrant/launch-a-1.sh
# repeat ssh/launch for a-2 and a-3
bash -x reproduce_14850.sh
# wait a few dozen iterations
`nomad-vagrant-lab` is a basic Nomad/Consul cluster template without any tweaks. `reproduce_14850.sh` runs deployments of a simple job with 50 allocations, each of them using 11 dynamic ports. I tried twice, and on my machine (an old MacBook Pro) it took around 20-30 deployments to trigger the issue.
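For reference, the reproducer job is roughly of the following shape. This is only a minimal sketch based on the description above; the job, group, task, and port names are illustrative, and the actual job file is in the linked repo and may differ:

```hcl
job "bk" {
  datacenters = ["dc1"]

  group "client" {
    # 50 allocations per deployment
    count = 50

    network {
      # each allocation requests 11 dynamic ports (labels are illustrative)
      port "p1" {}
      port "p2" {}
      port "p3" {}
      port "p4" {}
      port "p5" {}
      port "p6" {}
      port "p7" {}
      port "p8" {}
      port "p9" {}
      port "p10" {}
      port "p11" {}
    }

    task "java" {
      driver = "docker"

      config {
        # placeholder workload; the real job runs an application container
        image   = "busybox:1.34"
        command = "sleep"
        args    = ["3600"]
      }
    }
  }
}
```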
Hi @valodzka and thanks for the additional details and context.
nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=e725aa89-2bad-bf30-9b7a-3c704722ff40 reason="reserved alloc port collision: collision when reserving port for alloc 8a8c641a-e7b8-dc95-ba80-d3bb5be517f5: [port 26558 already in use]" eval_id=5055fde0-6985-da02-e7db-bb10ac57e819 namespace=default
This is unfortunately a known bug which, despite significant engineering effort, has yet to be resolved. The documentation link within the log line has more information on this and on workarounds for it.
create a reproducer for this issue with Vagrant, and it turns out it's relatively easy to reproduce
This is very useful, and thanks for taking the time to create it. It will prove very helpful further down the line to engineers looking into this issue.
I am now going to move this issue from initial triage onto our backlog.
This is unfortunately a known bug which, despite significant engineering effort, has yet to be resolved. The documentation link within the log line has more information on this and on workarounds for it.
@jrasell thank you for the reply. I already read the documentation about this log message, and if I understood correctly, it is about a different bug. It states that "it is possible for these log lines to occur infrequently due to normal cluster conditions", and that the problem exists only if there are "repeated log lines". I see this log message only once every few days. Is this considered repeated? Would the workaround with the plan rejection tracker, which remembers plans for only 5 minutes, help in this case?
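For reference, the workaround I'm referring to is the `plan_rejection_tracker` block in the server configuration described on that page. A minimal sketch; the values shown are assumptions based on the documented defaults:

```hcl
server {
  enabled = true

  plan_rejection_tracker {
    # track plan rejections per node and mark nodes that keep failing as ineligible
    enabled = true

    # number of rejections within the window before a node is marked ineligible
    node_threshold = 100

    # how long rejections are remembered (the 5 minutes mentioned above)
    node_window = "5m"
  }
}
```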