
Allocation stopped immediately after start

Open valodzka opened this issue 1 year ago • 6 comments

Nomad version

1.3.6

Issue

Sometimes during a deployment I see that a freshly started allocation is immediately stopped by Nomad. During the same deployment, multiple allocations with an identical alloc index (name) are created. For example, all of the allocs below share the same Eval ID and Job Version (full alloc status here: https://gist.github.com/valodzka/231f3942203a5f39528b52241905a06e):

Alloc 1:

ID                  = d1421efd-6b25-295a-4832-348c09955a8f
Name                = bk.client[35]
Client Status       = complete
Desired Description = alloc not needed due to job update
....

Recent Events:
Time                  Type      Description
2022-10-07T15:09:45Z  Killing   Sent interrupt. Waiting 2m0s before force killing
2022-10-07T15:09:45Z  Received  Task received by client

Alloc 2:

ID                  = f278d901-b4d0-8b7b-8f20-4f1c46a36a10
Name                = bk.client[24]
Client Status       = running

Recent Events:
Time                  Type        Description
2022-10-07T15:10:00Z  Started     Task started by client
2022-10-07T15:09:57Z  Task Setup  Building Task Directory
2022-10-07T15:09:45Z  Received    Task received by client

Placement Metrics
  * Constraint "${meta.only_low_cpu_tasks} = false": 1 nodes excluded by filter
  * Resources exhausted on 31 nodes
  * Dimension "network: port collision" exhausted on 1 nodes
  * Dimension "cpu" exhausted on 13 nodes
  * Dimension "memory" exhausted on 17 nodes

Alloc 3:

ID                  = 246dec93-d65e-9498-1d69-816d727083d7
Name                = bk.client[24]
Client Status       = running

Recent Events:
Time                  Type        Description
2022-10-07T15:09:57Z  Started     Task started by client
2022-10-07T15:09:57Z  Task Setup  Building Task Directory
2022-10-07T15:09:45Z  Received    Task received by client

What is strange here:

  1. Allocs 2 and 3 have an identical alloc index/name (with the same job version and no canaries used)
  2. Alloc 1 was killed the same second it was started
  3. Alloc 1's "Desired Description" is "alloc not needed due to job update", but I'm sure the job definition wasn't updated
  4. Alloc 2 shows Dimension "network: port collision" exhausted on 1 nodes. Not sure if it's relevant here, but I guess it might be.

Issue #12797 might be relevant.

Expected Result

  1. An alloc shouldn't be killed without a reason
  2. No allocs with the same job version should share the same alloc index

Nomad Server/Client logs

I checked the logs for both the client and the server and didn't find anything relevant.

valodzka avatar Oct 07 '22 19:10 valodzka

Hi @valodzka and thanks for raising this issue.

Allocs 2 and 3 have an identical alloc index/name

This certainly looks identical to #12797 as you pointed out, so cross-linking from this issue is useful to provide additional context when that item of work gets roadmapped.

alloc not needed due to job update

This message is used during the allocation reconciliation phase, which is triggered when the scheduler receives an updated job specification. On the surface it would therefore seem that something has triggered an update; could you share more information about the deployment process you use for Nomad jobs?
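For example (a rough sketch on my side; the job name here is hypothetical, substitute the job and version from the affected deployment), comparing the registered versions of the job around the time of the deployment would confirm whether a new specification was actually submitted:

# hypothetical job name; use the job from the affected deployment
nomad job history -p bk
# inspect a specific registered version if needed
nomad job inspect -version=<n> bk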

I checked the logs for both the client and the server and didn't find anything relevant.

If you have these available from both the leader and the client where the allocation was placed and immediately stopped, it would provide some useful additional context. There may also be some non-obvious messages that hint at what is happening.

jrasell avatar Oct 10 '22 07:10 jrasell

On the surface it would therefore seem that something has triggered an update; could you share more information about the deployment process you use for Nomad jobs?

We use a simple automated system that calls flock <lock-file> nomad job run <job-file> for deployment; flock is used to block concurrent deployments. So I'm pretty sure there weren't any updates. Also, as you can see in the full alloc status, the Job Version remained the same (1654), and if something had modified the job, the version would have been increased.
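Roughly, the wrapper is equivalent to this sketch (the lock file and job file paths here are made up, the real ones differ):

#!/usr/bin/env bash
# deployment wrapper sketch: flock serializes deployments so only one
# "nomad job run" can be in flight at a time
exec flock /var/lock/nomad-deploy.lock nomad job run /etc/nomad/jobs/bk.nomad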

If you have these available from both the leader and the client where the allocation was placed and immediately stopped, it would provide some useful additional context.

Leader:

2022-10-07T15:09:44.828Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=608e656f-e3b9-6ea0-ea99-c4dfc6bc02e6 reason="reserved alloc port collision: collision when reserving port for alloc 2ac62f30-a072-d797-8023-b6bf089f2ac0: [port 30261 already in use]" eval_id=6cdc78f2-8fa1-cc00-193f-45e923472f6a namespace=default

Client:

2022-10-07T15:09:55.478Z [INFO]  client.driver_mgr.docker: stopped container: container_id=9e6d72ce5f5eced40e72a8f8fd88ccd6b8d54007394901680591407fd8089ab8 driver=docker
2022-10-07T15:09:55.814Z [INFO]  agent: (runner) stopping
2022-10-07T15:09:55.814Z [INFO]  client.gc: marking allocation for GC: alloc_id=75f2d3c4-946e-3977-83d3-d074a12102b4
2022-10-07T15:09:55.816Z [INFO]  agent: (runner) received finish
2022-10-07T15:09:55.906Z [INFO]  client.gc: marking allocation for GC: alloc_id=d1421efd-6b25-295a-4832-348c09955a8f
2022-10-07T15:10:03.655Z [INFO]  client.driver_mgr.docker: stopped container: container_id=7ff101994a24366c00b9184b6555426efb33c56f0186c22152dda0fb9c754df1 driver=docker
2022-10-07T15:10:03.684Z [INFO]  agent: (runner) stopping
2022-10-07T15:10:03.684Z [INFO]  client.gc: marking allocation for GC: alloc_id=6256dfb2-a253-d644-70fe-45cfb4867171
2022-10-07T15:10:03.684Z [INFO]  agent: (runner) received finish
2022-10-07T15:10:03.697Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=5175e773-5813-7577-1031-25154e11b043 task=java @module=logmon path=/opt/nomad/data/alloc/5175e773-5813-7577-1031-25154e11>
2022-10-07T15:10:03.698Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=5175e773-5813-7577-1031-25154e11b043 task=java @module=logmon path=/opt/nomad/data/alloc/5175e773-5813-7577-1031-25154e11>
2022-10-07T15:10:03.711Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
2022-10-07T15:10:03.711Z [INFO]  agent: (runner) creating watcher
2022-10-07T15:10:03.712Z [INFO]  agent: (runner) starting
2022-10-07T15:10:03.893Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=84127937ca67958655da48a14e8c8204d4ab843a22c33d46e6e04617732cb5f2
2022-10-07T15:10:04.015Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=84127937ca67958655da48a14e8c8204d4ab843a22c33d46e6e04617732cb5f2

valodzka avatar Oct 10 '22 10:10 valodzka

The issue appeared again. Unfortunately the allocations were garbage collected before I could inspect them, but in the leader log I see the same message (and it is the first time it has appeared in the log since 2022-10-07T15:09:44.828Z), so it might be related:

2022-10-10T18:39:31.503Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=e725aa89-2bad-bf30-9b7a-3c704722ff40 reason="reserved alloc port collision: collision when reserving port for alloc 8a8c641a-e7b8-dc95-ba80-d3bb5be517f5: [port 26558 already in use]" eval_id=5055fde0-6985-da02-e7db-bb10ac57e819 namespace=default

valodzka avatar Oct 11 '22 10:10 valodzka

Hi @jrasell, I tried to create a reproducer for this issue with Vagrant and it turns out it's relatively easy to reproduce. Steps:

git clone -b issue_14850 git@github.com:valodzka/nomad-vagrant-lab.git
cd nomad-vagrant-lab
vagrant up
vagrant ssh nomad-a-1    
bash /vagrant/launch-a-1.sh
# repeat the ssh/launch steps for a-2 and a-3
bash -x reproduce_14850.sh
# wait a few dozen iterations

nomad-vagrant-lab is a basic Nomad/Consul cluster template without any tweaks. reproduce_14850.sh runs deployments of a simple job with 50 allocations, each of them using 11 dynamic ports. I tried twice, and on my machine (an old MacBook Pro) it took around 20-30 deployments to trigger the issue.
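Roughly, the script just redeploys the same job in a loop and checks the resulting allocations (a simplified sketch with a hypothetical job name and file path; the exact script is in the repository above):

#!/usr/bin/env bash
# simplified sketch of reproduce_14850.sh: redeploy the same job repeatedly and,
# after each run, look for allocations that completed right after being placed
for i in $(seq 1 50); do
  echo "=== deployment $i ==="
  nomad job run /vagrant/jobs/many-ports.nomad
  nomad job status -verbose many-ports | grep -c complete
  sleep 10
done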

valodzka avatar Oct 17 '22 10:10 valodzka

Hi @valodzka and thanks for the additional details and context.

nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=e725aa89-2bad-bf30-9b7a-3c704722ff40 reason="reserved alloc port collision: collision when reserving port for alloc 8a8c641a-e7b8-dc95-ba80-d3bb5be517f5: [port 26558 already in use]" eval_id=5055fde0-6985-da02-e7db-bb10ac57e819 namespace=default

This is unfortunately a known bug which, despite significant engineering effort, has yet to be resolved. The documentation link within the log line has more information on this and on workarounds for it.
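As a rough sketch, if one particular node keeps showing up in these rejections, one mitigation is to take that node out of scheduling until it can be investigated (node ID copied from your log line above; please verify against your own cluster before running anything):

# mark the affected node ineligible for new placements, then drain it
nomad node eligibility -disable e725aa89-2bad-bf30-9b7a-3c704722ff40
nomad node drain -enable -yes e725aa89-2bad-bf30-9b7a-3c704722ff40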

create a reproducer for this issue with Vagrant and it turns out it's relatively easy to reproduce

This is very helpful, and thanks for taking the time to create it. It will prove useful further down the line to engineers looking into this issue.

I am going to move this issue onto our backlog now from initial triage.

jrasell avatar Oct 17 '22 10:10 jrasell

This is unfortunately a known bug which, despite significant engineering effort, has yet to be resolved. The documentation link within the log line has more information on this and on workarounds for it.

@jrasell thank you for the reply. I already read the documentation about this log message, and if I understood correctly it describes a different bug. It states that "it is possible for these log lines to occur infrequently due to normal cluster conditions", and that the problem exists only if there are "repeated log lines". I see this log message only once every few days. Is this considered repeated? Would the workaround with a plan rejection tracker that remembers plans only for 5 minutes help in this case?

valodzka avatar Oct 17 '22 10:10 valodzka