Nomad marks deployments as successful before Consul checks have passed.
Nomad version and environment details
Nomad 1.1.2 -> 1.4.2, Consul 1.9.5, Ubuntu 20.04
Environment configuration:
Servers: 3 Nomad servers plus Consul servers with ACLs enabled (services register here). Clients: each client runs a Nomad client and a Consul agent.
Issue
After upgrading Nomad from 1.1.2 to 1.4.2, some deployments are marked as successful even though the service has not yet been registered with Consul. The problem reproduces when running several deployments at the same time.
Reproduction steps
- Build a Docker image of a service that runs for 3 seconds after starting and then fails.
- Run several deployments via the script nomad-reproducer.go.txt.
Expected Behaviour
All deployments fail.
Actual Behaviour
A few deployments are successful.
Logs for one allocation for successful deployment
Nomad starts to process the deployment:
nomad[2668973]: 2022-12-09T13:34:35.361+0300 [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task
nomad[2668973]: 2022-12-09T13:34:35.422+0300 [WARN] client.alloc_runner.task_runner.task_hook.logmon: plugin configured with a nil SecureConfig: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task
nomad[2668973]: 2022-12-09T13:34:35.422+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task path=/usr/bin/nomad args=["/usr/bin/nomad", "logmon"]
nomad[2668973]: 2022-12-09T13:34:35.422+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin started: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task path=/usr/bin/nomad pid=1424417
nomad[2668973]: 2022-12-09T13:34:35.422+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: waiting for RPC address: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task path=/usr/bin/nomad
nomad[2668973]: 2022-12-09T13:34:35.435+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: plugin address: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task @module=logmon address=/tmp/plugin1614056546 network=unix timestamp="2022-12-09T13:34:35.435+0300"
nomad[2668973]: 2022-12-09T13:34:35.435+0300 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task version=2
nomad[2668973]: 2022-12-09T13:34:35.442+0300 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task @module=logmon path=/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/alloc/logs/.shiva-ci-TestNomad-Run-Fail-test-25-2704-task.stdout.fifo timestamp="2022-12-09T13:34:35.442+0300"
nomad[2668973]: 2022-12-09T13:34:35.443+0300 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task @module=logmon path=/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/alloc/logs/.shiva-ci-TestNomad-Run-Fail-test-25-2704-task.stderr.fifo timestamp="2022-12-09T13:34:35.443+0300"
nomad[2668973]: 2022-12-09T13:34:35.534+0300 [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task binds="[]string{\"/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/alloc:/alloc\", \"/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/shiva-ci-TestNomad-Run-Fail-test-25-2704-task/local:/local\", \"/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/shiva-ci-TestNomad-Run-Fail-test-25-2704-task/secrets:/secrets\", \"/etc/docker/resolv.conf:/etc/resolv.conf:ro\"}"
nomad[2668973]: 2022-12-09T13:34:35.535+0300 [DEBUG] client.driver_mgr.docker: applied labels on the container: driver=docker task_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task labels=map[com.hashicorp.nomad.alloc_id:384708cc-5ca5-1e14-ca6b-7017045ee3c7]
nomad[2668973]: 2022-12-09T13:34:35.535+0300 [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task container_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task-384708cc-5ca5-1e14-ca6b-7017045ee3c7
dockerd starts the container:
dockerd[60232]: 2022-12-09T13:34:35.535621200+03:00 level=debug msg="Calling POST /containers/create?name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task-384708cc-5ca5-1e14-ca6b-7017045ee3c7"
dockerd[60232]: 2022-12-09T13:34:51.808265565+03:00 level=debug msg="Assigning addresses for endpoint shiva-ci-TestNomad-Run-Fail-test-25-2704-task-384708cc-5ca5-1e14-ca6b-7017045ee3c7's interface on network vertis"
The container fails:
dockerd[60232]: 2022-12-09T13:34:57.848499016+03:00 level=debug msg="Revoking external connectivity on endpoint shiva-ci-TestNomad-Run-Fail-test-25-2704-task-384708cc-5ca5-1e14-ca6b-7017045ee3c7 (0469b6ae1bb562a08331195c330fb5fb05e231499861b4f5927ddc25020ef9cc)"
dockerd[60232]: 2022-12-09T13:35:03.354392690+03:00 level=debug msg="Releasing addresses for endpoint shiva-ci-TestNomad-Run-Fail-test-25-2704-task-384708cc-5ca5-1e14-ca6b-7017045ee3c7's interface on network vertis"
Nomad marks the deployment as successful:
2022-12-09 13:35:06.120 Deployment Update for ID "cae9bf7e-d534-eab4-a963-7e68893d8474": Status "successful"; Description "Deployment completed successfully"
_PID=241887 _SYSTEMD_UNIT=nomad.service
The container restarts due to the restart policy:
nomad[2668973]: 2022-12-09T13:35:06.778+0300 [INFO] client.alloc_runner.task_runner: restarting task: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task reason="Restart within policy" delay=1.155116655s
nomad[2668973]: 2022-12-09T13:35:07.938+0300 [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task
nomad[2668973]: 2022-12-09T13:35:07.940+0300 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task @module=logmon path=/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/alloc/logs/.shiva-ci-TestNomad-Run-Fail-test-25-2704-task.stdout.fifo timestamp="2022-12-09T13:35:07.940+0300"
nomad[2668973]: 2022-12-09T13:35:07.940+0300 [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7 task=shiva-ci-TestNomad-Run-Fail-test-25-2704-task path=/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/alloc/logs/.shiva-ci-TestNomad-Run-Fail-test-25-2704-task.stderr.fifo @module=logmon timestamp="2022-12-09T13:35:07.940+0300"
nomad[2668973]: 2022-12-09T13:35:07.946+0300 [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task binds="[]string{\"/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/alloc:/alloc\", \"/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/shiva-ci-TestNomad-Run-Fail-test-25-2704-task/local:/local\", \"/var/lib/nomad/alloc/384708cc-5ca5-1e14-ca6b-7017045ee3c7/shiva-ci-TestNomad-Run-Fail-test-25-2704-task/secrets:/secrets\", \"/etc/docker/resolv.conf:/etc/resolv.conf:ro\"}"
nomad[2668973]: 2022-12-09T13:35:07.946+0300 [DEBUG] client.driver_mgr.docker: applied labels on the container: driver=docker task_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task labels=map[com.hashicorp.nomad.alloc_id:384708cc-5ca5-1e14-ca6b-7017045ee3c7]
nomad[2668973]: 2022-12-09T13:35:07.946+0300 [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task container_name=shiva-ci-TestNomad-Run-Fail-test-25-2704-task-384708cc-5ca5-1e14-ca6b-7017045ee3c7
First Consul check logs for alloc_id=384708cc-5ca5-1e14-ca6b-7017045ee3c7:
consul[48541]: 2022-12-09T13:36:17.347+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.599+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.636+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.684+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.685+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.717+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.732+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.772+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.817+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.818+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.855+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.858+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.877+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.879+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.880+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.898+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.981+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:17.996+0300 [DEBUG] agent: Check in sync: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
consul[48541]: 2022-12-09T13:36:18.120+0300 [WARN] agent: Check is now critical: check=_nomad-check-a73262881d59e860d85f2ba206bc9adbd8c3d3eb
@shoenig Hi! Could you please look at this issue?
Hi folks, we've marked this issue for triage.
Hi, any updates here? Unfortunately, we can't use this version in production, and downgrading is not Nomad's strong suit. Do you need more details?
Hi folks, sorry for missing this issue before. I am able to reproduce it fairly easily by fudging our use of the Consul API to pretend that checks are not being registered on time. I will find a fix and backport it in the next bugfix release set.
+++ b/command/agent/consul/service_client.go
@@ -1474,7 +1474,10 @@ func (c *ServiceClient) AllocRegistrations(allocID string) (*serviceregistration
sreg.Service = services[serviceID]
for checkID := range sreg.CheckIDs {
if check, ok := checks[checkID]; ok {
- sreg.Checks = append(sreg.Checks, check)
+
+ // pretend like check not in consul yet
+ _ = check
+ // sreg.Checks = append(sreg.Checks, check)
}
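The patch above makes AllocRegistrations return registrations with an empty Checks slice, as if Consul had not registered the checks yet. A hypothetical, simplified sketch (not Nomad's actual implementation) of why that makes an allocation look healthy: a "no check is failing" loop over an empty slice is vacuously true.

```go
package main

import "fmt"

// check is a hypothetical stand-in for a Consul agent check result.
type check struct {
	Status string
}

// allHealthy returns false only if some check is not passing.
// When checks is empty (checks not registered with Consul yet),
// the loop body never runs and the result is vacuously true.
func allHealthy(checks []*check) bool {
	for _, c := range checks {
		if c.Status != "passing" {
			return false
		}
	}
	return true // also true for an empty slice
}

func main() {
	fmt.Println(allHealthy(nil))                            // true: no checks registered yet
	fmt.Println(allHealthy([]*check{{Status: "critical"}})) // false once the check exists
}
```

This matches the observed timeline: the deployment is declared successful before the first Consul check result ever arrives, and the check only goes critical afterwards.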