nomad
Services not unregistered
Just upgraded to Nomad 1.5.2. Since then, services are not always unregistered from the Consul service catalog when they are shut down or upgraded, so old service versions appear as failed, e.g.
Environment:
- Nomad 1.5.2 (with ACL), using the pre-built binary
- Consul 1.15.1 (with ACL), using the pre-built binary
- AlmaLinux 8
I haven't yet found a pattern to reproduce it 100% of the time.
Hi @dani, do you have any logs from the clients that were running the allocations that had services that should be deregistered? If you do and can pass them along, I can take a look through them and see if I can identify anything useful. If you have any other useful information that would be great, in order to try and reproduce this.
1.5.2 included https://github.com/hashicorp/nomad/pull/16289/files which was supposed to fix a bug where we would attempt to deregister services twice. The key difference is that we now set a flag marking the services as deregistered after the PostRun() allocrunner hook is run, preventing further attempts at deregistration.
Thinking about it now and reading our own docs, it is unclear whether PostRun implies an alloc is terminal ... if it isn't, and the services get re-registered for the same allocation, they'll never be deregistered.
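To make the hazard concrete, here is a minimal sketch (not Nomad's actual code; class and service names are made up) of the pattern described above: a "deregistered" flag set after PostRun suppresses later deregistration attempts, so anything re-registered afterwards for the same allocation is left orphaned in the catalog.

```python
# Minimal sketch (NOT Nomad internals) of the flag-guard hazard:
# a "deregistered" flag set after PostRun suppresses later deregistration,
# so a service re-registered afterwards stays in the catalog forever.

class Catalog:
    """Stands in for the Consul service catalog."""
    def __init__(self):
        self.services = set()

class ServiceHook:
    """Stands in for the allocrunner service hook."""
    def __init__(self, catalog, service_id):
        self.catalog = catalog
        self.service_id = service_id
        self.deregistered = False

    def register(self):
        self.catalog.services.add(self.service_id)
        # note: re-registration does not clear the flag

    def deregister(self):
        if self.deregistered:
            return  # guard added to avoid deregistering twice
        self.catalog.services.discard(self.service_id)
        self.deregistered = True

catalog = Catalog()
hook = ServiceHook(catalog, "web-1")
hook.register()
hook.deregister()   # PostRun: flag is now set
hook.register()     # services re-registered for the same allocation
hook.deregister()   # no-op because of the flag
print(sorted(catalog.services))  # → ['web-1'] — the orphaned service remains
```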
I was just able to trigger it on my Prometheus job:
- I have a version of Prometheus running, with 2 instances and 2 services registered in Consul
- I change a config file and run the job again; Nomad starts the rolling update
- In the end, I still have 2 instances running, two working services, and two failed services (corresponding to the two previous ones)
Here are my system logs during this rolling update:
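The two failed entries can also be spotted programmatically. A hedged sketch (field names follow the shape of Consul's `/v1/health/state/critical` response; the sample values are made up, not taken from this deployment):

```python
# Hedged sketch: pick out service instances behind critical health checks,
# using field names from Consul's /v1/health/state/critical response shape.

def critical_instances(checks):
    """Return sorted (Node, ServiceID) pairs for checks tied to a service."""
    return sorted(
        {(c["Node"], c["ServiceID"]) for c in checks if c.get("ServiceID")}
    )

# Sample payload shaped like the Consul response (values are invented):
sample = [
    {"Node": "node-1", "ServiceID": "_nomad-task-abc-prometheus", "Status": "critical"},
    {"Node": "node-2", "ServiceID": "_nomad-task-def-prometheus", "Status": "critical"},
    {"Node": "node-1", "ServiceID": "", "Status": "critical"},  # node-level check
]
print(critical_instances(sample))
```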
Hi @dani, I've not been able to reproduce this yet locally; are you able to share the jobspec, or a redacted version you are using and what exactly is being changed before you register the new version? Thanks.
OK, this particular job file was quite big; I'll try to reproduce with a simpler one (but I'll first have to install 1.5.2 again, as I had to revert to 1.5.1 because this issue made my test cluster totally unusable).
I have seen the same issue; I've even reproduced it using the counter demo app. The issue only happens for me when ACL is enabled. Nomad: 1.5.2-1, Consul: 1.15.1-1, Ubuntu: 20.04
Deploy the demo app, add an additional tag, and re-deploy, and you now have two instances registered in Consul. Exec into the downstream and curl $NOMAD_UPSTREAM_ADDR_servicename a number of times and you will see some requests return "connection reset by peer", as Consul is now returning services that are "completed" in Nomad.
hope this helps.
I have also reverted my lab to 1.5.1-1
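The intermittent "connection reset by peer" behaviour described above follows directly from stale entries staying in the catalog: any load balancing across the returned instances hits dead ones some fraction of the time. A small illustration (the instance addresses and pool are invented, not taken from the demo app):

```python
# Hedged sketch: round-robin over catalog entries where some instances are
# stale ("completed" in Nomad but still registered in Consul).

from itertools import cycle

# (address, alive) pairs — invented data for illustration
instances = [
    ("10.0.0.1:9001", True),   # healthy, current deployment
    ("10.0.0.2:9001", True),   # healthy, current deployment
    ("10.0.0.3:9001", False),  # stale: task already completed in Nomad
    ("10.0.0.4:9001", False),  # stale: task already completed in Nomad
]

def send_requests(n):
    """Count how many of n round-robin requests land on a dead instance."""
    rr = cycle(instances)
    return sum(1 for _ in range(n) if not next(rr)[1])

print(send_requests(8))  # → 4: half the pool is stale, so half the requests fail
```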
I had the same problem, and saw the same behaviour with the Nomad service discovery (NSD) provider.
Hi, we had the same problem after upgrading Nomad from 1.4.5 to 1.4.7 and restarting the Consul agents on the nodes. It seems to only affect services in the Consul service mesh. After the upgrade, Nomad services using the connect stanza with proxied upstreams showed old versions of the deployment failing in Consul (v1.14.4).
Reverted to Nomad 1.4.5.
Regards.
Hi everyone and thanks for the information and additional context. We have been able to reproduce this locally and have some useful information to start investigating, so will update here once we have anything more.
Additional repro that I've closed as a dupe, but just in case there's anything useful in the logs: https://github.com/hashicorp/nomad/issues/16739
Hi everyone, we are continuing to look into this, and while we were able to reproduce it in one scenario, I wanted to gather some more information.
For those that have experienced this: are you setting the Consul agent ACL token via the consul acl set-agent-token command, the API equivalent, or via the agent config? This is a requirement in Consul v1.15.0 and later.
It seems to specifically affect Nomad v1.5.2, v1.4.7, and v1.3.12. If you do set the above token, are you able to provide context on the deployment that has the problem?
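For reference, the CLI form mentioned above looks like this (the token value is a placeholder, and the agent must already be running):

```shell
# Set the agent token on a running Consul agent (placeholder token value):
consul acl set-agent-token agent "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
```

The config-file equivalents are shown in the replies that follow.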
In my case, I set the token in the config file, like:

```hcl
acl {
  enabled                  = true
  enable_token_persistence = true
  default_policy           = "deny"
  tokens {
    default = "XXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
  }
}
```

Is this unsupported now? (It's easier to set it in the config when deploying with tools like Ansible.)
We are also setting it via Consul config:

```json
{
  "acl": {
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    },
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enable_token_persistence": true
  }
}
```
This issue is still present in Nomad v1.5.3.
An observation from my side.
I created this bash script to clean up the services that were not unregistered from Consul, as an interim solution:

```bash
#!/bin/bash
CONSUL_HTTP_ADDR="http://consul.service.consul:8500"
CONSUL_TOKEN=XXXX

# Get all unhealthy checks
unhealthy_checks=$(curl -s --header "X-Consul-Token: ${CONSUL_TOKEN}" "${CONSUL_HTTP_ADDR}/v1/health/state/critical" | jq -c '.[]')

# Iterate over the unhealthy checks and deregister the associated service instances
echo "$unhealthy_checks" | while read -r check; do
  service_id=$(echo "$check" | jq -r '.ServiceID')
  node=$(echo "$check" | jq -r '.Node')
  # node-level checks have an empty ServiceID, so skip those too
  if [ "$service_id" != "null" ] && [ -n "$service_id" ] && [ "$node" != "null" ]; then
    echo "Deregistering unhealthy service instance: ${service_id} on node ${node}"
    curl --header "X-Consul-Token: ${CONSUL_TOKEN}" -X PUT "${CONSUL_HTTP_ADDR}/v1/catalog/deregister" -d "{\"Node\": \"${node}\", \"ServiceID\": \"${service_id}\"}"
  else
    echo "Skipping check with no associated service instance or node"
  fi
done
```
This works and the service instances that are "dead" are removed from Consul UI, but only for a few seconds. They reappear in the UI moments later. I cannot confirm this, but I suspect Nomad is re-registering them.
This is still on Nomad 1.5.2. Upgrading to 1.5.3 now to test again.
> This works and the service instances that are "dead" are removed from Consul UI, but only for a few seconds. They reappear in the UI moments later. I cannot confirm this, but I suspect Nomad is re-registering them.

Indeed. The issue does not belong to Consul. If you restart the Nomad service, the dead services disappear from Nomad and Consul.

> This is still on Nomad 1.5.2. Upgrading to 1.5.3 now to test again.

1.5.3 has the same bug.
We have the same issue in our environment running Consul version 1.15.2 with Nomad version 1.5.2. FWIW, we are setting our agent ACL tokens via config:

```json
{
  "acl": {
    "default_policy": "deny",
    "down_policy": "allow",
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    }
  }
}
```
> We have the same issue in our environment running Consul version 1.15.2 with Nomad version 1.5.2. FWIW, we are setting our agent ACL tokens via config.

I have no ACL enabled. I don't think this issue is related to the ACL system.
Not sure if related, but I keep experiencing this without ACLs being turned on. It's hard to pinpoint, but from what I've seen it mostly happens when an ASG cycles the Nomad hosts and the job is being rescheduled on the new host. Combo is: Consul 1.15.2 and Nomad 1.5.3.
...
I had this issue with the Nomad service provider without ACLs; the context details are in #16890.
Ubuntu 22.04.2 LTS, Nomad 1.5.3, Docker 23.0.3
Encountered this as well. I am able to reproduce quite reliably with the following sequence:
- Restart the alloc
- Immediately after, stop the alloc

using Nomad 1.4.7.
![image](https://user-images.githubusercontent.com/21690857/233492899-7e05e4e1-a411-416e-8fe1-9fd7e79a4c94.png)
The stale entries in Consul are automatically cleaned up after restarting the Nomad client where the allocation was placed.
I suspect it could be related to https://github.com/hashicorp/nomad/issues/16289, but I haven't confirmed.
Update: I downgraded the clients to 1.4.6 and do not (seem to) see this issue anymore using the above steps.
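The restart-then-immediately-stop repro is consistent with an ordering race: if the stop's deregistration is processed before the restart's re-registration lands, the re-registered entry survives the stop. A hedged sketch of that ordering (not Nomad internals; the service name is invented):

```python
# Hedged sketch (NOT Nomad internals): two orderings of the same
# register/deregister pair, stand-in for restart + stop racing each other.

catalog = set()  # stands in for the Consul service catalog

def register(sid):
    catalog.add(sid)

def deregister(sid):
    catalog.discard(sid)

# Intended order: restart's re-registration lands first, stop then
# deregisters -> catalog ends up clean
register("counter-api")    # restart: re-register
deregister("counter-api")  # stop: deregister
clean = set(catalog)

# Race: stop's deregistration is processed first, the restart's
# re-registration arrives late -> a stale entry survives the stop
deregister("counter-api")  # stop processed first (no-op here)
register("counter-api")    # restart's re-registration lands afterwards
print(sorted(clean), sorted(catalog))  # → [] ['counter-api']
```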
https://github.com/hashicorp/nomad/pull/16905 should contain a fix for this - I've checked with the simple alloc restart and job stop repro described by @fredwangwang, but if other folks want to build the branch and confirm, that would be helpful.
@shoenig thanks for the fix!
Our issue is back again with Nomad version 1.5.5. The issue was re-introduced when we stopped (and purged) a job and re-deployed it. If more information is required, please let me know.
@rgruyters a fix went into 1.5.6
Reproduced this in version 1.6.1 with ACL ((
> Reproduced this in version 1.6.1 with ACL ((
Oh no, sorry this is still happening to you.
Would you be able to provide some reproduction steps?
https://github.com/hashicorp/nomad/issues/18203 reports a similar issue and with Nomad 1.6.1 as well, so there still seems to be a problem with services not being unregistered (either in Consul or Nomad). I'm going to reopen this one.
I am running standalone Nomad with ACL (no Consul) and could repro this (see #18203, as mentioned previously).
Running into this issue as well. Consul 1.16.1 and Nomad 1.6.1.