
Services not unregistered

dani opened this issue 1 year ago • 42 comments

Just upgraded to Nomad 1.5.2. Since then, services are not always unregistered from the Consul service catalog when they are shut down or upgraded, so old service versions appear as failed, e.g.

[screenshot: old service versions showing as failed in the Consul UI]

Environment:

  • Nomad 1.5.2 (with ACL), using the pre-built binary
  • Consul 1.15.1 (with ACL), using the pre-built binary
  • Alma Linux 8

I haven't yet found a pattern that reproduces it 100% of the time.

dani • Mar 23 '23

Hi @dani, do you have any logs from the clients that were running the allocations whose services should have been deregistered? If you can pass them along, I'll take a look through them and see if I can identify anything useful. Any other information would also be great, to help us try to reproduce this.

jrasell • Mar 23 '23

1.5.2 included https://github.com/hashicorp/nomad/pull/16289/files, which was supposed to fix a bug where we would attempt to deregister services twice. The key difference is that we now set a flag marking the services as deregistered after the PostRun() allocrunner hook runs, preventing further attempts at deregistration.

Thinking about it now and reading our own docs, it is unclear whether PostRun implies an alloc is terminal ... if it isn't, and the services get re-registered for the same allocation, they'll never be deregistered.
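
To illustrate the suspected interaction (a minimal Go sketch; the type and method names are hypothetical, not Nomad's actual allocrunner code):

package main

import "fmt"

// serviceHook stands in for the per-alloc service lifecycle hook.
type serviceHook struct {
	deregistered bool // set once services have been deregistered
}

func (h *serviceHook) Register(name string) {
	fmt.Println("register:", name)
}

// PostRun deregisters services, but only once: the flag suppresses
// every later attempt.
func (h *serviceHook) PostRun(name string) {
	if h.deregistered {
		return
	}
	fmt.Println("deregister:", name)
	h.deregistered = true
}

func main() {
	h := &serviceHook{}
	h.Register("prometheus")
	h.PostRun("prometheus") // first deregistration works

	// If the alloc was not actually terminal and the service gets
	// re-registered, the flag makes this a no-op and the stale
	// registration is never removed:
	h.Register("prometheus")
	h.PostRun("prometheus")
}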

shoenig • Mar 23 '23

I was just able to trigger it on my Prometheus job:

  • I have a version of Prometheus running, with 2 instances and 2 services registered in Consul
  • I change a config file and run the job again; Nomad starts the rolling update
  • In the end, I still have 2 instances running, 2 working services, and 2 failed services (corresponding to the 2 previous ones)

Here are my system logs from this rolling update:

nomad_unreg.txt

dani • Mar 23 '23

Hi @dani, I've not been able to reproduce this locally yet; are you able to share the jobspec (or a redacted version of it), and what exactly is being changed before you register the new version? Thanks.

jrasell • Mar 24 '23

OK, this particular job file was quite big; I'll try to reproduce with a simpler one (but I'll first have to install 1.5.2 again, as I had to revert to 1.5.1 because this issue made my test cluster totally unusable).
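
A minimal job along these lines should exercise the same rolling-update path (a sketch; the job name, image, and port are hypothetical, not my actual spec):

job "whoami" {
  datacenters = ["dc1"]

  update {
    max_parallel = 1
  }

  group "web" {
    count = 2

    network {
      port "http" {
        to = 80
      }
    }

    service {
      name = "whoami"
      port = "http"
      # bump this tag and re-run the job to trigger a rolling update
      tags = ["v1"]

      check {
        type     = "http"
        path     = "/"
        interval = "10s"
        timeout  = "2s"
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "traefik/whoami"
        ports = ["http"]
      }
    }
  }
}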

dani • Mar 24 '23

I have seen the same issue; I've even reproduced it using the counter demo app. It only happens to me when ACLs are enabled. Nomad 1.5.2-1, Consul 1.15.1-1, Ubuntu 20.04.

Deploy the demo app, add an additional tag, and re-deploy: you now have two instances registered in Consul. Exec into the downstream and curl $NOMAD_UPSTREAM_ADDR_servicename a number of times, and you will see some requests return "connection reset by peer", as Consul is now returning services that are "complete" in Nomad.
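
Roughly, the sequence is (a sketch; the file, service, and upstream names assume the standard countdash Connect demo, and <alloc-id> is a placeholder):

nomad job run countdash.nomad        # initial deploy
# edit countdash.nomad: add a tag to one of the services, then re-register
nomad job run countdash.nomad

# exec into the downstream (dashboard) task and hit the upstream repeatedly
nomad alloc exec <alloc-id> /bin/sh
for i in $(seq 1 20); do curl -s "$NOMAD_UPSTREAM_ADDR_count_api"; echo; done
# some requests fail with "connection reset by peer"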

hope this helps.

I have also reverted my lab to 1.5.1-1

martdah • Mar 24 '23

I had the same problem, and Nomad native service discovery (NSD) had the same problem as well.

chenjpu • Mar 24 '23

Hi, we had the same problem after upgrading Nomad from 1.4.5 to 1.4.7 and restarting the Consul agents on the nodes. It seems to only affect services in the Consul service mesh. After the upgrade, Nomad services using the connect stanza with proxied upstreams showed old versions of the deployment failing in Consul (v1.14.4).

Reverted to Nomad 1.4.5.

Regards.

ngcmac • Mar 25 '23

Hi everyone, and thanks for the information and additional context. We have been able to reproduce this locally and have some useful information to start investigating, so we will update here once we have anything more.

jrasell • Mar 27 '23

Additional repro that I've closed as a dupe, but just in case there's anything useful in the logs: https://github.com/hashicorp/nomad/issues/16739

tgross • Mar 31 '23

Hi everyone, we are continuing to look into this, and while we were able to reproduce it after a fashion, I wanted to gather some more information.

Those of you who have experienced this: are you setting the Consul agent ACL token via the consul acl set-agent-token command, the API equivalent, or via the agent config? This is a requirement in Consul v1.15.0 and later.

It seems to specifically affect Nomad v1.5.2, v1.4.7, and v1.3.12. If you do set the above token, are you able to provide context on the deployment that has the problem?
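
For reference, the non-config forms look like this (the token values are placeholders):

consul acl set-agent-token agent "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"

# or the HTTP API equivalent:
curl -X PUT -H "X-Consul-Token: <management-token>" \
  -d '{"Token": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"}' \
  http://127.0.0.1:8500/v1/agent/token/agent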

jrasell • Apr 03 '23

In my case, I set the token in the config file, like this:

acl {
  enabled = true
  enable_token_persistence = true
  default_policy = "deny"
  tokens {
    default = "XXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
  }
}

Is this unsupported now? (It's easier to set it in the config when deploying with tools like Ansible.)

dani • Apr 03 '23

We are also setting it via the Consul config:

{
  "acl": {
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    },
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enable_token_persistence": true
  }
}

ngcmac • Apr 05 '23

This issue is still present in Nomad v1.5.3.

suikast42 • Apr 05 '23

An observation from my side.

As an interim solution, I created this bash script to clean up the services that were not unregistered in Consul.

#!/bin/bash

CONSUL_HTTP_ADDR="http://consul.service.consul:8500" # no trailing slash, so the URLs below don't get a double slash
CONSUL_TOKEN="XXXX"

# Get all unhealthy checks
unhealthy_checks=$(curl -s --header "X-Consul-Token: ${CONSUL_TOKEN}" "${CONSUL_HTTP_ADDR}/v1/health/state/critical" | jq -c '.[]')

# Iterate over the unhealthy checks and deregister the associated service instances
echo "$unhealthy_checks" | while read -r check; do
  service_id=$(echo "$check" | jq -r '.ServiceID')
  node=$(echo "$check" | jq -r '.Node')

  if [ "$service_id" != "null" ] && [ "$node" != "null" ]; then
    echo "Deregistering unhealthy service instance: ${service_id} on node ${node}"
    curl --header "X-Consul-Token: ${CONSUL_TOKEN}" -X PUT "${CONSUL_HTTP_ADDR}/v1/catalog/deregister" -d "{\"Node\": \"${node}\", \"ServiceID\": \"${service_id}\"}"
  else
    echo "Skipping check with no associated service instance or node"
  fi
done

This works, and the service instances that are "dead" are removed from the Consul UI, but only for a few seconds. They reappear in the UI moments later. I cannot confirm this, but I suspect Nomad is re-registering them.

This is still on Nomad 1.5.2. Upgrading to 1.5.3 now to test again.
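
One thing worth checking: deregistrations made through /v1/catalog/deregister for services owned by a live agent are normally undone by that agent's anti-entropy sync, independently of Nomad. Deregistering on the owning agent itself tends to stick (a sketch; <node-address> is a placeholder for the node's agent HTTP address):

curl --header "X-Consul-Token: ${CONSUL_TOKEN}" -X PUT \
  "http://<node-address>:8500/v1/agent/service/deregister/${service_id}"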

CarelvanHeerden • Apr 13 '23

This works, and the service instances that are "dead" are removed from the Consul UI, but only for a few seconds. They reappear in the UI moments later. I cannot confirm this, but I suspect Nomad is re-registering them.

Indeed. The issue does not belong to Consul. If you restart the Nomad service, then the dead services disappear from Nomad and Consul.

This is still on Nomad 1.5.2. Upgrading to 1.5.3 now to test again.

1.5.3 has the same bug

suikast42 • Apr 13 '23

We have the same issue in our environment running Consul version 1.15.2 with Nomad version 1.5.2. FWIW, we are setting our agent ACL tokens via config.

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "allow",
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    }
  }
}

rgruyters • Apr 16 '23

We have the same issue in our environment running Consul version 1.15.2 with Nomad version 1.5.2. FWIW, we are setting our agent ACL tokens via config.

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "allow",
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    }
  }
}

I don't have ACLs enabled. I don't think this issue is related to the ACL system.

suikast42 • Apr 16 '23

Not sure if it's related, but I keep experiencing this without ACLs turned on. It's hard to pinpoint, but from what I've seen it mostly happens when an ASG cycles the Nomad hosts and the job is rescheduled on the new host. Combo is: Consul 1.15.2 and Nomad 1.5.3...

bfqrst • Apr 18 '23

I had this issue with the Nomad service provider without ACLs; the context and details are in #16890.

Ubuntu 22.04.2 LTS, Nomad 1.5.3, Docker 23.0.3

icyleaf • Apr 19 '23

Encountered this as well; I am able to reproduce it quite reliably with the following sequence:

  1. Restart the alloc

  2. Immediately afterwards, stop the alloc

Using Nomad 1.4.7.
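
For reference, the CLI equivalent of those two UI actions (the alloc ID is a placeholder):

nomad alloc restart <alloc-id>
nomad alloc stop <alloc-id>    # issued immediately after the restart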

[screenshot: stale service entries in the Consul UI]

The stale entries in Consul are automatically cleaned up after restarting the Nomad client where the allocation was placed.

I suspect it could be related to https://github.com/hashicorp/nomad/issues/16289, but I haven't confirmed.

Update: downgraded clients to 1.4.6, and I do not (seem to) see this issue anymore using the above steps.

fredwangwang • Apr 20 '23

https://github.com/hashicorp/nomad/pull/16905 should contain a fix for this - I've checked with the simple alloc restart and job stop repro described by @fredwangwang, but if other folks want to build the branch and confirm, that would be helpful.
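
If anyone wants to try it, roughly (a sketch; assumes a Go toolchain, and that the repo's dev make target is available):

git clone https://github.com/hashicorp/nomad.git
cd nomad
git fetch origin pull/16905/head:pr-16905 && git checkout pr-16905
make dev    # or a plain build: go build -o bin/nomad .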

shoenig • Apr 24 '23

@shoenig thanks for the fix!

fredwangwang • Apr 24 '23

Our issue is back again with Nomad version 1.5.5. The issue was re-introduced when we stopped (and purged) a job and re-deployed it. If more information is required, please let me know.

rgruyters • Jun 05 '23

@rgruyters a fix went into 1.5.6

shoenig • Jun 05 '23

Reproduced this in version 1.6.1 with ACL ((

dm-evstafiev • Aug 10 '23

Reproduced this in version 1.6.1 with ACL ((

Oh no, sorry this is still happening to you.

Would you be able to provide some reproduction steps?

lgfa29 • Aug 16 '23

https://github.com/hashicorp/nomad/issues/18203 reports a similar issue, also with Nomad 1.6.1, so there still seems to be a problem with services not being unregistered (either in Consul or Nomad). I'm going to reopen this one.

lgfa29 • Aug 16 '23

I am running standalone Nomad with ACLs (no Consul) and could repro this (see #18203, as mentioned previously).

blmhemu • Aug 18 '23

Running into this issue as well. CONSULVERSION=1.16.1 and NOMADVERSION=1.6.1

jzingh98 avatar Aug 21 '23 18:08 jzingh98