
Services not unregistered

dani opened this issue 1 year ago • 42 comments

Just upgraded to Nomad 1.5.2. Since then, services are not always unregistered from the Consul service catalog when they are shut down or upgraded, so old service versions appear as failed, e.g.

[screenshot: old service versions showing as failed in the Consul UI]

Environment:

  • Nomad 1.5.2 (with ACL), using the pre-built binary
  • Consul 1.15.1 (with ACL), using the pre-built binary
  • Alma Linux 8

I haven't yet found a pattern that reproduces it 100% of the time.

dani • Mar 23 '23

Hi @dani, do you have any logs from the clients that were running the allocations whose services should have been deregistered? If you can pass them along, I'll take a look through them and see if I can identify anything useful. Any other information would also be great, to help us try to reproduce this.

jrasell • Mar 23 '23

1.5.2 included https://github.com/hashicorp/nomad/pull/16289/files, which was supposed to fix a bug where we would attempt to deregister services twice. The key difference is that we now set a flag marking the services as deregistered after the PostRun() allocrunner hook runs, preventing further attempts at deregistration.

Thinking about it now and reading our own docs, it is unclear whether PostRun implies an alloc is terminal ... if it isn't, and the services get re-registered for the same allocation, they'll never be deregistered.
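
To illustrate the suspected interaction (a minimal Go sketch; the type and method names are hypothetical, not Nomad's actual allocrunner code):

package main

import "fmt"

// serviceHook stands in for the per-alloc service lifecycle hook.
type serviceHook struct {
	deregistered bool // set once services have been deregistered
}

func (h *serviceHook) Register(name string) {
	fmt.Println("register:", name)
}

// PostRun deregisters services, but only once: the flag suppresses
// every later attempt.
func (h *serviceHook) PostRun(name string) {
	if h.deregistered {
		return
	}
	fmt.Println("deregister:", name)
	h.deregistered = true
}

func main() {
	h := &serviceHook{}
	h.Register("prometheus")
	h.PostRun("prometheus") // first deregistration works

	// If the alloc was not actually terminal and the service gets
	// re-registered, the flag makes this a no-op and the stale
	// registration is never removed:
	h.Register("prometheus")
	h.PostRun("prometheus")
}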

shoenig • Mar 23 '23

I was just able to trigger it on my Prometheus job:

  • I have a version of Prometheus running, with 2 instances and 2 services registered in Consul
  • I change a config file and run the job again; Nomad starts the rolling update
  • In the end, I still have 2 instances running, 2 working services, and 2 failed services (corresponding to the 2 previous ones)

Here are my system logs from this rolling update:

nomad_unreg.txt

dani • Mar 23 '23

Hi @dani, I've not been able to reproduce this locally yet; are you able to share the jobspec (or a redacted version of it), and what exactly is being changed before you register the new version? Thanks.

jrasell • Mar 24 '23

OK, this particular job file was quite big; I'll try to reproduce with a simpler one (but I'll first have to install 1.5.2 again, as I had to revert to 1.5.1 because this issue made my test cluster totally unusable).
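
A minimal job along these lines should exercise the same rolling-update path (a sketch; the job name, image, and port are hypothetical, not my actual spec):

job "whoami" {
  datacenters = ["dc1"]

  update {
    max_parallel = 1
  }

  group "web" {
    count = 2

    network {
      port "http" {
        to = 80
      }
    }

    service {
      name = "whoami"
      port = "http"
      # bump this tag and re-run the job to trigger a rolling update
      tags = ["v1"]

      check {
        type     = "http"
        path     = "/"
        interval = "10s"
        timeout  = "2s"
      }
    }

    task "server" {
      driver = "docker"

      config {
        image = "traefik/whoami"
        ports = ["http"]
      }
    }
  }
}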

dani • Mar 24 '23

I have seen the same issue; I've even reproduced it using the counter demo app. It only happens to me when ACLs are enabled. Nomad 1.5.2-1, Consul 1.15.1-1, Ubuntu 20.04.

Deploy the demo app, add an additional tag, and re-deploy: you now have two instances registered in Consul. Exec into the downstream and curl $NOMAD_UPSTREAM_ADDR_servicename a number of times, and you will see some requests return "connection reset by peer", as Consul is now returning services that are "complete" in Nomad.
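
Roughly, the sequence is (a sketch; the file, service, and upstream names assume the standard countdash Connect demo, and <alloc-id> is a placeholder):

nomad job run countdash.nomad        # initial deploy
# edit countdash.nomad: add a tag to one of the services, then re-register
nomad job run countdash.nomad

# exec into the downstream (dashboard) task and hit the upstream repeatedly
nomad alloc exec <alloc-id> /bin/sh
for i in $(seq 1 20); do curl -s "$NOMAD_UPSTREAM_ADDR_count_api"; echo; done
# some requests fail with "connection reset by peer"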

hope this helps.

I have also reverted my lab to 1.5.1-1

martdah • Mar 24 '23

I had the same problem, and Nomad native service discovery (NSD) had the same problem as well.

chenjpu • Mar 24 '23

Hi, we had the same problem after upgrading Nomad from 1.4.5 to 1.4.7 and restarting the Consul agents on the nodes. It seems to only affect services in the Consul service mesh. After the upgrade, Nomad services using the connect stanza with proxied upstreams showed old versions of the deployment failing in Consul (v1.14.4).

Reverted to Nomad 1.4.5.

Regards.

ngcmac • Mar 25 '23

Hi everyone, and thanks for the information and additional context. We have been able to reproduce this locally and have some useful information to start investigating, so we will update here once we have anything more.

jrasell • Mar 27 '23

Additional repro that I've closed as a dupe, but just in case there's anything useful in the logs: https://github.com/hashicorp/nomad/issues/16739

tgross • Mar 31 '23

Hi everyone, we are continuing to look into this, and while we were able to reproduce it after a fashion, I wanted to gather some more information.

Those of you who have experienced this: are you setting the Consul agent ACL token via the consul acl set-agent-token command, the API equivalent, or via the agent config? This is a requirement in Consul v1.15.0 and later.

It seems to specifically affect Nomad v1.5.2, v1.4.7, and v1.3.12. If you do set the above token, are you able to provide context on the deployment that has the problem?
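
For reference, the non-config forms look like this (the token values are placeholders):

consul acl set-agent-token agent "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"

# or the HTTP API equivalent:
curl -X PUT -H "X-Consul-Token: <management-token>" \
  -d '{"Token": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"}' \
  http://127.0.0.1:8500/v1/agent/token/agent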

jrasell • Apr 03 '23

In my case, I set the token in the config file, like this:

acl {
  enabled = true
  enable_token_persistence = true
  default_policy = "deny"
  tokens {
    default = "XXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
  }
}

Is this unsupported now? (It's easier to set it in the config when deploying with tools like Ansible.)

dani • Apr 03 '23

We are also setting it via the Consul config:

{
  "acl": {
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    },
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enable_token_persistence": true
  }
}

ngcmac • Apr 05 '23

This issue is still present in Nomad v1.5.3.

suikast42 • Apr 05 '23

An observation from my side.

As an interim solution, I created this bash script to clean up the services that were not unregistered in Consul.

#!/bin/bash

CONSUL_HTTP_ADDR="http://consul.service.consul:8500" # no trailing slash, so the URLs below don't get a double slash
CONSUL_TOKEN="XXXX"

# Get all unhealthy checks
unhealthy_checks=$(curl -s --header "X-Consul-Token: ${CONSUL_TOKEN}" "${CONSUL_HTTP_ADDR}/v1/health/state/critical" | jq -c '.[]')

# Iterate over the unhealthy checks and deregister the associated service instances
echo "$unhealthy_checks" | while read -r check; do
  service_id=$(echo "$check" | jq -r '.ServiceID')
  node=$(echo "$check" | jq -r '.Node')

  if [ "$service_id" != "null" ] && [ "$node" != "null" ]; then
    echo "Deregistering unhealthy service instance: ${service_id} on node ${node}"
    curl --header "X-Consul-Token: ${CONSUL_TOKEN}" -X PUT "${CONSUL_HTTP_ADDR}/v1/catalog/deregister" -d "{\"Node\": \"${node}\", \"ServiceID\": \"${service_id}\"}"
  else
    echo "Skipping check with no associated service instance or node"
  fi
done

This works, and the service instances that are "dead" are removed from the Consul UI, but only for a few seconds. They reappear in the UI moments later. I cannot confirm this, but I suspect Nomad is re-registering them.

This is still on Nomad 1.5.2. Upgrading to 1.5.3 now to test again.
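
One thing worth checking: deregistrations made through /v1/catalog/deregister for services owned by a live agent are normally undone by that agent's anti-entropy sync, independently of Nomad. Deregistering on the owning agent itself tends to stick (a sketch; <node-address> is a placeholder for the node's agent HTTP address):

curl --header "X-Consul-Token: ${CONSUL_TOKEN}" -X PUT \
  "http://<node-address>:8500/v1/agent/service/deregister/${service_id}"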

CarelvanHeerden • Apr 13 '23

This works, and the service instances that are "dead" are removed from the Consul UI, but only for a few seconds. They reappear in the UI moments later. I cannot confirm this, but I suspect Nomad is re-registering them.

Indeed. The issue does not belong to Consul. If you restart the Nomad service, then the dead services disappear from Nomad and Consul.

This is still on Nomad 1.5.2. Upgrading to 1.5.3 now to test again.

1.5.3 has the same bug

suikast42 • Apr 13 '23

We have the same issue in our environment running Consul version 1.15.2 with Nomad version 1.5.2. FWIW, we are setting our agent ACL tokens via config.

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "allow",
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    }
  }
}

rgruyters • Apr 16 '23

We have the same issue in our environment running Consul version 1.15.2 with Nomad version 1.5.2. FWIW, we are setting our agent ACL tokens via config.

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "allow",
    "enabled": true,
    "tokens": {
      "agent": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    }
  }
}

I don't have ACLs enabled. I don't think this issue is related to the ACL system.

suikast42 • Apr 16 '23

Not sure if it's related, but I keep experiencing this without ACLs turned on. It's hard to pinpoint, but from what I've seen it mostly happens when an ASG cycles the Nomad hosts and the job is rescheduled on the new host. Combo is: Consul 1.15.2 and Nomad 1.5.3...

bfqrst • Apr 18 '23

I had this issue with the Nomad service provider without ACLs; the context and details are in #16890.

Ubuntu 22.04.2 LTS, Nomad 1.5.3, Docker 23.0.3

icyleaf • Apr 19 '23

Encountered this as well; I am able to reproduce it quite reliably with the following sequence:

  1. Restart the alloc

  2. Immediately afterwards, stop the alloc

Using Nomad 1.4.7.
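
For reference, the CLI equivalent of those two UI actions (the alloc ID is a placeholder):

nomad alloc restart <alloc-id>
nomad alloc stop <alloc-id>    # issued immediately after the restart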

[screenshot: stale service entries in the Consul UI]

The stale entries in Consul are automatically cleaned up after restarting the Nomad client where the allocation was placed.

I suspect it could be related to https://github.com/hashicorp/nomad/issues/16289, but I haven't confirmed.

Update: downgraded clients to 1.4.6, and I do not (seem to) see this issue anymore using the above steps.

fredwangwang • Apr 20 '23

https://github.com/hashicorp/nomad/pull/16905 should contain a fix for this - I've checked with the simple alloc restart and job stop repro described by @fredwangwang, but if other folks want to build the branch and confirm, that would be helpful.
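
If anyone wants to try it, roughly (a sketch; assumes a Go toolchain, and that the repo's dev make target is available):

git clone https://github.com/hashicorp/nomad.git
cd nomad
git fetch origin pull/16905/head:pr-16905 && git checkout pr-16905
make dev    # or a plain build: go build -o bin/nomad .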

shoenig • Apr 24 '23

@shoenig thanks for the fix!

fredwangwang • Apr 24 '23

Our issue is back again with Nomad version 1.5.5. The issue was re-introduced when we stopped (and purged) a job and re-deployed it. If more information is required, please let me know.

rgruyters • Jun 05 '23

@rgruyters a fix went into 1.5.6

shoenig • Jun 05 '23

Reproduced this in version 1.6.1 with ACL ((

dm-evstafiev • Aug 10 '23

Reproduced this in version 1.6.1 with ACL ((

Oh no, sorry this is still happening to you.

Would you be able to provide some reproduction steps?

lgfa29 • Aug 16 '23

https://github.com/hashicorp/nomad/issues/18203 reports a similar issue, also with Nomad 1.6.1, so there still seems to be a problem with services not being unregistered (either in Consul or Nomad). I'm going to reopen this one.

lgfa29 • Aug 16 '23

I am running standalone Nomad with ACLs (no Consul) and could repro this (see #18203, as mentioned previously).

blmhemu • Aug 18 '23

Running into this issue as well. CONSULVERSION=1.16.1 and NOMADVERSION=1.6.1

jzingh98 avatar Aug 21 '23 18:08 jzingh98