
Nomad Service Discovery unable to find service

vincenthuynh opened this issue 2 years ago • 11 comments

Nomad version

Nomad v1.4.7

Operating system and Environment details

Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux

Issue

Allocation is unable to find Nomad service when it exists. It seems to start happening on a client after an uptime of 2-3 days.

Reproduction steps

  • Task 1: register a service myservice using the Nomad provider
  • Task 2: use template stanza and NomadService function to reference the service that was registered in Task 1

Able to list service:

$ nomad service list -namespace="*"
Service Name  Namespace  Tags
myservice     default    []

Expected Result

Able to discover a service consistently

Actual Result

Task log:

Template | Missing: nomad.service(myservice)

Job file (if appropriate)

Task 1:

    service {
      provider = "nomad"
      name     = "myservice"
      port     = "redis"
    }

Task 2:

      template {
        data = <<EOH
{{range nomadService "myservice"}}
spring.redis.host: {{ .Address }}
spring.redis.port: {{ .Port }}
{{end}}
EOH
        destination = "local/config/application.yml"
      }

Nomad Client logs

2023-04-25T16:10:02.354Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 5 after "4s")
2023-04-25T16:10:06.355Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 6 after "8s")
2023-04-25T16:10:14.356Z [WARN]  agent: (view) nomad.service(myservice): Get "http://127.0.0.1:4646/v1/service/myservice?namespace=default&stale=&wait=60000ms": closed (retry attempt 7 after "16s")

vincenthuynh avatar Apr 25 '23 16:04 vincenthuynh

The workaround is to restart the Nomad service/agent on the client node.

vincenthuynh avatar Apr 25 '23 16:04 vincenthuynh

Hi @vincenthuynh, so far I haven't been able to reproduce what you're seeing - in my case the template is always successfully rendered once the upstream task is started and its service is registered. Before I dig in further, could you post a complete job file that reproduces the issue? I want to make sure we're not missing something (e.g. using group vs. task services, etc.)

The test job file I've been using:

bug.hcl
job "bug" {

  group "group" {
    network {
      port "http" {
        to = 8080
      }
    }

    task "python" {
      driver = "raw_exec"

      config {
        command = "python3"
        args = ["-m", "http.server", "8080"]
      }

      service {
        provider = "nomad"
        name = "python"
        port = "http"
      }

      resources {
        cpu    = 10
        memory = 32
      }
    }

    task "client" {
      driver = "raw_exec"

      template {
        data = <<EOH
{{range nomadService "python"}}
blah.host: {{ .Address }}
blah.port: {{ .Port }}
{{end}}
EOH
        destination = "local/config/application.yml"
      }

      config {
        command = "sleep"
        args = ["infinity"]
      }
    }
  }
}

shoenig avatar Apr 27 '23 16:04 shoenig

Hi @shoenig,

We've noticed that it takes a few days (2-3 days) before it starts happening.

Here's another reproduction:

  • An old allocation was stopped and a new one was created, and it happened to land on the same node (screenshot omitted).
  • The new allocation is unable to find the service (screenshot omitted).
  • Applying the workaround: simply restarting the Nomad service on the client allows the task to discover the service again and start successfully.

Here's our job file:

myservice.hcl
job "myservice" {

  group "myservice" {
    network {
      mode = "bridge"
    }

    service {
      name = "myservice"
      port = "8080"
      tags = [
        "env=${var.env}",
        "version=${var.version}"]
      connect {
        sidecar_service {}
      }
    }

    task "myservice" {
      driver = "docker"
      leader = true
      config {
        image = "gcr.io/myservice"
        work_dir = "/local"
      }

      vault {
        policies = ["myservice"]
        change_mode = "noop"
      }

      template {
        data = <<EOH
{{range nomadService "myservice-cache-redis"}}
spring.redis.host: {{ .Address }}
spring.redis.port: {{ .Port }}
{{end}}
EOH
        destination = "local/config/application.yml"
        change_mode = "noop"
      }
    }
  }
    
  group "myservice-cache" {
    network {
      mode = "bridge"
      port "redis" {
        to = 6379
      }
    }

    service {
      provider = "nomad"
      name = "myservice-cache-redis"
      port = "redis"
    }

    task "redis" {
      driver = "docker"
      config {
        image = "redis:6.2.0"
        args  = ["redis-server", "/local/redis.conf"]
      }
      template {
        data = <<-EOH
maxmemory 250mb
maxmemory-policy allkeys-lru
EOH
        destination = "/local/redis.conf"
        change_mode = "noop"
      }
      resources {
        cpu    = 100
        memory = 256
      }
    }
  }
}

Hope that helps. Thanks!

vincenthuynh avatar Apr 27 '23 18:04 vincenthuynh

I encountered a similar issue caused by having NOMAD_ADDR set in the environment that nomad agent was run in. That var apparently went through to the Nomad API client that consul-template uses, and caused it to fail its API calls (in my case, for HTTP vs. HTTPS reasons) for the nomadService lookup.

My errors happened very consistently, so different I think from this case, but wanted to mention here for anyone else who finds this issue like I did. My solution was to ensure NOMAD_ADDR is not set in my nomad agent environment.
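For anyone scripting around this, a minimal sketch of that fix, assuming you launch the agent from a wrapper (the helper name is mine, not Nomad's):

```python
# Illustrative only: strip NOMAD_* variables from the environment a wrapper
# would hand to the nomad agent, so the embedded consul-template API client
# can't inherit them.
def scrub_agent_env(env):
    return {k: v for k, v in env.items() if not k.startswith("NOMAD_")}

env = {"PATH": "/usr/bin", "NOMAD_ADDR": "https://nomad.example.com:4646"}
print(scrub_agent_env(env))  # {'PATH': '/usr/bin'}
```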

gulducat avatar Jun 07 '23 18:06 gulducat

This is happening occasionally to me as well (Nomad 1.5.3). It doesn't seem to be consistent as to which service or which host the service disappears from.

To add another odd detail rather than just bumping: the service shows up in the UI; however, it does not show on any of the nodes via the CLI.


Restarting the allocation seems to resolve the issue and force Nomad to re-register the service.

Unfortunately, this time it was my log aggregator that disappeared, so I don't have an easy way to pull logs from around the time of the issue. I'll try to grab them the next time it happens to a different service.

IamTheFij avatar Jun 20 '23 15:06 IamTheFij

This issue still consistently happens for us every 2-3 days. I can observe exactly the same as @IamTheFij; however, we run Nomad 1.6.3.

This is happening occasionally to me as well (Nomad 1.5.3). It doesn't seem to be consistent as to which service or which host the service disappears from.

To add another odd detail rather than just bumping, the service shows up in the UI, however it does not show on any of the nodes via the CLI.

tfritzenwallner-private avatar Nov 22 '23 09:11 tfritzenwallner-private

I have observed the same issue from Nomad 1.7.3 to 1.7.7.

mikedvinci90 avatar May 05 '24 12:05 mikedvinci90

I observe the same problem in v1.8.1.

benbourner avatar Jun 23 '24 07:06 benbourner

Seeing the same problem: services are clearly visible in the Nomad UI, but cannot be used by templating.

Nomad 1.7.7 (multi-region, multi-dc and ACL enabled)

(Consul-based service templating works fine and reliably, as opposed to Nomad-based services.)

dmclf avatar Jun 24 '24 08:06 dmclf

Yet another example in Nomad 1.8.1; it's happening randomly among my services. Because I have Traefik parsing the Nomad services, they just disappear from Traefik and are thus inaccessible. After running rock-solid for years, the Nomad deployments are now just unreliable... :(

Services are up, healthy and reachable on the given ports (screenshot omitted)...

But the service allocations have again disappeared, so Traefik no longer sees them and I can't access them via their proper URLs (screenshot omitted).

benbourner avatar Jun 27 '24 16:06 benbourner

I've seen this occurring frequently under poor network conditions where

  1. the client misses a heartbeat to the server
  2. the server unregisters the services managed by this client in response
  3. the client responds to the next heartbeat
  4. but the server does not seem to register the services back

I don't know if this is intended and if it is the same issue people are having here.

As a workaround, I was restarting the Nomad agent on the client every 20 mins. (I didn't need HA)

dannyhpy avatar Jun 27 '24 20:06 dannyhpy

I had the same problem in v1.9.4

devops-eita-nl avatar Jan 31 '25 00:01 devops-eita-nl

Hi folks! This fell thru the cracks of issue triage somehow. I've tagged it for roadmapping.

(Internal ref: https://hashicorp.atlassian.net/browse/NMD-516)

tgross avatar Feb 10 '25 21:02 tgross

Hi folks! This fell thru the cracks of issue triage somehow. I've tagged it for roadmapping.

(Internal ref: https://hashicorp.atlassian.net/browse/NET-12138)

Hello,

Is this now on the roadmap? Is there a timeline associated with the roadmap that you can share?

Thanks!

klong1-godaddy avatar Jul 21 '25 17:07 klong1-godaddy

Hi @klong1-godaddy; it is currently on the backlog, but we do not have a timeline associated with it. I have mentioned this internally to the team, so they are more aware of it. One note from some initial internal investigation is that we are struggling to reproduce it, so if anyone has a reliable reproduction path it would greatly help. Thanks.

jrasell avatar Jul 22 '25 07:07 jrasell

Hi folks, I've done some investigation and it looks like there's a couple different problems going on here. I don't yet have a solution to the core problem, but I've got some working hypotheses that I'll continue pressing on.

tl;dr What you can do to help

If you can get debug-level server logs from the leader, and/or client logs from a host where a service should be running around the time that it happens, that would be super helpful. I'm especially looking for logs about the service registration or missed heartbeats. And I'm also looking for any information you can provide about the history of allocations that provide the missing service on that client. (i.e. their Task Events with timestamps).

Symptoms

The symptoms we've seen reported here are:

  1. traefik can't find services even though they appear in the web UI
  2. The template block can't find services even though they appear in the web UI
  3. The template block can't find services even though they appear in the CLI
  4. CLI can't find services even though they appear in the web UI
  5. Services are deregistered when a node misses heartbeats and aren't restored.
  6. "Restarting the allocation seems to resolve the issue and force Nomad to re-register the service."
  7. "Simply restarting the nomad service on the client allows the task to discover the service again and start successfully."
  8. "Service are up, healthy and reachable on the given ports..."
  9. Things are working, sometimes for quite a while, and then they're not.
  10. In one case we see that "An old allocation was stopped and a new one was created and it happened to be on the same node".

These mostly boil down into two problems: there are discrepancies in the web UI, and the more fundamental problem that services are going missing.

Web UI Discrepancy (Solved!)

The web UI represents services as belonging to a job, not to the namespace more generally, so it's hitting /v1/job/:jobName/services. When I tested that by downing a node, I discovered that the web UI doesn't clean up the services table on /ui/jobs/:jobName@:namespace/services when services are removed, because at that point it's detected them and uses the API query to track the number of instances of the service. We can see in https://github.com/hashicorp/nomad/issues/16983#issuecomment-2195196447 the "Number of Allocations" column being 0. So that's a surprising bit of UX perhaps, but not strictly speaking incorrect.

Example API responses

Node is up:

$ nomad operator api "/v1/services" | jq .
[
  {
    "Namespace": "default",
    "Services": [
       {
        "ServiceName": "httpd-web",
        "Tags": []
      }
    ]
  }
]

$ nomad operator api "/v1/service/httpd-web" | jq .
[
  {
    "Address": "192.168.1.194",
    "AllocID": "a7927dd0-5c97-0c95-6fd2-5b0db9b82ac5",
    "CreateIndex": 85,
    "Datacenter": "dc1",
    "ID": "_nomad-task-a7927dd0-5c97-0c95-6fd2-5b0db9b82ac5-group-web-httpd-web-www",
    "JobID": "httpd",
    "ModifyIndex": 85,
    "Namespace": "default",
    "NodeID": "00548eeb-61c4-adbd-a105-aabf387cd4ba",
    "Port": 27369,
    "ServiceName": "httpd-web",
    "Tags": []
  }
]

$ nomad operator api "/v1/job/httpd/services" | jq .
[
  {
    "Address": "192.168.1.194",
    "AllocID": "a7927dd0-5c97-0c95-6fd2-5b0db9b82ac5",
    "CreateIndex": 85,
    "Datacenter": "dc1",
    "ID": "_nomad-task-a7927dd0-5c97-0c95-6fd2-5b0db9b82ac5-group-web-httpd-web-www",
    "JobID": "httpd",
    "ModifyIndex": 85,
    "Namespace": "default",
    "NodeID": "00548eeb-61c4-adbd-a105-aabf387cd4ba",
    "Port": 27369,
    "ServiceName": "httpd-web",
    "Tags": []
  }
]

Bring the node down, and we see empty API responses but the web UI table is still there.

$ nomad operator api "/v1/service/httpd-web" | jq .
[]

$ nomad operator api "/v1/services" | jq .
[]

$ nomad operator api "/v1/job/httpd/services" | jq .
[]

We should probably have a separate Services page to make this relationship more clear.
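For illustration, the counting behavior above can be sketched in a few lines (the helper and sample payload are mine, not the UI's code; the real table is driven by /v1/job/:jobName/services): once a row has been rendered, an empty response drives its count to 0 rather than removing it.

```python
import json

# Toy model of the web UI's per-job services table (not the actual UI code).
node_up = json.loads('[{"ServiceName": "httpd-web", "AllocID": "a7927dd0-5c97-0c95-6fd2-5b0db9b82ac5"}]')
node_down = json.loads('[]')

def instance_counts(known_rows, response):
    # known_rows: service names the UI has already rendered once
    counts = {name: 0 for name in known_rows}
    for svc in response:
        counts[svc["ServiceName"]] += 1
    return counts

print(instance_counts({"httpd-web"}, node_up))    # {'httpd-web': 1}
print(instance_counts({"httpd-web"}, node_down))  # {'httpd-web': 0}
```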

But that only solves the weird discrepancy between the web UI and the API, it doesn't explain the more important problem of why the services are missing even though the job has live allocations.

Missing services

I've investigated a few dead ends:

  • Maybe when a node goes down we don't mark system jobs down and recreate their allocs, so there's no way to detect the alloc is live again to re-register its services. But I've verified in testing that's not how it works -- system allocs are marked lost and replaced. And in any case, the services folks have reported here are mostly (all?) from service jobs.
  • Maybe when a node is marked down/disconnected we miss service health checks? But Nomad service health checks don't heartbeat, so we can't update their health while disconnected. And none of the services folks have reported here include health checks.
  • Maybe an index isn't getting updated, so blocking queries aren't unblocking. But new queries would detect the updated state because they wouldn't have a wait index.

I've got two promising hypotheses though:

Hypothesis 1

Maybe there's a race condition between when one allocation for a given service stops and another for the same service starts on the same node. In that case, we could potentially interleave start and stop RPCs to the server.

The Consul provider has a complicated sync process, where the provider (the "service client") batches up register/deregister operations and then syncs them as needed. There's also anti-entropy here so that we track which services we expect to have and which ones we don't, and make sure the Consul agent has them.

The Nomad provider doesn't have this kind of logic. Instead, it registers/deregisters on-demand as soon as the workload hits the appropriate hooks. However, those deregistration requests are for a specific ID, which is a compound ID that includes the allocation ID. So one allocation can't deregister a service owned by another unless the RPC handler screws that up or we give one RemoveWorkload call the alloc ID for another workload.
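To illustrate (the helper is mine, and the ID shape is inferred from the examples elsewhere in this thread), the alloc ID baked into the service ID is what rules out cross-allocation deregistration in the normal path:

```python
# Build the compound service ID shape seen in this thread:
# _nomad-task-<allocID>-group-<group>-<service>-<portLabel>
def service_id(alloc_id, group, service, port_label):
    return f"_nomad-task-{alloc_id}-group-{group}-{service}-{port_label}"

old = service_id("a7927dd0-5c97-0c95-6fd2-5b0db9b82ac5", "web", "httpd-web", "www")
new = service_id("8c876b11-e89a-0bd8-f86b-8156debfb39b", "web", "httpd-web", "www")

registry = {old, new}
registry.discard(old)   # the old alloc can only deregister its own ID
print(new in registry)  # True: the replacement alloc's service survives
```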

Hypothesis 2

One way that we know that a service can be deregistered is when a node misses a heartbeat. In that case, the server marks it down (or disconnected if that's enabled). When the node status is down (but not disconnected) we deregister all services on the node (ref state_store.go#L1153-L1156), because all the workloads should be rescheduled.

  • If the node reconnects and heartbeats again, all its workloads will have been stopped at the server, and so we don't expect the specific service instance on that node to come back unless we place new allocations there.
  • However, we don't actually reschedule all workloads because jobs of type="system" don't get rescheduled. Those system job allocs should still be marked for stop and then recreated when the node comes back up. Again, that's a new service instance and not the existing one.
  • If a node is marked "disconnected" and not down, we won't have removed the services.
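A toy model of that sequence (types and names are mine, not Nomad's) shows why a quick reconnect alone never brings the services back:

```python
# Simulates: node marked down -> its services deleted -> node heartbeats
# again before the resulting evals run -> nothing re-registers the services.
class ToyStateStore:
    def __init__(self):
        self.services = {}  # service ID -> node ID

    def register(self, sid, node):
        self.services[sid] = node

    def node_down(self, node):
        # analogous to the deletion in state_store.go cited above
        self.services = {s: n for s, n in self.services.items() if n != node}

    def heartbeat(self, node):
        pass  # reconnecting alone re-registers nothing

store = ToyStateStore()
store.register("_nomad-task-example-group-web-httpd-web-www", "node-1")
store.node_down("node-1")   # missed heartbeat
store.heartbeat("node-1")   # reconnects before evals are processed
print(store.services)       # {} -- gone until the client agent restarts
```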

The other thing we understand is that restarting the Nomad client agent will cause it to "restore" the allocrunner that it uses to monitor the allocations. That hits the Prerun method in the group service hook and forces re-registration and restarts the check watchers (if any). So it makes sense that would "fix" the problem if there's any live allocation that's missing a service.

It could be possible for a node to be late for heartbeats, be marked down and have its services removed, only to immediately heartbeat and reconnect before the resulting evaluations are processed. We can simulate this by manipulating the eval broker and the client process:

$ nomad operator scheduler set-config -pause-eval-broker=true
Scheduler configuration updated!

# 81056 is nomad client PID
$ sudo $(which dlv) attach 81056
Type 'help' for list of commands.
(dlv)

Wait for the node to go down. Exit dlv without killing the process. Then wait for the node to become ready. The service will no longer appear:

$ nomad service info -json httpd-web
Error listing service registrations: No service registrations with prefix "httpd-web" found

But if we then restart the client, the service comes back up:

$ nomad service info -json httpd-web
[
    {
        "Address": "192.168.1.194",
        "AllocID": "8c876b11-e89a-0bd8-f86b-8156debfb39b",
        "CreateIndex": 222,
        "Datacenter": "dc1",
        "ID": "_nomad-task-8c876b11-e89a-0bd8-f86b-8156debfb39b-group-web-httpd-web-www",
        "JobID": "httpd",
        "ModifyIndex": 222,
        "Namespace": "default",
        "NodeID": "00548eeb-61c4-adbd-a105-aabf387cd4ba",
        "Port": 20994,
        "ServiceName": "httpd-web",
        "Tags": []
    }
]

So that definitely could be what's happening, but without more data around that possibility we can't prove that's the problem at hand. This also feels like a narrow race unless the cluster's eval broker is very backed up.

tgross avatar Jul 31 '25 19:07 tgross

I've got a fix for "hypothesis 2" over in https://github.com/hashicorp/nomad/pull/26424. To test this I ran a job with a Nomad service ("httpd").

$ nomad service info -json httpd
[
    {
        "Address": "192.168.1.194",
        "AllocID": "258a6e39-c6ce-3f13-d72f-911e6785e726",
        "CreateIndex": 28,
        "Datacenter": "dc1",
        "ID": "_nomad-task-258a6e39-c6ce-3f13-d72f-911e6785e726-group-web-httpd-www",
        "JobID": "example",
        "ModifyIndex": 28,
        "Namespace": "default",
        "NodeID": "63eaad27-002c-a5d0-c19b-f1a3f63354c4",
        "Port": 24411,
        "ServiceName": "httpd",
        "Tags": []
    }
]

First we'll demonstrate that marking the node down without an eval marking the allocs down no longer deletes the services. As above, we'll pause the eval broker and the node.

$ nomad operator scheduler set-config -pause-eval-broker=true
Scheduler configuration updated!

# 88981 is nomad client PID
$ sudo $(which dlv) attach 88981
Type 'help' for list of commands.
(dlv)

Wait for the node to go down. We'll see the allocations are left running as we'd expect, and the service still exists.

$ nomad node status
ID        Node Pool  DC   Name                                             Class  Drain  Eligibility  Status
63eaad27  default    dc1  li-4c4c4544-0036-3410-804a-b6c04f423534.ibm.com  local  false  eligible     down

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
258a6e39  63eaad27  web         0        run      running  4m3s ago  3m52s ago

$ nomad service info -json httpd | jq '.[0].ID'
"_nomad-task-258a6e39-c6ce-3f13-d72f-911e6785e726-group-web-httpd-www"

Unblock client and allow it to come back up, then unblock the scheduler. In this order of operations, the allocation is still running and the service is still up.

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
258a6e39  63eaad27  web         0        run      running  5m40s ago  5m29s ago

$ nomad service info -json httpd | jq '.[0].ID'
"_nomad-task-258a6e39-c6ce-3f13-d72f-911e6785e726-group-web-httpd-www"

Stop the eval broker and pause the node again. Wait for it to go down and then this time we'll unpause the eval broker first. The allocation will be marked lost and the service will be deleted:

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created    Modified
258a6e39  63eaad27  web         0        stop     lost    7m28s ago  5s ago

$ nomad service info -json httpd
Error listing service registrations: No service registrations with prefix "httpd" found

Then unpause the node so the blocked eval runs and we get a new alloc and a new copy of the service:

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
326b1f03  63eaad27  web         0        run      running   10s ago    9s ago
258a6e39  63eaad27  web         0        stop     complete  8m22s ago  5s ago

$ nomad service info -json httpd
[
    {
        "Address": "192.168.1.194",
        "AllocID": "326b1f03-9015-5937-5010-c14d32a0fa47",
        "CreateIndex": 64,
        "Datacenter": "dc1",
        "ID": "_nomad-task-326b1f03-9015-5937-5010-c14d32a0fa47-group-web-httpd-www",
        "JobID": "example",
        "ModifyIndex": 64,
        "Namespace": "default",
        "NodeID": "63eaad27-002c-a5d0-c19b-f1a3f63354c4",
        "Port": 26644,
        "ServiceName": "httpd",
        "Tags": []
    }
]

tgross avatar Aug 04 '25 19:08 tgross

https://github.com/hashicorp/nomad/pull/26424 has been merged. We're going to close this issue as we believe this should solve the problem described here. This will ship in the upcoming version of Nomad.

If you encounter this issue again with a cluster that has been fully upgraded to Nomad 1.10.4 (or Nomad Enterprise 1.9.12+ent or Nomad Enterprise 1.8.16+ent), then please report that here with the following:

  • Debug-level server logs from the leader
  • Client logs from the host where a service should be running, around the time that it disappears.
  • Any information you can provide about the history of allocations that provide the missing service on that client. (i.e. their Task Events with timestamps).

tgross avatar Aug 06 '25 17:08 tgross

It does seem somewhat likely for the eval broker to be backed up at the time of such an event. Generally I expect Nomad client nodes get disconnected while something awful is happening. For example, during the last event where we saw this there were networking issues (hence the node disconnect), and I would expect issues such as this may also be heavy on the eval broker. Is there a way to monitor how backed up the eval broker is? That seems like it would be useful data to have, if it spikes at the same point we have services disappear.

Unfortunately, as you have found, replicating this problem on demand is not easy. If I ever end up with debug logs I can share, I will, but it seems unlikely: the event itself does not happen often, I don't think we have seen it yet in our dev environment, and we generally don't run with debug logs enabled in any case.

klong1-godaddy avatar Aug 06 '25 23:08 klong1-godaddy

Is there a way to monitor how backed up the eval broker is?

Yes, you should be monitoring the various performance metrics described in the monitoring docs (see also the reference table). In particular the nomad.nomad.broker.* metrics will be useful here. Unless you're running heavy-throughput batch workloads you'll expect the unacked and pending evals should be very low.
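For example, a quick way to pull just those gauges out of a /v1/metrics payload (the sample payload below is hand-written; fetch a real one with `nomad operator api /v1/metrics` and check your agent's actual metric names):

```python
import json

payload = json.loads("""
{"Gauges": [
  {"Name": "nomad.nomad.broker.total_unacked", "Value": 0},
  {"Name": "nomad.nomad.broker.total_pending", "Value": 2},
  {"Name": "nomad.runtime.alloc_bytes", "Value": 123456}
]}
""")

# Keep only the eval-broker gauges.
broker = {g["Name"]: g["Value"] for g in payload["Gauges"]
          if g["Name"].startswith("nomad.nomad.broker.")}
print(broker)
```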

tgross avatar Aug 07 '25 13:08 tgross