
No service registrations found but show in ui

Open chenjpu opened this issue 3 years ago • 1 comments

Nomad version

1.4.1

Issue

service ( nomad service info --namespace=ptl mimir )

No service registrations found

allocs( nomad alloc status -namespace=ptl df6b7ce3 )

ID                  = df6b7ce3-83a8-3e63-ff9f-ce27f84a3e89
Eval ID             = fe8ff5ca
Name                = mimir.prom[0]
Node ID             = addb6b9d
Node Name           = server2
Job ID              = mimir
Job Version         = 28
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 30m23s ago
Modified            = 29m6s ago
Deployment ID       = 5e6034dd
Deployment Health   = healthy

Allocation Addresses:
Label  Dynamic  Address
*http  yes      172.17.22.151:30111 -> 9009

Nomad Service Checks:
Service  Task     Name                    Mode         Status
mimir    (group)  service: "mimir" check  healthiness  success

Task "prom" is "running"
Task Resources:
CPU        Memory          Disk     Addresses
5/500 MHz  64 MiB/500 MiB  300 MiB

CSI Volumes:
ID   Read Only
art  false

Task Events:
Started At     = 2022-10-21T08:12:31Z
Finished At    = N/A
Total Restarts = 1
Last Restart   = 2022-10-21T16:12:30+08:00

Recent Events:
Time                       Type                   Description
2022-10-21T16:12:31+08:00  Started                Task started by client
2022-10-21T16:12:30+08:00  Restarting             Task restarting in 6.156059778s
2022-10-21T16:12:30+08:00  Terminated             Exit Code: 0
2022-10-21T16:12:30+08:00  Restart Signaled       healthcheck: check "service: \"mimir\" check" unhealthy
2022-10-21T16:12:07+08:00  Started                Task started by client
2022-10-21T16:12:06+08:00  Downloading Artifacts  Client is downloading artifacts
2022-10-21T16:12:06+08:00  Task Setup             Building Task Directory
2022-10-21T16:11:53+08:00  Received               Task received by client

UI

[screenshot of the Nomad UI]

chenjpu, Oct 25 '22 00:10

Hi @chenjpu thanks for reporting. Seems like this may have something to do with the task having failed on initial startup and only succeeding after a restart; I'll try and see if I can re-create the problem.

As for the UI, it's looking up the check status of the service (stored on the client), which is different from the service registration itself (stored on the server).

shoenig, Oct 25 '22 13:10

Is there any progress? The dropped service registration is the more serious problem.

chenjpu, Dec 09 '22 05:12

Hi @chenjpu I did spend some time looking into this but so far have not been able to reproduce the problem. Is this bug something you are experiencing regularly? Do you happen to have a way to reproduce it?

shoenig, Dec 09 '22 14:12

I tried removing initial_status = "warning" from the check configuration, and then it went back to normal. I wonder whether the problem is related to this setting?

chenjpu, Dec 10 '22 00:12

Even with this setting removed, when a service fails its health check and then recovers, the registry still does not show the service. :)

Services with provider = "consul" work normally, so for now the only workaround is to switch back to the Consul registry.

chenjpu, Dec 12 '22 00:12

Hi @chenjpu sorry for the slow followup - I was finally able to reproduce this issue using the job below. Still need to find the root cause but it is looking like this happens to services that fail their initial healthcheck until after a task restart.

job "bug" {
  datacenters = ["dc1"]

  group "group" {
    network {
      mode = "host"
      port "http" {
        static = 8888
      }
    }
    
    service {
      provider = "nomad"
      name = "web"
      port = "http"
      check {
        type = "http"
        path = "/hi.txt"
        interval = "3s"
        timeout = "1s"
        check_restart {
          limit = 3
          grace = "5s"
        }
      }
    }

    task "py" {
      driver = "raw_exec"

      config {
        command = "python3"
        args = ["-m", "http.server", "8888", "--directory", "/tmp"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

(Then just touch /tmp/hi.txt after the task is restarted during the deployment)

shoenig, Jan 24 '23 16:01

Looks like this is a logical race condition caused by the group service hook PreTaskRestart method, where we de-register and re-register checks in quick succession. The problem is that check de-registration occurs asynchronously - which races with the re-registration attempt shortly after. Under normal circumstances letting the de-registration happen in the background is fine, because it doesn't impact anything. But in this case we need to know the de-registration is complete before doing the re-registration. Can probably just bubble up a Future to optionally block on.
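
The race described above can be illustrated with a small, self-contained Go sketch. This is not Nomad's actual code: the `registry` type, `deregisterAsync`, and `restart` are hypothetical names invented for illustration, and the 20ms delay is an assumed stand-in for the latency of the background de-registration. It only shows why a re-registration issued right after an asynchronous de-registration can be wiped out, and how blocking on a returned "future" (here, a channel closed on completion) avoids that.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// registry is a hypothetical stand-in for the service registration store.
type registry struct {
	mu       sync.Mutex
	services map[string]bool
}

func newRegistry() *registry {
	return &registry{services: make(map[string]bool)}
}

func (r *registry) register(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.services[name] = true
}

func (r *registry) has(name string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.services[name]
}

// deregisterAsync removes name in the background after an assumed delay.
// The returned channel acts as a future: it is closed once removal is done.
func (r *registry) deregisterAsync(name string, delay time.Duration) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		time.Sleep(delay) // simulated latency of the background removal
		r.mu.Lock()
		delete(r.services, name)
		r.mu.Unlock()
	}()
	return done
}

// restart mimics a de-register followed by a re-register in quick succession.
// When wait is true it blocks on the future before re-registering.
// It reports whether the service is still registered once everything settles.
func restart(r *registry, name string, wait bool) bool {
	done := r.deregisterAsync(name, 20*time.Millisecond)
	if wait {
		<-done // de-registration is complete before we re-register
	}
	r.register(name)
	<-done // let the background removal finish before inspecting the result
	return r.has(name)
}

func main() {
	r := newRegistry()

	r.register("web")
	fmt.Println("fire-and-forget, still registered:", restart(r, "web", false))

	r.register("web")
	fmt.Println("block on future, still registered:", restart(r, "web", true))
}
```

In the fire-and-forget case the late removal lands after the re-registration, so the service ends up missing even though the task and its checks are healthy, which matches the symptom reported in this issue. Making the wait optional keeps the background behavior for callers that do not care about ordering.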

shoenig, Jan 24 '23 19:01