
Services from failed allocations aren't deregistered / keep blinking

Open · valodzka opened this issue 3 years ago · 12 comments

Nomad version

Nomad v1.1.3, Consul v1.10.1

Operating system and Environment details

Debian GNU/Linux 10, 3 servers, all running consul/nomad server and client.

Issue

I'm playing with a simple job to see how Nomad handles failures (using https://github.com/ThomasObenaus/dummy-services/tree/master/fail_service). The cluster contains 3 servers, but the job is constrained to run on one specific server with 1 instance. Nomad handles failures as expected by restarting the job. But I noticed that the services aren't deregistered from Consul and keep accumulating over time, see screenshot:

[screenshot]

Reproduction steps

Run the job provided.

Expected Result

Services should be removed when the allocation is stopped.

Actual Result

The list of services keeps growing as new allocations fail and are rescheduled.

Job file (if appropriate)

job "fail" {
  datacenters = ["dc"]

  constraint {
    attribute = "${attr.unique.consul.name}"
    operator = "regexp"
    value = "^(hjk)$"
  }

  update {
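    # tight deadlines so each deployment completes or fails quickly in this test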
    healthy_deadline = "30s"
    progress_deadline = "40s"
    min_healthy_time = "0s"
  }


  group "fail" {
    count = "1"

    network {
      port "http" {
        to = 8080
      }
    }
    service {
      port = "http"
      check {
        port     = "http"
        type     = "http"
        path     = "/health"
        method   = "GET"
        interval = "10s"
        timeout  = "2s"
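        # restart the task after a single failed health check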
        check_restart {
          limit = 1
        }
      }
    }

    task "fail" {
      driver = "docker"
      config {
        image = "thobe/fail_service:latest"
        ports = ["http"]
      }


      env {
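        # fail_service: report healthy for 60s, then unhealthy indefinitely (-1 = forever)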
        HEALTHY_FOR   = 60
        UNHEALTHY_FOR = -1
      }
    }
  }
}

nomad status output

ID            = fail
Name          = fail
Submit Date   = 2021-08-15T13:30:44Z
Type          = service
Priority      = 50
Datacenters   = dc
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
fail        0       0         0        5       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
fail        a817bf67  5m48s from now

Latest Deployment
ID          = 43272639
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
fail        1        1       1        0          2021-08-15T13:31:48Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created     Modified
3f7d5463  f4ea2330  fail        0        run      failed  2m57s ago   2m12s ago
b16bd973  f4ea2330  fail        0        stop     failed  8m52s ago   2m57s ago
f2f40eb4  f4ea2330  fail        0        stop     failed  11m35s ago  8m52s ago
91f10e18  f4ea2330  fail        0        stop     failed  13m17s ago  11m35s ago
260fa4d0  f4ea2330  fail        0        stop     failed  14m29s ago  13m17s ago

valodzka avatar Aug 15 '21 13:08 valodzka

Thanks for the report @valodzka.

I suspect this may be related to the v1.1.0 check readiness changes. In my reproductions I noticed that the services would not be removed when the allocation failed, but they would eventually get garbage collected.

Could you test this on Nomad v1.0.0 and see if the problem still happens?

lgfa29 avatar Aug 19 '21 14:08 lgfa29

@lgfa29 I tested with Nomad 1.0.6 and saw the same result: services added, failed, removed, added again... Here's a short video of how it looks:

https://user-images.githubusercontent.com/9821/130198419-f3a33f9d-d01c-4ce0-ba73-d4d8c77d7742.mp4

valodzka avatar Aug 20 '21 07:08 valodzka

Hmm... interesting, that's not what I was seeing.

Did you just run the job and wait while it failed a few times?

lgfa29 avatar Aug 20 '21 13:08 lgfa29

@lgfa29

This is how it looks after running for a few hours (I can't record them all because they don't fit on the screen).

https://user-images.githubusercontent.com/9821/130252492-2f7177fb-4ca9-4f99-8d87-0ec636a34e23.mp4

In case it matters, here are the settings for Nomad:

data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"

datacenter = "...."
name = "..."

advertise {
  http = "..."
  rpc = "..."
  serf = "..."
}

server {
  enabled = true
  bootstrap_expect = 3
  server_join {
    start_join = ["...", "..."]
  }
  default_scheduler_config {
  }

  # less frequent gc
  job_gc_interval = "1h"
  job_gc_threshold = "48h"
  eval_gc_threshold = "12h"
  deployment_gc_threshold = "48h"

  heartbeat_grace = "300s"
  min_heartbeat_ttl = "30s"
  max_heartbeats_per_second = 10.0
}

client {
  enabled = true
  servers = ["...:4647", "...:4647", "...:4647"]

  reserved {
    memory = 922
  }
}

plugin "docker" {
  config {
    volumes {
      enabled = true
    }
    allow_caps = ["all"]
  }
}

I'll try running with all defaults to see if it makes any difference.

valodzka avatar Aug 20 '21 15:08 valodzka

Same result with almost-default settings (after ~20 minutes of running):

https://user-images.githubusercontent.com/9821/130258807-de757f55-42f7-4f42-9ec6-14f6c519590b.mp4

data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"

datacenter = "..."
name = "..."

advertise {
  http = "..."
  rpc = "..."
  serf = "..."
}

server {
  enabled = true
  bootstrap_expect = 3
}

client {
  enabled = true
  servers = ["...:4647", "...:4647", "...:4647"]
}

plugin "docker" {
  config {
    volumes {
      enabled = true
    }
    allow_caps = ["all"]
  }
}

valodzka avatar Aug 20 '21 15:08 valodzka

After reading the documentation (https://www.nomadproject.io/docs/job-specification/service) I decided to check whether the placement of the service block makes any difference.

Registration: Nomad will register group services and checks before starting any tasks. Services and checks for a specific task are registered after the task has started.

It turned out it matters: the bug appears only if the service is declared at the group level; Nomad works fine if the service is declared at the task level, as sketched below.
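
For reference, the working variant just moves the service block from the group into the task; a minimal sketch based on the job above:

    task "fail" {
      driver = "docker"
      config {
        image = "thobe/fail_service:latest"
        ports = ["http"]
      }

      # task-level service: registered only after the task has started
      service {
        port = "http"
        check {
          port     = "http"
          type     = "http"
          path     = "/health"
          method   = "GET"
          interval = "10s"
          timeout  = "2s"
          check_restart {
            limit = 1
          }
        }
      }

      env {
        HEALTHY_FOR   = 60
        UNHEALTHY_FOR = -1
      }
    }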

valodzka avatar Feb 04 '22 21:02 valodzka

Thanks for the update @valodzka.

Unfortunately I'm still not able to replicate this problem 🤔

Would you be able to provide some Nomad and Consul logs that may show the service being registered and deregistered?

Thanks!

lgfa29 avatar Feb 07 '22 19:02 lgfa29

@lgfa29 sure, here are logs for both Consul and Nomad: https://gist.github.com/valodzka/7a604ac900841ddd51ffe17f628c4af6

If you need anything more, like debug-level logs, I can provide that too.

Consul UI at the same time: [screenshot]

valodzka avatar Feb 08 '22 09:02 valodzka

Here are Nomad and Consul logs at debug level. I renamed the job/group/task to "debug" and filtered out Consul log lines related to other services.

https://gist.github.com/valodzka/1ce5aa32a033d5ddde28ee40a444cb8c

valodzka avatar Feb 08 '22 10:02 valodzka

Hi @lgfa29, if this bug isn't reproducible on your infrastructure, could you please try it with Vagrant as described below?

I tried to reproduce this bug using Vagrant, with https://github.com/discoposse/nomad-vagrant-lab as the base Vagrant Nomad cluster configuration. It turns out the bug is easily reproducible without any specific configuration (I only updated the Ubuntu/Nomad/Consul versions).

Steps to reproduce:

git clone git@github.com:valodzka/nomad-vagrant-lab.git
cd nomad-vagrant-lab
vagrant up
vagrant ssh nomad-a-1    
bash /vagrant/launch-a-1.sh
# repeat the ssh/launch steps on a-2 and a-3
nomad run test.nomad

Open http://172.16.1.101:8500/ui/toronto/services/fail-fail-failed/instances and wait 10-30 minutes. The results are in the screenshot:

[screenshot]

valodzka avatar Oct 15 '22 16:10 valodzka

Hi @lgfa29,

I understand that you probably have more important tasks to do than this service deregistration bug; could you please simply confirm that you've seen that this issue now has a reproducer, so the notification isn't simply lost?

valodzka avatar Oct 24 '22 13:10 valodzka

Hi @valodzka, apologies for the delay here. I have not had the chance to try the reproduction yet, but thank you so much for the extra work you've put into it.

I'm aiming to get back into this next week, so I will hopefully have more information for you then.

lgfa29 avatar Oct 24 '22 23:10 lgfa29

Looks like this was finally fixed in c56a4bc5e6022e8aadf10ebb93170f90fecffc43. Thanks to the folks from Yandex (#15536).

valodzka avatar Jan 20 '23 15:01 valodzka

Glad to hear it. Thanks for the update @valodzka!

tgross avatar Jan 20 '23 15:01 tgross