Services from failed allocations aren't deregistered / keep blinking
Nomad version
Nomad v1.1.3, Consul v1.10.1
Operating system and Environment details
Debian GNU/Linux 10, 3 servers, all running consul/nomad server and client.
Issue
I'm playing with a simple job to see how Nomad handles failures (using https://github.com/ThomasObenaus/dummy-services/tree/master/fail_service). The cluster contains 3 servers, but the job is constrained to run on one specific server with 1 instance. Nomad handles failures as expected by restarting the job, but I noticed that the services in Consul aren't deregistered and their number grows over time; see screenshot:
Reproduction steps
Run job provided.
Expected Result
Services should be removed when the allocation is stopped.
Actual Result
The list of services keeps growing as new allocations fail and are rescheduled.
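For reference, the accumulation can also be seen without the Consul UI. A minimal sketch, assuming a local Consul agent on 127.0.0.1:8500 and that the group service keeps its default fail-fail name (both are assumptions, adjust for your setup):

# each failed allocation shows up as another registered instance of the same service
curl -s http://127.0.0.1:8500/v1/catalog/service/fail-fail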
Job file (if appropriate)
job "fail" {
datacenters = ["dc"]
constraint {
attribute = "${attr.unique.consul.name}"
operator = "regexp"
value = "^(hjk)$"
}
update {
healthy_deadline = "30s"
progress_deadline = "40s"
min_healthy_time = "0s"
}
group "fail" {
count = "1"
network {
port "http" {
to = 8080
}
}
service {
port = "http"
check {
port = "http"
type = "http"
path = "/health"
method = "GET"
interval = "10s"
timeout = "2s"
check_restart {
limit = 1
}
}
}
task "fail" {
driver = "docker"
config {
image = "thobe/fail_service:latest"
ports = ["http"]
}
env {
HEALTHY_FOR = 60
UNHEALTHY_FOR = -1
}
}
}
}
nomad status output
ID = fail
Name = fail
Submit Date = 2021-08-15T13:30:44Z
Type = service
Priority = 50
Datacenters = dc
Namespace = default
Status = pending
Periodic = false
Parameterized = false
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
fail        0       0         0        5       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
fail        a817bf67  5m48s from now

Latest Deployment
ID          = 43272639
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
fail        1        1       1        0          2021-08-15T13:31:48Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created     Modified
3f7d5463  f4ea2330  fail        0        run      failed  2m57s ago   2m12s ago
b16bd973  f4ea2330  fail        0        stop     failed  8m52s ago   2m57s ago
f2f40eb4  f4ea2330  fail        0        stop     failed  11m35s ago  8m52s ago
91f10e18  f4ea2330  fail        0        stop     failed  13m17s ago  11m35s ago
260fa4d0  f4ea2330  fail        0        stop     failed  14m29s ago  13m17s ago
Thanks for the report @valodzka.
I suspect this may be related to the v1.1.0 check readiness changes. In my reproductions I noticed that the services would not be removed when the allocation failed, but they would eventually get garbage collected.
Could you test this on Nomad v1.0.0 and see if the problem still happens?
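If it's easier to check from the command line, something like the following should show whether the stale instances only disappear when a garbage collection runs. This is a rough sketch; it assumes the default local Consul address and the default fail-fail service name:

# force a server-side garbage collection of dead allocations/evaluations/jobs
nomad system gc

# then check whether the old instances are still registered in Consul
curl -s http://127.0.0.1:8500/v1/catalog/service/fail-fail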
@lgfa29 I tested with Nomad 1.0.6 and saw the same result: services added, failed, removed, added again... Here is a short video of how it looks:
https://user-images.githubusercontent.com/9821/130198419-f3a33f9d-d01c-4ce0-ba73-d4d8c77d7742.mp4
Hmm... interesting, that's not what I was seeing.
Did you just run the job and wait while it failed a few times?
@lgfa29
This is how it looks after running for a few hours (I can't record all of them because they don't fit on the screen).
https://user-images.githubusercontent.com/9821/130252492-2f7177fb-4ca9-4f99-8d87-0ec636a34e23.mp4
In case it matters, here are the settings for Nomad:
data_dir   = "/opt/nomad/data"
bind_addr  = "0.0.0.0"
datacenter = "...."
name       = "..."

advertise {
  http = "..."
  rpc  = "..."
  serf = "..."
}

server {
  enabled          = true
  bootstrap_expect = 3

  server_join {
    start_join = ["...", "..."]
  }

  default_scheduler_config {
  }

  # less frequent gc
  job_gc_interval           = "1h"
  job_gc_threshold          = "48h"
  eval_gc_threshold         = "12h"
  deployment_gc_threshold   = "48h"
  heartbeat_grace           = "300s"
  min_heartbeat_ttl         = "30s"
  max_heartbeats_per_second = 10.0
}

client {
  enabled = true
  servers = ["...:4647", "...:4647", "...:4647"]

  reserved {
    memory = 922
  }
}

plugin "docker" {
  config {
    volumes {
      enabled = true
    }
    allow_caps = ["all"]
  }
}
I'll try to run with all defaults to see if it will make any difference.
Same result with almost default settings (after running for ~20 minutes):
https://user-images.githubusercontent.com/9821/130258807-de757f55-42f7-4f42-9ec6-14f6c519590b.mp4
data_dir   = "/opt/nomad/data"
bind_addr  = "0.0.0.0"
datacenter = "..."
name       = "..."

advertise {
  http = "..."
  rpc  = "..."
  serf = "..."
}

server {
  enabled          = true
  bootstrap_expect = 3
}

client {
  enabled = true
  servers = ["...:4647", "...:4647", "...:4647"]
}

plugin "docker" {
  config {
    volumes {
      enabled = true
    }
    allow_caps = ["all"]
  }
}
After reading the documentation https://www.nomadproject.io/docs/job-specification/service I decided to check whether the placement of the service block makes any difference.
Registration: Nomad will register group services and checks before starting any tasks. Services and checks for a specific task are registered after the task has started.
It turned out it matters: the bug appears only if the service is declared at the group level; Nomad works fine if the service is declared at the task level.
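For reference, this is roughly what the working variant looks like, with the service block moved from the group into the task stanza (a sketch derived from the job above, not the exact file I ran):

task "fail" {
  driver = "docker"

  config {
    image = "thobe/fail_service:latest"
    ports = ["http"]
  }

  env {
    HEALTHY_FOR   = 60
    UNHEALTHY_FOR = -1
  }

  # service declared at the task level instead of the group level;
  # with this variant the stale registrations do not accumulate
  service {
    port = "http"

    check {
      port     = "http"
      type     = "http"
      path     = "/health"
      method   = "GET"
      interval = "10s"
      timeout  = "2s"

      check_restart {
        limit = 1
      }
    }
  }
}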
Thanks for the update @valodzka.
Unfortunately I'm still not able to replicate this problem 🤔
Would you be able to provide some Nomad and Consul logs that may show the service being registered and deregistered?
Thanks!
@lgfa29 sure, here are logs for both Consul and Nomad: https://gist.github.com/valodzka/7a604ac900841ddd51ffe17f628c4af6
If you need anything more, like debug-level logs, I can provide that too.
Consul UI at the same time:
Here are Nomad and Consul logs at debug level. I renamed the job/group/task to debug and filtered out the Consul log lines related to other services.
https://gist.github.com/valodzka/1ce5aa32a033d5ddde28ee40a444cb8c
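In case it's useful, debug-level logs can be captured from the running agents with the standard monitor commands (just one way to do it, not necessarily how the logs above were collected):

nomad monitor -log-level=DEBUG > nomad-debug.log
consul monitor -log-level=debug > consul-debug.log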
Hi @lgfa29, if this bug isn't reproducible on your infrastructure, could you please try with Vagrant as described below?
I tried to reproduce this bug using Vagrant, with https://github.com/discoposse/nomad-vagrant-lab as the base Nomad cluster configuration. It turns out it's easily reproducible without any specific configuration (I only updated the Ubuntu/Nomad/Consul versions).
Steps to reproduce:
git clone git@github.com:valodzka/nomad-vagrant-lab.git
cd nomad-vagrant-lab
vagrant up
vagrant ssh nomad-a-1
bash /vagrant/launch-a-1.sh
# repeat ssh/launch for a-2, a-3
nomad run test.nomad
Open http://172.16.1.101:8500/ui/toronto/services/fail-fail-failed/instances and wait 10-30 minutes. Results are in the screenshot:
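If the UI is inconvenient, the same growth can be watched from the shell. The datacenter and service name below come from the link above; everything else is an assumption about the Vagrant setup:

# count the registered instances every 30 seconds; the number keeps increasing
watch -n 30 'curl -s "http://172.16.1.101:8500/v1/catalog/service/fail-fail-failed?dc=toronto" | grep -o ServiceID | wc -l'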
Hi @lgfa29,
I understand that you probably have more important tasks to do than this service deregistration bug; please just confirm that you've seen that this issue now has a reproducer and that the notification wasn't simply lost.
Hi @valodzka, apologies for the delay here. I have not had the chance to try the reproduction yet, but thank you so much for the extra work you've put into it.
I'm aiming to get back into this next week, so I will hopefully have more information for you then.
Looks like this was finally fixed in c56a4bc5e6022e8aadf10ebb93170f90fecffc43. Thanks to the folks from Yandex, #15536.
Glad to hear it. Thanks for the update @valodzka!