consul-template randomly gets stuck

Open • white-hat opened this issue 7 years ago • 7 comments

Consul Template version

consul-template v0.19.4 (68b1da2)

Configuration

consul {
  retry {
    enabled = true
    attempts = 10
    backoff = "250ms"
    max_backoff = "30s"
  }
}
reload_signal = "SIGHUP"
kill_signal = "SIGINT"
max_stale = "10m"
log_level = "warn"


### ipset-cloud.ctpl
template {
  source = "/etc/consul-templater/ipset_cloud.ctpl"
  destination = "/output/ipset/cloud.conf"
  error_on_missing_key = true
  wait {
    min = "5s"
    max = "10s"
  }
}

Template

create temp_ips_allowed_cloud hash:ip
{{ range nodes }}add temp_ips_allowed_cloud {{ .Address }}
{{ end }}
swap temp_ips_allowed_cloud ips_allowed_cloud
destroy temp_ips_allowed_cloud

Command

consul-template -consul-addr=127.0.0.1:8500 -config=consul-template.conf

Debug output

Not available: the hang only appears after 2-3 weeks of uptime, so it cannot be reproduced on demand.

Expected behavior

consul-template keeps re-rendering the template and does not get stuck.

Actual behavior

The file /output/ipset/cloud.conf should be updated at least once a minute. On some hosts, after 1-2 weeks of uptime, the file stops being updated.

Steps to reproduce

  1. Run consul-template with the given configuration.
  2. Wait several days.
  3. Run stat /output/ipset/cloud.conf periodically and check the modification time until it stops advancing.
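
Step 3 can be automated with a small staleness check. This is a minimal sketch, not part of the original report: the function name `is_stale` and the threshold used in the note below are assumptions, based only on the statement that the file should change at least once a minute.

```python
import os
import time


def is_stale(path, max_age_seconds, now=None):
    """Return True if `path` was last modified more than `max_age_seconds` ago.

    `now` (seconds since the epoch) is injectable so the check can be
    exercised without waiting; it defaults to the current time.
    """
    if now is None:
        now = time.time()
    # st_mtime is the same modification time that `stat` reports.
    return (now - os.stat(path).st_mtime) > max_age_seconds
```

For example, `is_stale("/output/ipset/cloud.conf", 120)` run from cron or a monitoring agent would flag a host where the file has not changed for two minutes. The 120-second value is an arbitrary choice and should be tuned to how often the node list actually changes.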

References

did not find any

white-hat avatar Feb 16 '18 00:02 white-hat

Following. I believe I have seen this behavior in our cluster but did not get a chance to investigate (strace, etc.). Should I find anything, I will keep you posted.

As a workaround, our process supervisor (perp) runs consul-template as timeout -s SIGTERM $random_interval consul-template options...., so we are basically restarting it once or twice a day. It is a hack, but it gets the job done until the core problem is found.
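
For supervisors without the `timeout` utility, the same periodic-restart workaround can be sketched in Python. This is an illustration only: `run_with_deadline`, `random_interval`, the 12-24 hour range, and the kill fallback are all assumptions and are not what the commenter runs.

```python
import random
import signal
import subprocess


def run_with_deadline(argv, deadline_seconds, grace_seconds=10.0):
    """Run `argv`, send SIGTERM after `deadline_seconds`, return the exit code.

    Similar in spirit to `timeout -s SIGTERM <interval> <command>`, with an
    added fallback: if the child ignores SIGTERM for `grace_seconds`, it is
    killed outright.
    """
    proc = subprocess.Popen(argv)
    try:
        return proc.wait(timeout=deadline_seconds)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGTERM)
        try:
            return proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            return proc.wait()


def random_interval(low_hours=12.0, high_hours=24.0, rng=random):
    """Pick a restart interval in seconds (the hour range is an assumption)."""
    return rng.uniform(low_hours, high_hours) * 3600.0
```

A supervisor would call `run_with_deadline([...consul-template command...], random_interval())` in a loop. Randomizing the interval keeps many hosts from restarting at the same moment.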

vaLski avatar May 18 '18 08:05 vaLski

@white-hat, thanks for opening this issue and apologies for the delayed reply.

I looked into the issue and was unfortunately unable to reproduce it. Here's what I did:

  • Two weeks ago, I started up a consul cluster with two nodes. I also started running consul-template with the configuration and template you included
  • Each day, I removed or added a consul server agent in the cluster. I confirmed that the template updated accurately and ran stat to verify the modification date is correct
  • After 15 days, I saw that the template and stat were updating correctly on changes to the consul cluster

I'll mark this issue as irreproducible for now. If you have any feedback on my reproducing steps please let me know. If you or anyone has discovered any additional details on how to replicate this issue more consistently or any relevant errors in the logs, please feel free to share.

Thanks!

lornasong avatar May 05 '20 18:05 lornasong

I've been scouring the issue tracker for anything resembling my current issue, and this one seems to align pretty well.

I have a running haproxy nomad job, with its configuration managed by consul-template via a template stanza in the nomad job specification.

The template contains several {{range service "..."}} loops over a handful of different service types. One of these seems to be "stuck". When changes are made, the file is re-rendered, but this particular block's output does not reflect the service list reported by consul - whether adding or removing services.

I have tried running consul-template manually with the same template, and I get the correct output. I also have a second nomad client node where the same job is running, and that one is updating the configuration correctly.
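
One way to confirm that a rendered file has drifted from Consul is to compare it against the health endpoint directly. The sketch below assumes the standard Consul HTTP endpoint `/v1/health/service/<name>?passing` on the local agent, and assumes each backend appears in the rendered file as a whitespace-delimited `address:port` token. The real haproxy template is not shown in this thread, so that format and all function names here are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def fetch_passing(service, consul_addr="http://127.0.0.1:8500"):
    """Fetch the passing instances of `service` from the local Consul agent."""
    url = "{}/v1/health/service/{}?passing".format(
        consul_addr, urllib.parse.quote(service))
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def backend_addresses(entries):
    """Turn health entries into a set of "address:port" strings.

    Falls back to the node address when the service address is empty.
    """
    result = set()
    for entry in entries:
        svc = entry["Service"]
        addr = svc.get("Address") or entry["Node"]["Address"]
        result.add("{}:{}".format(addr, svc["Port"]))
    return result


def missing_from_config(entries, rendered_text):
    """Return backends Consul reports that are absent from the rendered file."""
    # Assumption: each backend is a standalone whitespace-delimited token.
    tokens = set(rendered_text.split())
    return sorted(backend_addresses(entries) - tokens)
```

Calling `missing_from_config(fetch_passing("foo"), open(path).read())` on the affected node would list instances that Consul reports but the stuck block never rendered. Here `"foo"` and `path` stand in for the real service name and rendered file.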

This is currently a minor production issue for me, and I'm assuming (really, hoping) that a simple restart will fix it. But I'd also like to understand what has gone wrong here, so if there is any debug information I can provide before I hit the restart button some time today, do let me know.

Edit: I'm using Nomad v1.2.6, which I believe is bundling consul-template v0.25.2.

mwild1 avatar Jul 05 '22 13:07 mwild1

I haven't been able to get a useful strace (I have a gazillion nomad processes on the system, and the main agent process seems to be blocked on futex()), so any suggestions are welcome (but I'm probably going to restart the affected service in a moment).

However, I determined that log output from consul-template is merged into nomad's logs.

Observation 1: When I register/deregister a new instance of the affected service type, no consul-template output is emitted from nomad, even at trace level.

Observation 2: When one of the other services has changes in consul (one of the other {{range service "..."}} blocks), template rendering is triggered. Then I see a bunch of lines from nomad/consul-template like:

[DEBUG] agent: (runner) health.service(service-name|passing) is still needed

The "stuck" service is included in this output (i.e. "is still needed").

Observation 3: When the template is re-rendered due to one of the unaffected service blocks, I see output like:

2022-07-05T15:59:46.630Z [TRACE] agent: health.service(working-service|passing): returned 155 results
2022-07-05T15:59:46.630Z [TRACE] agent: health.service(working-service|passing): returned 155 results after filtering

I see these lines for all the other services, but no such lines are emitted about the service that is failing to update.


In summary: the affected {{range service "foo"}} block does not trigger a template render when changes occur in consul, and does not fetch updated results when a render is triggered by another dependency.
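
The pattern in observations 2 and 3 can be checked mechanically: a dependency that is reported as "still needed" but never logs a "returned N results" line is a candidate for being stuck. The sketch below relies only on the two log formats quoted above. The function name is hypothetical, and the result is only meaningful for a log window in which at least one render was triggered.

```python
import re

# Formats taken from the DEBUG and TRACE lines quoted in this comment.
STILL_NEEDED = re.compile(r"\(runner\) (\S+) is still needed")
RETURNED = re.compile(r"agent: (\S+): returned \d+ results")


def silent_dependencies(log_lines):
    """Return dependencies that are still needed but never returned results."""
    needed, fetched = set(), set()
    for line in log_lines:
        match = STILL_NEEDED.search(line)
        if match:
            needed.add(match.group(1))
        match = RETURNED.search(line)
        if match:
            fetched.add(match.group(1))
    return sorted(needed - fetched)
```

Feeding it the agent log from a period covering a render should return something like `health.service(foo|passing)` for the affected block and nothing for the working ones.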

mwild1 avatar Jul 05 '22 16:07 mwild1

Thanks for the new information @mwild1. Much appreciated.

I've removed the unreproducible tag and will add this issue to an upcoming milestone to see if I can reproduce it. I'm currently working on Envconsul (I rotate between this, envconsul and consul-esm) but should get back here at some point fairly soon. If you have any more info or find an easy way to replicate this with just consul-template, that would be a big help.

You might also consider filing this issue with the Nomad team. They might be able to help, as they have a better understanding of how Nomad uses consul-template, where I don't have a lot of experience.

Thanks!

eikenb avatar Jul 07 '22 23:07 eikenb

This issue appears to have reared its head again for me. I don't know why. It has been so long since it last occurred that I had completely forgotten about it and spent half of today re-debugging it.

I'm now running Nomad 1.4.13 (and whatever consul-template version comes with that), but otherwise not much has changed since last time. As noted in my last comment, just one of the service types appears to be "stuck" and rendering stale data.

mwild1 avatar Nov 20 '23 18:11 mwild1

For reference purposes, I'm cross-referencing this closed/locked issue in the nomad repo, which appears to be the same issue but was never solved: https://github.com/hashicorp/nomad/issues/11558

I'm not going to open a new issue for nomad at this time, because I'm not on the latest stable version (which also means I'm behind on consul-template too). I'll endeavour to update to the latest versions and see if this issue occurs again in another 12 months :)

In the meantime, any suggestions on how/what to debug if it does happen again would be very welcome.

mwild1 avatar Nov 20 '23 18:11 mwild1