One template renderer stops working while others continue normally
Consul Template version
$ ./consul-template -v
consul-template v0.19.3 (ebf2d3d)
Config
$ cat /etc/consul_template.d/prometheus.consul_template.conf
template {
source = "/opt/app/prometheus/var/prometheus-rules.yml"
destination = "/opt/app/prometheus/var/prometheus-rules.yml.consul"
command = "svc -h /opt/service/prometheus"
left_delimiter = "[["
right_delimiter = "]]"
}
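For context, the custom `[[` / `]]` delimiters are set because Prometheus rule files themselves use `{{ }}` in alert annotations, so consul-template must not try to interpret those. A hypothetical source snippet (not our actual rules file) illustrating the split:

```yaml
groups:
  - name: example
    rules:
      - alert: ServiceDown
        expr: up == 0
        annotations:
          # consul-template resolves [[ ... ]] at render time;
          # Prometheus keeps {{ ... }} for its own templating.
          summary: "{{ $labels.instance }} is down in [[ key \"env/datacenter\" ]]"
```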
Command
./consul-template -log-level info -kill-signal SIGTERM -consul-addr 127.0.0.1:8500 -consul-retry -consul-retry-attempts 0 -consul-retry-backoff 1s -consul-retry-max-backoff 10s -max-stale 2m -config /etc/consul_template.d -template last_updated.prom:/var/prometheus/metrics/consul_template_last_updated.prom
Debug output
We have not yet captured this failure with debug/trace logging enabled. The info logs below show the regular canary run (normally every 5 minutes) stop firing, while a later key update still triggered the other renderer as expected:
2019/04/01 13:32:33.676276 [WARN] (runner) watching 370 dependencies - watching this many dependencies could DDoS your consul cluster
2019/04/01 13:37:37.669226 [INFO] (runner) initiating run
2019/04/01 13:37:37.690627 [INFO] (runner) rendered "last_updated.prom" => "/var/prometheus/metrics/consul_template_last_updated.prom"
2019/04/01 13:37:37.691205 [WARN] (runner) watching 370 dependencies - watching this many dependencies could DDoS your consul cluster
2019/04/01 13:42:46.963737 [INFO] (runner) initiating run
2019/04/01 13:42:46.998530 [INFO] (runner) rendered "last_updated.prom" => "/var/prometheus/metrics/consul_template_last_updated.prom"
2019/04/01 13:42:46.999119 [WARN] (runner) watching 370 dependencies - watching this many dependencies could DDoS your consul cluster
2019/04/01 16:27:58.180901 [INFO] (runner) initiating run
2019/04/01 16:27:58.191640 [WARN] (runner) watching 370 dependencies - watching this many dependencies could DDoS your consul cluster
2019/04/01 16:27:58.274394 [INFO] (runner) initiating run
2019/04/01 16:27:58.288077 [INFO] (runner) rendered "/opt/app/prometheus/var/prometheus-rules.yml" => "/opt/app/prometheus/var/prometheus-rules.yml.consul"
2019/04/01 16:27:58.289028 [INFO] (runner) executing command "svc -h /opt/service/prometheus" from "/opt/app/prometheus/var/prometheus-rules.yml" => "/opt/app/prometheus/var/prometheus-rules.yml.consul"
2019/04/01 16:27:58.289102 [INFO] (child) spawning: svc -h /opt/service/prometheus
Expected behavior
What should have happened? In lieu of a health-check endpoint on the process itself, we set up a canary template that pulls a timestamp from the Consul KV store and renders it to a file that Prometheus scrapes, so we can verify that the consul-template service is still rendering as expected:
$ cat last_updated.prom
# HELP consul_template_last_timestamp_seconds The last timestamp pulled from consul kv by consul_template.
# TYPE consul_template_last_timestamp_seconds gauge
consul_template_last_timestamp_seconds {{ key "service-overwatch/timestamp" }}
We expected that if one renderer (i.e. the timestamp canary) faulted or stopped rendering, it would indicate that the process as a whole had failed.
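A staleness alert over this metric (a sketch; the rule name and 10-minute threshold are illustrative, not production values) could look like:

```yaml
groups:
  - name: consul_template_canary
    rules:
      - alert: ConsulTemplateCanaryStale
        # Fires when the rendered timestamp stops advancing;
        # the canary normally updates every 5 minutes.
        expr: time() - consul_template_last_timestamp_seconds > 600
        for: 5m
```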
Actual behavior
What actually happened?
This weekend, the consul-template process remained up and continued to render other templates (i.e. /opt/app/prometheus/var/prometheus-rules.yml) and produce log messages, but it no longer ran the timestamp canary template.
Steps to reproduce
While we have not yet found a deterministic way to reproduce this scenario, our concern is that our approach to monitoring the consul-template service is not as effective as we thought.
References
- #570
Did this ever work on older versions of consul-template? Which ones? Try both versions 0.19.5 and 0.20.0 since they're the most recent releases.
Hey @jinksunk, thanks for taking the time to report this.
Without a way to reproduce it, I'm not sure I'll be able to do much about this other than keep an eye out for it. I think the eventual solution will be adding service monitoring for consul-template, as discussed in your referenced ticket (#570).
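In the meantime, an external check on the rendered file's modification time can serve as a stopgap (a sketch; the path and threshold below are examples, not something from this issue):

```python
#!/usr/bin/env python3
"""Stopgap watchdog sketch: treat consul-template as unhealthy when the
rendered canary file stops being updated. Path and threshold are examples."""
import os
import time


def is_stale(path, max_age_seconds, now=None):
    """Return True if `path` is missing or its mtime is older than `max_age_seconds`."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # a missing canary file also counts as a failure
    return (now - mtime) > max_age_seconds


# Example: run from cron and page if the canary is more than 10 minutes old.
# if is_stale("/var/prometheus/metrics/consul_template_last_updated.prom", 600):
#     raise SystemExit("consul-template canary is stale")
```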
I have seen this too. Also with no deterministic way to reproduce. :-(
Thanks for chiming in @liebman, if you ever get more information on this please let us know.
I'll continue to keep an eye out for what might be causing this and will update here if I find anything.