Monitoring consul-template
Is there a way to have a status/health endpoint on consul-template for monitoring that everything is OK and that it can connect to Consul?
Hi @rhoml
This is not currently possible, but it's definitely something worth considering. Originally I thought a simple HTTP server with a status endpoint would be useful, but my fear is that many users run multiple instances of consul-template on a single machine, and that could cause port collisions, etc. I'm thinking, instead, of having CT respond to one of the user-defined signals (USR1 perhaps) and return a result that way, but I don't think it's possible to return a result to a signal call.
/cc @slackpad
The ability to see the status of a consul-template instance via Consul would be interesting. How about having consul-template register as a "local service" (doesn't exist yet, but a service without a DNS entry) and have it post a status via a TTL check. Would that work for you?
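For illustration, a rough sketch of that pattern using Consul's existing agent HTTP API (the service name, TTL, and agent address are placeholders; the "local" flag discussed below doesn't exist yet, so this registration would still show up in DNS):

# Register consul-template as a service with a TTL check:
curl -s -X PUT http://127.0.0.1:8500/v1/agent/service/register \
  -d '{"Name": "consul-template", "Check": {"TTL": "30s"}}'
# consul-template would then heartbeat within the TTL to keep the check passing:
curl -s -X PUT http://127.0.0.1:8500/v1/agent/check/pass/service:consul-template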
@sean- I like this! I think it would be a good pattern to establish for all "local" consul tooling (CT, envconsul, consul-replicate, etc). What do we need to do in Consul to make that possible?
We'd need a "local" flag for service registrations in Consul. Services with the local flag wouldn't have an address and could not be looked up via DNS (but would show up via the HTTP API). But it solves the ACL issue and provides an endpoint through which we can register checks and post status.
We could use a magic address like 0.0.0.0 to specify this type, but I'm not wild about conflating addresses with reg types and think registration needs a flag. We'll talk it up tomorrow and see what's involved.
I like the idea of consul-template reporting health via TTL, and also being tagged local. Monitoring a service without an address definitely is valuable.
But I think many of the use-cases for local do require an address. I have services that are local to the Consul agent, and only listen on one address. Not all my services support multiple listeners, and sometimes are purposefully segregated by address/interface. I would still want to be able to filter these services as 'local'.
It seems like it should be easy to attach a consul-template TTL check as one of the checks for whatever service consul-template is managing, not necessarily as a separate service of its own. If consul-template dies then your instance of the service is suspect because it's no longer getting configured properly. With that it will be clear what's affected vs. just knowing that one of the consul-template instances is down.
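A sketch of that variant, using Consul's check registration API (the "nginx" ServiceID is just a placeholder for whatever service consul-template is configuring):

# Attach a TTL check for consul-template to the managed service:
curl -s -X PUT http://127.0.0.1:8500/v1/agent/check/register \
  -d '{"Name": "consul-template", "TTL": "30s", "ServiceID": "nginx"}'
# If consul-template stops heartbeating, the managed service goes critical:
curl -s -X PUT http://127.0.0.1:8500/v1/agent/check/pass/consul-template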
Talking to @sean- offline I'm coming around to some of the earlier suggestions. Perhaps a local service can have a pid defined and no address/port, which would keep it out of DNS. Tools like consul-template could register under the consul-template service name and perhaps register some extra details like the command line, so operators could figure out which instance it was and what it was doing.
@doublerebel I don't think I fully understand your use case for a local service that still has an address/port. Are you thinking along the lines of https://github.com/hashicorp/consul/pull/1231#issuecomment-142059460 where you want to find the instances of a given service running locally on the box with a particular agent?
@slackpad it would be nice to have this kind of service registration in Consul as part of all the Vault, CT, and envconsul services
@jippi totally, that would be neat for monitoring all core services
@slackpad thanks for the consideration. I have run into issues where consul-template dies and a long-running service doesn't discover it until long after, when it finally restarts. Then the cause (of dead consul-template) is difficult to correlate with the effect (the service in a bad state). Especially when the service (without consul-template) goes back to a default value, so it's almost-but-not-quite right.
Re: local, you're correct in referencing consul#1231; it's just a question of how Consul defines "local". Perhaps services without an address could be called "internal" to differentiate them from "local"? I.e., I am implementing the Vault cubbyhole method, which requires my co-process to be able to find "local" services that may or may not be "internal". But now I fear I'm derailing this issue into the local topic.
When running consul-template as a core service in a cluster (e.g., not on Nomad, but as a service that is available irrespective of Nomad's status), it's difficult to properly register consul-template as a service and ensure the health checks are correct. It would be very helpful if consul-template were to register itself as a service in the Consul catalog.
+1 for this! A simple HTTP endpoint would be enough in our case (we run everything in separate containers).
+1 for this! It would be great to register consul-template as "local/internal" service with health check on consul. @sethvargo: is there any plan to implement this idea?
Originally I thought a simple HTTP server with a status endpoint would be useful, but my fear is that many users run multiple instances of consul template on a single machine, and that could cause port collisions, etc.
@sethvargo what about making that endpoint opt-in, such as
consul-template -health=enabled -health.port=8080
I believe it would solve the problem for some people.
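If flags like these existed (they're only a suggestion here), monitoring would be a simple poll:

# Hypothetical endpoint; -f makes curl exit non-zero on an HTTP error status
curl -sf http://127.0.0.1:8080/health || echo "consul-template unhealthy" >&2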
In other similar situations, having something like a "checkpoint" status file has always been good enough for us, and maybe it is simpler to implement than a full-fledged HTTP server/endpoint.
I wouldn't keep constantly updating the destination file's timestamp, because it could cause nasty side effects with some software consuming that file, but I'd add a configuration key that accepts a file path and keeps touching it (updating its modification time) at regular intervals to signal "consul-template is working correctly and we are sure that the destination file is up to date with what was in Consul at this time".
Bonus points if it allows the "checkpoint file" to be the same as the output file, so people can choose between leaving the output file mtime unmodified and track status with a different file, or have everything in one file that keeps getting its mtime updated.
Common monitoring systems have the ability to check file "freshness", usually out of the box (e.g. check_file_age), and it is also really easy to check within shell scripts, either for a "max age" (e.g. find -mmin) or by comparison with other files (e.g. if [ checkpoint_file -ot some_reference_file ]).
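A minimal sketch of that "max age" check, assuming a hypothetical checkpoint file at /var/run/consul-template.checkpoint:

# Stale if the checkpoint was not touched in the last 5 minutes;
# find only prints the path when the mtime is newer than that.
if [ -z "$(find /var/run/consul-template.checkpoint -mmin -5 2>/dev/null)" ]; then
  echo "consul-template checkpoint is stale" >&2
  exit 2
fi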
I would love a good way for containerpilot to monitor consul-template's health.
For now I'm just using pgrep; this way I can chain jobs together via once-healthy. All it does is verify there is a process with the name consul-template running.
{
  "name": "consul-template",
  "exec": [
    "consul-template",
    "-config",
    "/app.hcl"
  ],
  "when": {
    "source": "consul-agent",
    "once": "healthy"
  },
  "health": {
    "exec": [
      "/bin/sh", "-c",
      "test \"$(pgrep consul-template | wc -l)\" -eq 1"
    ],
    "interval": 15,
    "ttl": 25,
    "timeout": "1s"
  },
  "restarts": "unlimited"
}
This issue is pretty old, what is the current best practice for monitoring consul-template?
Hey @drawks,
You might want to consider asking this on HashiCorp's Discuss forum; more community members would probably see it there and be able to relay their solutions.
I think the answer might just be that consul-template is designed to exit if anything bad enough to trigger a failed health check happens (or at least that's the idea), so the normal process-management setups you get from systemd, etc. keep it running without needing an external health monitor. That, plus a monitor on the process consul-template is managing (which you'd need anyway), is probably enough for most cases. Though you probably want to take this with a grain of salt: I'm the maintainer, but I don't actively use consul-template in the field at the moment and can only base my answers on past experiences and what I hear from everyone else.
Thanks.
While I don't use consul-template anymore, at some point we had Prometheus's node_exporter monitoring the systemd unit for it, and had this alarm defined in Prometheus:
avg_over_time(node_systemd_unit_state{name="consul-template.service",state="active"}[5m]) < 1
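Note that node_exporter's systemd collector is disabled by default, so it has to be started with the --collector.systemd flag for the node_systemd_unit_state metric to exist.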