consul-esm
non tcp-http checks: script checks
Is it possible to run custom health checks with shell scripts? If not, why?
Yes, it's possible to define custom script checks with -enable-local-script-checks. Please refer to the following docs:
- https://www.consul.io/docs/agent/checks.html (Script + Interval section)
- https://www.consul.io/docs/agent/options.html#_enable_local_script_checks
- https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations
Lastly, issues on GitHub for Consul are intended to be related to bugs or feature requests. A question like this would be a better fit for the HashiCorp forum: https://discuss.hashicorp.com/c/consul
@luddite478, my apologies, my response was geared towards Consul not Consul-ESM.
It is currently not possible to run script checks with Consul-ESM. The main reason is that HTTP/TCP checks have satisfied the use-cases we have seen so far.
We are open to a PR for the change; we just have not seen interest in it before this issue.
I'd love for alias service checks to be supported, so that failures cascade to dependent external services. For example, suppose you externally monitor a third-party service (say GitHub) and your own externally monitored service that depends on it. It would be great to be able to alias them so that when the GitHub check fails, your own service is marked as failing too.
This works fairly well in the Consul agent config, but it seems to be stored only in the agent config rather than in the catalog, and the definition in /v1/health/checks/:service is empty after agent startup. If you try to add an external check to the catalog via the API, in line with the check definition documentation, the definition in /v1/health/checks/:service remains empty.
Thus, I don't think consul-esm would be able to monitor and update aliased service checks anyway without features being added to consul proper, is that correct?
I think, as @luddite478 said, this is interesting, for example, for monitoring an external service such as a cloud-managed database (DocumentDB, RDS, ...) or similar, where a custom script could run queries or other operations against the service.
BR
Hi @scottaubrey - apologies for the delayed reply. Thanks for your interest in an alias checks enhancement. Would you be able to open a new ticket for this enhancement so that we capture and gauge interest in the feature?
If you try to add an external check to the catalog via the API, in line with the check definition documentation, the definition in /v1/health/checks/:service remains empty.
The Service Check API /v1/health/checks/:service queries the checks table by service. To clarify, it looks at the value in Check.ServiceID and not Service.ID in the example snippet of a Catalog Register payload:
…
"Service": {
  "ID": "web1"
}
…
"Check": {
  "Node": "foo",
  "Name": "service:web1",
  "ServiceID": "web1",
  "Status": "critical",
  "Definition": {
    "HTTP": "http://localhost:8080/health",
    "Interval": "10s",
    "Timeout": "10s",
    "Method": "GET"
  }
}
…
In order for the Service Check API to return that check in its response, Check.ServiceID must have a value. If it doesn't have a value, then that check is considered a node-level health check instead of a service-level one. If you did not include Check.ServiceID, that could be a reason why you are seeing an empty response. If this does not solve the issue you're experiencing, please open a new issue with reproducing steps.
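To illustrate the distinction, here is a minimal Python sketch that classifies a check from a catalog register payload as node-level or service-level based on whether ServiceID is set. The helper name `check_scope` is hypothetical and only mirrors the rule described above; it is not part of Consul or consul-esm.

```python
def check_scope(check):
    """Return "service" if the check is tied to a service, else "node".

    Mirrors the rule described above: a check without a ServiceID value
    is treated as a node-level check.
    """
    return "service" if check.get("ServiceID") else "node"


# Same shape as the "Check" object in the register payload above.
service_check = {
    "Node": "foo",
    "Name": "service:web1",
    "ServiceID": "web1",
    "Status": "critical",
}

# Identical check with ServiceID omitted: it would not be returned by
# /v1/health/checks/:service because it is not attached to any service.
node_check = {k: v for k, v in service_check.items() if k != "ServiceID"}

print(check_scope(service_check))
print(check_scope(node_check))
```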
I don't think consul-esm would be able to monitor and update aliased service checks anyway without features being added to consul proper, is that correct?
From what I understand, it sounds like the feature you’re interested in is alias health checks such that an external service can have a health check that depends on the health of another external service. If so, from an initial investigation, it looks as though changes should be in consul-esm rather than consul core. ESM currently retrieves all checks, including alias ones, but doesn’t process them. We would need to add code so that ESM handles alias checks rather than logging a warning.
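As a rough illustration of what "handling alias checks" could mean, the Python sketch below derives an alias check's status from the checks of the service it aliases, using a "worst status wins" rule. This is an assumption about the desired semantics, not ESM's code (ESM currently only logs a warning for alias checks), and the function name `alias_status` is hypothetical.

```python
# Severity order used to pick the worst status among the aliased checks.
SEVERITY = {"passing": 0, "warning": 1, "critical": 2}


def alias_status(target_checks):
    """Derive an alias check's status from the aliased service's checks."""
    if not target_checks:
        # Assumption: an aliased service with no checks counts as passing.
        return "passing"
    statuses = [check["Status"] for check in target_checks]
    return max(statuses, key=SEVERITY.__getitem__)


# Example: the third-party service has one failing check, so a service
# aliasing it would be reported as critical too.
github_checks = [{"Status": "passing"}, {"Status": "critical"}]
print(alias_status(github_checks))
```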
Again, you are welcome to open a new issue for this!
Hi @luddite478 and @daktari - sincere apologies for the delayed reply and thanks for your interest in script-type health checks.
I took an initial look into how we could support script-type health checks. This feature would potentially be possible by allowing script-type health checks to be registered via local configuration and not via HTTP request. Would this still satisfy your use case?
To elaborate, the workflow for registering ESM script-type health checks via local configuration would be similar to how agent service health checks can be written as service definitions placed in a configuration directory that Consul loads. The reason for not supporting registration via HTTP request is that it would open up a remote code execution (RCE) security risk. For more details, please take a look at https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations/
This was brought up in a Discuss thread, found here
Hi @lornasong, if I understand you, this local configuration must be placed in consul-esm custom config path?
We are deploying consul-esm with Nomad, and we are thinking about downloading those files and scripts as artifacts into the /local or /secrets directory.
Hi @daktari,
Thanks for the response. I was actually considering that the local configuration be placed in Consul's config directory (specifically the one set by the -config-dir path) rather than consul-esm's config path.
When the Consul agent starts up, it can load the checks in the configuration into its catalog. If the configuration were loaded by consul-esm instead, consul-esm would have to remotely register any script checks with the Consul catalog, which I'm trying to avoid in order to prevent opening up a remote code execution security risk.
Let me know your thoughts!
@daktari Is there a possible architecture where script checks can be registered with consul-esm through local configuration, but then referenced by HTTP registration? It would essentially work as a whitelist of script checks which could be used. This way you'd keep the dynamism of HTTP registration without opening RCE holes.
Sorry for the late response. We are deploying consul-esm with Nomad and registering the services with Terraform. If we could download those scripts at consul-esm deployment time (with the artifact stanza) and reference them in the consul-esm configuration, that would be awesome. Otherwise we have to think about another solution. We build our images with Packer, and it is not very agile to add or modify scripts between image releases.
Oops, I meant to tag @lornasong in my comment
Hi @daktari and @cbroglie, thanks for your responses. I wonder if I miscommunicated my thinking, particularly around the term “script check”, and I want to try to clarify and make sure I understand both of you.
There are two parts to the script check:
A. The JSON configuration to register the external health check with the Consul catalog
B. The actual script code, which might be in a file
The idea that I am proposing is that the configuration (Part A) could be loaded by Consul and the script code (Part B) should be located locally to where ESM is running.
For example, consul will load from -config-dir a registration like:
// Part A
{
  "Node": "script_node",
  "NodeMeta": {
    "external-node": "true"
  },
  …
  "Check": {
    "Node": "script_node",
    "CheckID": "script_esm_check",
    "Definition": {
      "ScriptArgs": [
        "/usr/local/bin/check_mem.py", "-limit", "256MB"
      ]
    }
  }
}
Then ESM learns from consul that it has to run that check. ESM has /usr/local/bin/check_mem.py (Part B) locally and executes the python file.
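To make the "ESM executes the script locally" step concrete, here is a minimal Python sketch of running a ScriptArgs list and mapping the exit code to a health status. It assumes ESM would follow the Consul agent's documented script check convention (exit code 0 is passing, 1 is warning, anything else is critical); ESM has no such feature today, and the function name `run_script_check` is hypothetical.

```python
import subprocess
import sys


def run_script_check(script_args, timeout_s=10.0):
    """Run a script check without a shell and return (status, output)."""
    try:
        proc = subprocess.run(
            script_args, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "critical", "check timed out"
    except OSError as err:
        # For example, the script (Part B) is not present on this host.
        return "critical", f"failed to start check: {err}"

    # Assumed convention, borrowed from Consul agent script checks.
    if proc.returncode == 0:
        status = "passing"
    elif proc.returncode == 1:
        status = "warning"
    else:
        status = "critical"
    return status, proc.stdout.strip()


# Stand-in for ["/usr/local/bin/check_mem.py", "-limit", "256MB"].
print(run_script_check([sys.executable, "-c", "print('ok')"]))
```

Passing the arguments as a list, rather than a single shell string, keeps the check from being interpreted by a shell.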
I think I led you to believe I was proposing that the script code (Part B) should live with Consul. I think you two are saying that the script code (Part B) should be local to ESM. If so, I completely agree. Separately, I want to understand: if the configuration (Part A) lives with Consul, would that work and provide enough flexibility for your use cases?
Please let me know if I still misunderstand anything.
@lornasong I understood your proposal, I was just proposing something else as it wouldn't help out my use case :) Let me elaborate more.
We have a process which synchronizes Kubernetes Ingress resources with an external Consul cluster. The ingress hostnames are registered as external services in Consul, and consul-esm performs HTTP health checks. We have a similar process for LoadBalancer services and TCP health checks, but one blocker is Consul's lack of support for UDP health checks. The initial plan was to implement UDP health checks using a script check and netcat, but I understand the security implications of why you wouldn't want to allow the configuration of arbitrary script checks via HTTP.
Allowing local registration of script checks will help users who have relatively static lists of external services, but it doesn't help my use case where the registration is dynamic and external to the consul-esm nodes. My proposal is to allow local configuration of consul-esm agents to declare which scripts can be used, something like:
{
  "ScriptID": "script_udp_check",
  "Definition": {
    "ScriptArgs": ["/usr/local/bin/check_udp.py", "$ADDRESS:$PORT"]
  }
}
Then the registration call could reference the script by id:
"Checks": [
  {
    "Name": "udp-check",
    "ServiceID": "some-udp-service",
    "status": "passing",
    "Definition": {
      "ScriptID": "script_udp_check",
      "interval": "30s",
      "timeout": "10s"
    }
  }
]
Obviously there is no mechanism like that yet, I was just proposing it as a possible solution. But if my use case is rare, perhaps it would be better to just add first class support for UDP health checks to Consul rather than trying to open up scripting in consul-esm.
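A minimal Python sketch of how such an allowlist lookup might work is shown below. Everything here is hypothetical: the `ALLOWED_SCRIPTS` table stands in for consul-esm's local configuration, `resolve_script_args` is an invented helper, and filling `$ADDRESS` and `$PORT` from the registered service is an assumption about how the placeholders would be populated.

```python
from string import Template

# Hypothetical allowlist, as it might be loaded from consul-esm's local config.
ALLOWED_SCRIPTS = {
    "script_udp_check": ["/usr/local/bin/check_udp.py", "$ADDRESS:$PORT"],
}


def resolve_script_args(script_id, address, port):
    """Expand an allowlisted script's argument template for one service."""
    if script_id not in ALLOWED_SCRIPTS:
        # HTTP registration can only reference scripts declared locally.
        raise ValueError(f"script {script_id!r} is not allowlisted")
    values = {"ADDRESS": address, "PORT": str(port)}
    # Each argument is substituted on its own and never passed through a
    # shell, so registration data cannot add extra arguments or commands.
    return [Template(arg).substitute(values) for arg in ALLOWED_SCRIPTS[script_id]]


print(resolve_script_args("script_udp_check", "10.0.0.5", 53))
```

With this shape, the registration call carries only a ScriptID and data values, while the executable path stays under the control of whoever manages the ESM host.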
Hi @cbroglie,
Thanks for the response and all the details about your setup. That’s really helpful!
From reading your use case, my initial thought is to support first-class UDP external health checks as you suggested and keep script checks less dynamic for now (still open to hearing use cases).
It seems like script checks are generally custom and targeted and not so much for dynamic nodes/services. That being said, I’d want continued input from you and the community on how esm is actually being used! It seems like the fact that you’re using this to do a UDP check is the root of the need for dynamic script checks, which is why I lean towards the first-class UDP support.
Regarding your proposal of the ScriptID definition that would live on esm, am I right in understanding that $ADDRESS:$PORT would be the address + port of some-udp-service and would be retrieved from esm's environment? Right now, I'm not sure I follow how $ADDRESS:$PORT will be dynamic and capture the address + port if another service came online. For example, if some-udp-service-2 comes online, would you mind describing more how that would play out? Would there be another esm for this service that has its address + port in its environment?
If you think this all makes sense and I don't misunderstand again :), I’ll update this issue’s title to be specifically around script checks and open a new issue specifically for a UDP health check feature request. Let me know your thoughts.
@lornasong Yep, you understood me correctly and I agree a feature request for UDP health checks is probably the right path forward for my use case.
Regarding your proposal of the ScriptID definition that would live on esm, am I right in understanding that $ADDRESS:$PORT would be of the address + port for the some-udp-service and would be retrieved from esm's environment?
I was thinking these could be populated from the Address and Service.Port values in the /catalog/register call, but I haven't actually thought through whether something like that is possible. The idea is that they are dynamic, not tied to a static check.
Hi @cbroglie, thanks for the reply checking my understanding.
I was thinking about this some more. It occurred to me that esm does support a UDP check when you set the node metadata "external-probe": "true" (see the example below). This performs a regular UDP ping that is intended to be similar to a Consul agent’s serf health check, and the result is captured in the externalNodeHealth check. I’m not sure of the details of your use case but wanted to raise it just in case. Do you think this would be helpful for your use case?
{
  "Node": "http_node",
  "Address": "http://localhost:8400",
  "NodeMeta": {
    "external-node": "true",
    "external-probe": "true"
  },
  "Check": {
    …
  }
}
Oh that's great, I do think that will work! I had initially rejected that thinking it was an ICMP ping, but I see now it defaults to a UDP ping, which should work. Thanks!
Edit: never mind, I didn't fully understand UDP ping checks at first, but it's not going to help me. Thanks for pointing it out, though.
@cbroglie thanks for the follow-up. Would you be able to share why the UDP external probe didn’t work out for you? I’d like to better understand what you’re looking for in a UDP health check - maybe there’s something that requires scripting? - since the UDP external probe seems insufficient for your use case.
The UDP probe doesn't solve my case because it's a reflection of overall node health, not whether a service on a specific port is healthy. It does suffice for a reachability check, which is better than nothing. But to do better with UDP I think you need to be able to define the requests and expected responses like the nginx examples (this is roughly analogous to HTTP health checks).
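For illustration, a request/expected-response UDP check of that kind could look roughly like the Python sketch below. It is only a sketch under stated assumptions: the function name `udp_check`, its parameters, and the "reply must contain the expected bytes" rule are all hypothetical and do not describe any existing Consul or consul-esm feature.

```python
import socket


def udp_check(host, port, request, expected, timeout_s=2.0):
    """Send `request` over UDP; pass only if the reply contains `expected`."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.sendto(request, (host, port))
            reply, _ = sock.recvfrom(4096)
        except OSError:
            # Covers timeouts (no reply) and errors such as port unreachable.
            return "critical"
    return "passing" if expected in reply else "critical"
```

Unlike a node-level probe, this targets one service port and verifies application-level behavior, at the cost of needing a request and expected response defined per protocol.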
@cbroglie, thank you, that is helpful to know. I created https://github.com/hashicorp/consul-esm/issues/59 to capture this feature. Let's continue any conversation over in that issue. Please share any new feedback, or any details that I missed, in that issue. Thank you!
@luddite478, I am planning to update the name of your issue to be specifically about script checks to match your description and narrow the scope of this issue. Please let me know if you have any concern. Also, if you had any feedback on my comment, please don't hesitate to let me know.
@daktari, it seems like this comment may still be relevant to your use case. It regards a possible misunderstanding around 'script checks'. If you have feedback, please let me know.
Thank you
Consul and consul-esm provide a basic framework for service/process inspection; non-TCP/HTTP checks could be used to check the status of a process (running or exited).