# Healthchecks created via Terraform do not work
### Terraform Version

```
> terraform -v
Terraform v0.12.24
+ provider.consul v2.6.1

> consul -v
Consul v1.7.2
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
```
### Affected Resource(s)

- `consul_service`
### Terraform Configuration Files

```hcl
provider "consul" {
  address    = "https://srv-pro-svrg-01"
  datacenter = "dc1"
  version    = "~> 2.6"
}

data "consul_nodes" "read-dc1-nodes" {
  query_options {
    # Optional parameter: implicitly uses the current datacenter of the agent
    datacenter = "dc1"
  }
}

resource "consul_service" "redis" {
  name = "redis"
  node = "srv-pro-svrg-01"
  port = 6379

  check {
    check_id                          = "service:redis1"
    name                              = "Redis health check"
    status                            = "passing"
    http                              = "https://www.hashicorptest.com"
    tls_skip_verify                   = false
    method                            = "PUT"
    interval                          = "5s"
    timeout                           = "1s"
    deregister_critical_service_after = "90m"

    header {
      name  = "foo"
      value = ["test"]
    }

    header {
      name  = "bar"
      value = ["test"]
    }
  }
}
```
### Debug Output
https://gist.github.com/vector623/d193f3292790bf7f1119c57bafd4e561
### Expected Behavior

The health check should run successfully. If it fails, the service should not be deregistered for 90 minutes.
### Actual Behavior

The health check fails and the service is deregistered within a minute.
### Steps to Reproduce

1. `terraform init`
2. `terraform apply -auto-approve`
### Important Factoids
- Running on-prem on Ubuntu 18.04 VMWare guest.
- Experienced same issue with Ubuntu 18.04 in GCP.
- Health checks created via HTTP/curl work fine.
### References
- GH-124: https://github.com/terraform-providers/terraform-provider-consul/issues/124
_User closed the issue before providing any further details._
---

Hi @vector623, thanks for opening this issue. It seems that the health check may not be the problem, as the issue still appears when I try without it.

I will investigate and let you know what I find.

---

Hi @vector623, we've looked into it, and I think you are trying to register a service on a node where a Consul agent is running (an internal service). The `consul_service` resource was created to register external services: it adds the service to the Consul catalog, but not to the local catalog of the agent. When the agent performs its anti-entropy sync, it finds a service in the catalog that it knows nothing about and removes it:
```
Mar 25 18:57:23 srv-pro-schd-05 consul[32576]: 2020-03-25T18:57:23.181Z [DEBUG] agent: Node info in sync
Mar 25 18:57:23 srv-pro-schd-05 consul[32576]: 2020-03-25T18:57:23.183Z [INFO]  agent: Deregistered service: service=redis
Mar 25 18:57:23 srv-pro-schd-05 consul[32576]: 2020-03-25T18:57:23.184Z [INFO]  agent: Deregistered check: check=service:redis1
```
The documentation of the provider (https://www.terraform.io/docs/providers/consul/r/service.html) mentions this briefly:

> If the Consul agent is running on the node where this service is registered, it is not recommended to use this resource.
This is not related to the health check; you should see the same behaviour when registering the service without the health checks.

You mentioned that the same service created using cURL works. I think you are registering it with the `/v1/agent/service/register` endpoint rather than the `/v1/catalog/register` endpoint that `consul_service` uses. Could you confirm that?
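To illustrate the difference between the two endpoints, here is a minimal sketch, assuming a local agent reachable at `127.0.0.1:8500` (the service definition payloads are simplified, not the reporter's exact ones):

```shell
# Register with the local agent: the service enters the agent's own
# catalog, so it survives the agent's anti-entropy syncs.
curl -X PUT http://127.0.0.1:8500/v1/agent/service/register \
  -d '{"Name": "redis", "Port": 6379}'

# Register directly in the catalog (what consul_service does): the agent
# running on that node knows nothing about the service and will remove it
# on its next anti-entropy sync.
curl -X PUT http://127.0.0.1:8500/v1/catalog/register \
  -d '{"Node": "srv-pro-svrg-01", "Address": "127.0.0.1",
       "Service": {"Service": "redis", "Port": 6379}}'
```

The catalog endpoint is intended for nodes that do not run an agent; on agent-managed nodes the agent is the source of truth.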
The `consul_agent_service` resource can be used to create an internal service, but it is marked as deprecated and does not support health checks at the moment. I'm wondering if we should roll back this deprecation.
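For internal services, the usual alternative to Terraform is a service definition file loaded by the agent itself. A sketch mirroring the Redis example above (file placement and agent reload are assumed to follow your existing agent setup):

```json
{
  "service": {
    "name": "redis",
    "port": 6379,
    "check": {
      "id": "service:redis1",
      "name": "Redis health check",
      "http": "https://www.hashicorptest.com",
      "method": "PUT",
      "interval": "5s",
      "timeout": "1s",
      "deregister_critical_service_after": "90m"
    }
  }
}
```

Because the definition lives in the agent's own configuration, the service is part of the agent's local catalog and is not removed by anti-entropy.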
---

@remilapeyre wouldn't it be possible to combine the two and abstract that complexity away from users? Healthchecks ftw!!
---

Still cannot get TCP health checks working, let alone HTTP health checks. Let's take two services as an example: Prometheus, which has to be checked with a TCP check on port 9090, and Grafana, which can be checked with a `GET /api/health` request on port 3000.
Tested on Consul v1.15.3
### Curl Checks

I have Prometheus running on IP 192.168.55.120:

```
$ curl -i 192.168.55.120:9090
HTTP/1.1 302 Found
Content-Type: text/html; charset=utf-8
Location: /graph
Date: Tue, 11 Jul 2023 18:15:55 GMT
Content-Length: 29

<a href="/graph">Found</a>.
```

Since Prometheus is answering HTTP requests, it is certainly accepting TCP connections on that port.
I have Grafana running on IP 192.168.55.121:

```
$ curl -i 192.168.55.121:3000/api/health
HTTP/1.1 200 OK
Cache-Control: no-store
Content-Type: application/json; charset=UTF-8
X-Content-Type-Options: nosniff
X-Frame-Options: deny
X-Xss-Protection: 1; mode=block
Date: Tue, 11 Jul 2023 18:17:30 GMT
Content-Length: 71

{
  "commit": "5a30620b85",
  "database": "ok",
  "version": "10.0.1"
}
```

Grafana is working as well.
### Configuring healthchecks with terraform-provider-consul

Now let's create the necessary service and health check resources.
#### Prometheus configuration

Configuring health checks for Prometheus:

```hcl
resource "consul_node" "node" {
  count      = 1
  datacenter = "dc1"
  address    = "192.168.55.120"
  name       = "prometheus01"
}

resource "consul_service" "svc" {
  count      = 1
  name       = "prometheus01"
  node       = "prometheus01"
  address    = "192.168.55.120"
  datacenter = "dc1"
  port       = 9090

  check {
    check_id                          = "service:prometheus01"
    name                              = "Prometheus Health Check"
    notes                             = "Checks for a TCP connection on port 9090"
    tcp                               = "192.168.55.120:9090"
    interval                          = "10s"
    timeout                           = "2s"
    deregister_critical_service_after = "60s"
  }
}
```
#### Prometheus results
#### Grafana configuration

Configuring health checks for Grafana:

```hcl
resource "consul_node" "node" {
  datacenter = "dc1"
  address    = "192.168.55.121"
  name       = "grafana01"
}

resource "consul_service" "svc" {
  name       = "grafana01"
  node       = "grafana01"
  address    = "192.168.55.121"
  datacenter = "dc1"
  port       = 3000

  check {
    check_id                          = "service:grafana01"
    name                              = "Grafana Health Check"
    http                              = "/api/health"
    notes                             = "Checks for a GET /api/health request on port 3000"
    tls_skip_verify                   = true
    method                            = "GET"
    interval                          = "10s"
    timeout                           = "2s"
    deregister_critical_service_after = "30s"

    header {
      name  = "Accept"
      value = ["application/json"]
    }
  }
}
```
#### Grafana results
### Conclusion

Given what has been demonstrated above, I have three questions:

- How is this still an issue after 3 years?
- As our company is in the process of adopting Enterprise versions of HashiCorp products, what will support look like, taking the first question into account?
- Is question number two rhetorical?

Relevant issues: #124
---

Hi @mbrav. Not sure if I'm doing archaeology here, but I just struggled through this myself. This looks like a non-issue to me, although it didn't at first. It's a non-issue because, although the service and the service health check are declared, there is no external service monitor to actually perform the health checks.

I run consul-esm on my Nomad cluster to perform the health checks.

So registered services start off critical, but are updated to healthy as they are discovered by consul-esm and their health checks are performed.
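One detail worth calling out: consul-esm only monitors nodes registered with the `external-node` node meta, so the catalog registrations above need that meta for ESM to pick them up. A sketch reusing the Grafana node from the example above (the meta keys are the ones documented by consul-esm):

```hcl
resource "consul_node" "grafana" {
  name    = "grafana01"
  address = "192.168.55.121"

  meta = {
    # Tells consul-esm that no Consul agent runs on this node,
    # so its checks should be performed externally.
    "external-node"  = "true"
    # Lets consul-esm also ping the node to determine node health.
    "external-probe" = "true"
  }
}
```

Without a running consul-esm (or another external monitor), checks registered through the catalog are never executed and stay in whatever status they were registered with.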