terraform-provider-grafana
Terraform apply fails if it can't hit the grafana front end
Summary
The way that terraform checks Grafana resources (e.g. a data source or a dashboard) relies on the front-end Grafana API. This causes it to fail in cases where the resources themselves are actually fine, e.g. because the DNS record or routing in front of Grafana is temporarily unavailable.
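For context, the refresh that fails here is plain HTTP against whatever `url` the provider block points at; roughly equivalent to the following (the host and credentials are placeholders, matching the redacted error output below):

```sh
# Roughly what the provider does on refresh: authenticated GETs against the
# configured Grafana URL. "my_url" and the credentials are placeholders.
curl -u "admin:${GRAFANA_PASSWORD}" "https://my_url/api/datasources/1"
curl -u "admin:${GRAFANA_PASSWORD}" "https://my_url/api/alert-notifications/1"
```

If that URL doesn't resolve or route, every grafana_* resource fails to refresh, even though Grafana itself is healthy.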
In my case, I was making a change that involved deleting and recreating an instance of Traefik in our Kubernetes cluster, which was responsible for routing to Grafana (among other things). I deleted our Traefik helm release, then ran `terraform apply` (in order to reinstate it), but was blocked because the grafana resources timed out during the check.
Ideally, the check would not rely on front-end availability. If this isn't possible, providing some option to say "yeah, I was expecting that, please carry on and ignore those resources" would be acceptable. As things currently stand, I ended up:
- Running `terraform state rm` to purge all the affected Grafana resources.
- Running `terraform apply` to provision Traefik (which obviously then got errors trying to create grafana resources that actually already existed).
- Manually resolving the conflict I had now created by deleting the whole grafana release and letting terraform reprovision it.
This was far from ideal!
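For reference, the recovery sequence was roughly the following (the resource addresses are the ones from the error output below; exact details will vary per setup):

```sh
# Rough sketch of the recovery steps described above. The resource addresses
# are taken from the error output; adjust them for your own state.
terraform state rm grafana_alert_notification.slack \
                   grafana_data_source.elasticsearch \
                   grafana_dashboard.my_dashboard     # forget the unreachable resources
terraform apply   # reinstates Traefik, then errors on the already-existing grafana_* resources
# ...followed by manually deleting the whole Grafana release and running
# `terraform apply` again so terraform could reprovision it cleanly.
```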
Terraform Version
```
Terraform v0.11.13
+ provider.grafana v1.3.0
```
Affected Resource(s)
- grafana_alert_notification
- grafana_data_source
- grafana_dashboard
Terraform Configuration Files
provider "grafana" {
url = "my_url"
auth = "admin:${var.grafana_password}"
version = "~> 1.3"
}
resource "grafana_data_source" "elasticsearch" {
type = "elasticsearch"
name = "elasticsearch"
url = "http://elasticsearch-master.elasticsearch:9200"
is_default = true
depends_on = ["helm_release.elasticsearch", "helm_release.grafana"]
}
resource "grafana_dashboard" "my_dashboard" {
config_json = "${file("my-dashboard.json")}"
depends_on = ["helm_release.elasticsearch", "helm_release.grafana"]
}
Debug Output
I've copied out a subset of the errors and redacted IPs/passwords etc, but you get the idea:
```
Error: Error refreshing state: 2 error(s) occurred:

* grafana_alert_notification.slack: 1 error(s) occurred:
* grafana_alert_notification.slack: grafana_alert_notification.slack: Get https://admin:<password>@my_url/api/alert-notifications/1: dial tcp <IP>:443: i/o timeout

* module.efk_prod.grafana_data_source.elasticsearch: 1 error(s) occurred:
* module.efk_prod.grafana_data_source.elasticsearch: grafana_data_source.elasticsearch: Get https://admin:<password>@my_url/api/datasources/1: dial tcp <IP>:443: i/o timeout
```
Expected Behavior
Terraform should have queried the grafana resources directly (not over the front-end connection) and found that all the resources were still fine.
Actual Behavior
Terraform checked the status of the various grafana resources by connecting via the front end, which was down due to a deliberate config change being made to Traefik. These checks timed out and the `terraform apply` was aborted.
Steps to Reproduce
- Provision grafana with one of the listed resources (e.g. a data source).
- Do something that causes the front end to be unavailable (in my case, I deleted the Traefik helm release used for routing, but you could also delete the relevant DNS record, etc.).
- Attempt a `terraform apply` - it fails because terraform times out trying to reach the now-unreachable endpoint while refreshing the grafana resources.
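A concrete sketch of steps 2 and 3 (this assumes Helm 2 and a routing release named "traefik"; any change that makes the provider's `url` unreachable reproduces the same failure):

```sh
# Illustrative repro: take down whatever routes traffic to Grafana, then
# run terraform. Release name "traefik" and Helm 2 syntax are assumptions.
helm delete traefik --purge
terraform apply   # fails during refresh: "dial tcp <IP>:443: i/o timeout" on the grafana_* resources
```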
Hey @alexburlton, I'm not sure what you mean by Grafana's "front end" in this context. From my perspective, this would be expected behavior: if the Grafana server isn't able to accept traffic, the provider cannot retrieve the current state of the resources to reconcile any differences. Is there something I'm missing here?
I guess what I'm asking for would be one of the following:
- An alternative way to configure the provider which isn't by URL. In my case, our Grafana instances are provisioned into Kubernetes - having a way to auth to the cluster and communicate directly with the running Grafana instance would decouple these resources from DNS. If I provision a change to the DNS that leads to Grafana temporarily not being available via its URL, terraform would still be able to run and make subsequent changes (as the actual underlying Grafana pod is fine, and tf would know that).
- Failing the above, a way to stop these errors from blowing up terraform would be appreciated. They should still be logged, but currently they cause the entire `terraform apply` to fail, which basically prevents you from doing anything to fix the problem (you have to start resorting to using `-target`, for example). In the case where my DNS provider has temporarily flaked, I would want terraform to come back with warnings like "Unable to verify resource `grafana_dashboard.my_dashboard`: timed out connecting to blah" but still allow me to apply changes to the resources it was able to verify.
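To make the first option concrete, the closest workaround I have today without new provider features is to bypass DNS entirely and point the provider at the in-cluster service through a tunnel (the service name, namespace and port here are assumptions about my cluster):

```sh
# Workaround sketch: talk to Grafana without going through DNS/Traefik by
# tunnelling to the in-cluster service. Names and ports are assumptions.
kubectl port-forward svc/grafana 3000:80 -n monitoring &
# ...then point the provider's url at http://localhost:3000 for this run.
```

And for the second option, the only escape hatch at the moment seems to be targeting, e.g. (the resource address is just an example from my setup):

```sh
# Hypothetical -target workaround: apply only the Traefik release so the run
# never tries to refresh the unreachable grafana_* resources.
terraform apply -target=helm_release.traefik
```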
Ah, I think I understand now, thanks for the clarity.
> An alternative way to configure the provider which isn't by URL

Is there a way you can think of that we'd be able to do this? I'm not aware of any, though if you can come up with a reliable method, we'd welcome a PR.
> Failing the above, a way to stop these errors from blowing up terraform would be appreciated.

This is definitely expected behavior and something we can't change. How I've seen this handled in most cases is either manual manipulation of the DAG to ensure resources are updated in the order needed, or separate terraform runs that split resources into logical "layers" of your stack. Common ways I've seen this done are having one run to provision load balancers, networking, etc. and another to manage the compute and application layers of resources.
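As a rough sketch of that layering (directory names and contents are purely illustrative), the routing/networking resources and the Grafana resources live in separate root modules applied as separate runs, so an outage in one layer never blocks refreshing the other:

```sh
# Illustrative split into two independent Terraform runs. The paths and the
# resources named in the comments are assumptions, not a prescribed layout.
(cd stacks/networking    && terraform apply)   # Traefik helm release, DNS, load balancers
(cd stacks/observability && terraform apply)   # grafana_data_source, grafana_dashboard, ...
```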
Agree with the previous response (from 2020 😄). This is something we won't do. Decoupling networking and infrastructure Terraform projects from application Terraform projects should be the way to go.