terraform-provider-grafana

Terraform apply fails if it can't hit the grafana front end

Open · alyssaburlton opened this issue · 3 comments

Summary

The way that Terraform refreshes Grafana resources (e.g. a data source or a dashboard) relies on the front-end Grafana API. This causes it to fail in cases where the resources themselves are actually fine, e.g. because the DNS record pointing at Grafana is currently down.

In my case, I was making a change that involved deleting and recreating an instance of Traefik in our Kubernetes cluster, which was responsible for routing to Grafana (among other things). I deleted our Traefik helm release, then ran terraform apply (in order to reinstate it) but was blocked because of the grafana resources timing out during the check.

Ideally, the check would not rely on front-end availability. If this isn't possible, providing some option to say "yeah, I was expecting that, please carry on and ignore those resources" would be acceptable. As things currently stand, I ended up doing the following (a rough command sketch follows the list):

  • Running terraform state rm to purge all the affected Grafana resources from state.
  • Running terraform apply to provision Traefik (which then, of course, got errors trying to create Grafana resources that already existed).
  • Manually resolving the conflict I had now created, by deleting the whole Grafana release and letting Terraform reprovision it.
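Roughly, those steps as commands (the resource addresses here are illustrative, not the exact ones from my state):

# Purge the affected Grafana resources from state (example addresses).
terraform state rm grafana_data_source.elasticsearch
terraform state rm grafana_dashboard.my_dashboard
terraform state rm grafana_alert_notification.slack

# Reinstate Traefik; the Grafana resources then error on create
# because they still exist in Grafana but are no longer in state.
terraform apply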

This was far from ideal!

Terraform Version

Terraform v0.11.13
+ provider.grafana v1.3.0

Affected Resource(s)

  • grafana_alert_notification
  • grafana_data_source
  • grafana_dashboard

Terraform Configuration Files

provider "grafana" {
  url  = "my_url"
  auth = "admin:${var.grafana_password}"
  version = "~> 1.3"
}

resource "grafana_data_source" "elasticsearch" {
  type          = "elasticsearch"
  name          = "elasticsearch"
  url           = "http://elasticsearch-master.elasticsearch:9200"
  is_default    = true
  depends_on    = ["helm_release.elasticsearch", "helm_release.grafana"]
}

resource "grafana_dashboard" "my_dashboard" {
  config_json   = "${file("my-dashboard.json")}"
  depends_on    = ["helm_release.elasticsearch", "helm_release.grafana"]
}
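The grafana_alert_notification.slack resource that appears in the debug output below isn't shown above; a minimal sketch of what it might have looked like (the Slack webhook variable name is hypothetical):

resource "grafana_alert_notification" "slack" {
  name = "slack"
  type = "slack"

  settings = {
    url = "${var.slack_webhook_url}" # hypothetical variable
  }

  depends_on = ["helm_release.grafana"]
}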

Debug Output

I've copied out a subset of the errors and redacted IPs/passwords etc, but you get the idea:

Error: Error refreshing state: 2 error(s) occurred:

* grafana_alert_notification.slack: 1 error(s) occurred:

* grafana_alert_notification.slack: grafana_alert_notification.slack: Get https://admin:<password>@my_url/api/alert-notifications/1: dial tcp <IP>:443: i/o timeout

* module.efk_prod.grafana_data_source.elasticsearch: 1 error(s) occurred:

* module.efk_prod.grafana_data_source.elasticsearch: grafana_data_source.elasticsearch: Get https://admin:<password>@my_url/api/datasources/1: dial tcp <IP>:443: i/o timeout

Expected Behavior

Terraform should have queried the Grafana resources directly (not over the front-end connection) and found that all the resources were still fine.

Actual Behavior

Terraform checked the status of the various Grafana resources by connecting via the front end, which was down due to a deliberate config change being made to Traefik. These checks timed out and the terraform apply was aborted.

Steps to Reproduce

  1. Provision Grafana with one of the listed resources (e.g. a data source).
  2. Do something that causes the front end to be unavailable (in my case, I deleted the Traefik helm release used for routing, but you could also delete the relevant DNS record, etc.).
  3. Attempt a terraform apply - it fails because Terraform times out trying to reach the unreachable endpoint.

alyssaburlton · May 01 '19

Hey @alyssaburlton, I'm not sure what you mean by Grafana's "front end" in this context. From my perspective, this would be expected behavior: if the Grafana server isn't able to accept traffic, the provider cannot retrieve the current state of the resources to reconcile any differences. Is there something I'm missing here?

mlclmj · Apr 18 '20

I guess what I'm asking for would be one of the following:

  • An alternative way to configure the provider that isn't by URL. In my case, our Grafana instances are provisioned into Kubernetes; having a way to authenticate to the cluster and communicate directly with the running Grafana instance would decouple these resources from DNS. If I provision a DNS change that leads to Grafana temporarily not being available via its URL, Terraform would still be able to run and make subsequent changes (as the actual underlying Grafana pod is fine, and Terraform would know that).

  • Failing the above, a way to stop these errors from blowing up Terraform would be appreciated. They should still be logged, but currently they cause the entire terraform apply to fail, which basically prevents you from doing anything to fix the problem (you end up resorting to -target, for example; sketched below). In the case where my DNS provider has temporarily flaked, I would want Terraform to come back with warnings along the lines of "Unable to verify resource grafana_dashboard.my_dashboard: timed out connecting to blah", but to still allow me to apply changes to the resources it was able to verify.
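For illustration, the -target escape hatch mentioned above looks roughly like this (the Traefik resource address is hypothetical):

# Apply only the Traefik release; with -target, Terraform limits the
# refresh and plan to that resource and its dependencies, so the
# unreachable Grafana resources are not touched.
terraform apply -target=helm_release.traefik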

ghost · Apr 18 '20

Ah, I think I understand now, thanks for the clarity.

An alternative way to configure the provider which isn't by URL

Is there a way you can think of that we'd be able to do this? I'm not aware of one, though if you can come up with a reliable method, we'd welcome a PR.

Failing the above, a way to stop these errors from blowing up terraform would be appreciated.

This is definitely expected behavior and something we can't change. How I've seen this handled in most cases is either manual manipulation of the DAG to ensure resources are updated in the order needed, or separate Terraform runs that split resources into logical "layers" of your stack. A common approach is one run to provision load balancers, networking, etc. and another to manage the compute and application layers of resources.
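A minimal sketch of that layering, assuming two separate root modules applied in separate runs (directory names, resource names, and the chart reference are illustrative, not from this thread):

# routing/main.tf - ingress layer, applied first in its own run
resource "helm_release" "traefik" {
  name      = "traefik"
  namespace = "kube-system"
  chart     = "stable/traefik"
}

# grafana/main.tf - application layer, applied in a second run once
# the ingress layer is healthy
provider "grafana" {
  url  = "my_url"
  auth = "admin:${var.grafana_password}"
}

resource "grafana_dashboard" "my_dashboard" {
  config_json = "${file("my-dashboard.json")}"
}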

mlclmj · Apr 18 '20

Agree with the previous response (from 2020 😄). This is something we won't do. Decoupling networking and infrastructure Terraform projects from application Terraform projects should be the way to go.

julienduchesne · Aug 30 '22