consul-terraform-sync

CTS executing the task without any updates to the service

devarshishah3 opened this issue 4 years ago • 7 comments

Describe the bug

This bug was found in a case involving a PANW (Palo Alto Networks) resource. Tasks are executed over and over again even when the services in the catalog do not change state. This is especially concerning for a PANW module, because CTS performs an implicit "commit" every time a task runs, even when nothing has changed.

Versions

Consul Terraform Sync

 0.1.0-techpreview1 (2862363)
Compatible with Terraform ~>0.13.0

Consul Version

Consul 1.8.4

Terraform Version

Terraform v0.13.3

Configuration File(s)

Reminder to redact any sensitive information that may be present in this file

Config file:
log_level = "INFO"
consul {
  address = "34.206.3.65:8500"
}

driver "terraform" {
  log = true
  required_providers {
    panos = {
      source = "PaloAltoNetworks/panos"
      version = "1.6.3"
    }
  }
}

provider "panos" {
  alias = "panos1" 
  hostname = "<redacted>"
  api_key  = "<redacted>" 
}

task {
  name        = "automate-address-group"
  description = "automate address group and dynamic address tags"
  source      = "PaloAltoNetworks/ag-dag-nia/panos"
  #version    = "0.1.0"
  providers   = ["panos.panos1"]
  services    = ["web","api"]
}

Terraform Configuration Files Generated by Consul-Terraform-Sync

Reminder to redact any sensitive information that may be present in the files

main.tf:
// Copy generated main.tf here
terraform.tfvars:
# This file is generated by Consul Terraform Sync.
#
# The HCL blocks, arguments, variables, and values are derived from the
# operator configuration for Sync. Any manual changes to this file
# may not be preserved and could be overwritten by a subsequent update.

panos = {
  alias    = "panos1"
  api_key  = "<redacted>"
  hostname = "<redacted>"
}

services = {
  "api2.i-03c86bc24c675126b.us-east-1" : {
    id              = "api2"
    name            = "api"
    address         = "172.31.75.182"
    port            = 8090
    meta            = {}
    tags            = ["api", "dc1", "test"]
    namespace       = null
    status          = "passing"
    node            = "i-03c86bc24c675126b"
    node_id         = "34c7c895-fbb9-1a86-5f47-75283a349d81"
    node_address    = "172.31.75.182"
    node_datacenter = "us-east-1"
    node_tagged_addresses = {
      lan      = "172.31.75.182"
      lan_ipv4 = "172.31.75.182"
      wan      = "172.31.75.182"
      wan_ipv4 = "172.31.75.182"
    }
    node_meta = {
      consul-network-segment = ""
    }
  },
  "api0.i-051eb4d7c2ab0e661.us-east-1" : {
    id              = "api0"
    name            = "api"
    address         = "172.31.7.19"
    port            = 8090
    meta            = {}
    tags            = ["api", "dc1", "test"]
    namespace       = null
    status          = "passing"
    node            = "i-051eb4d7c2ab0e661"
    node_id         = "a78f2ff3-91aa-90aa-6f39-4726522e872e"
    node_address    = "172.31.7.19"
    node_datacenter = "us-east-1"
    node_tagged_addresses = {
      lan      = "172.31.7.19"
      lan_ipv4 = "172.31.7.19"
      wan      = "172.31.7.19"
      wan_ipv4 = "172.31.7.19"
    }
    node_meta = {
      consul-network-segment = ""
    }
  },
  "api1.i-054002ad10be452a7.us-east-1" : {
    id              = "api1"
    name            = "api"
    address         = "172.31.52.231"
    port            = 8090
    meta            = {}
    tags            = ["api", "dc1", "test"]
    namespace       = null
    status          = "passing"
    node            = "i-054002ad10be452a7"
    node_id         = "8f5e3fad-b5f5-bb89-409e-e52dcfb97314"
    node_address    = "172.31.52.231"
    node_datacenter = "us-east-1"
    node_tagged_addresses = {
      lan      = "172.31.52.231"
      lan_ipv4 = "172.31.52.231"
      wan      = "172.31.52.231"
      wan_ipv4 = "172.31.52.231"
    }
    node_meta = {
      consul-network-segment = ""
    }
  },
  "web0.i-0bff749211d19957b.us-east-1" : {
    id              = "web0"
    name            = "web"
    address         = "172.31.10.35"
    port            = 80
    meta            = {}
    tags            = ["dc1", "nginx", "test", "web"]
    namespace       = null
    status          = "passing"
    node            = "i-0bff749211d19957b"
    node_id         = "23722d1e-2c3b-ed68-8a7f-d14e2e2c4964"
    node_address    = "172.31.10.35"
    node_datacenter = "us-east-1"
    node_tagged_addresses = {
      lan      = "172.31.10.35"
      lan_ipv4 = "172.31.10.35"
      wan      = "172.31.10.35"
      wan_ipv4 = "172.31.10.35"
    }
    node_meta = {
      consul-network-segment = ""
    }
  }
}

Terraform Module

Module identified in the config file

Task Variable Files

No additional task variables file

Expected Behavior

Task should not execute if there is no change in the state of the services.

Actual Behavior

Tasks are being executed at a regular interval. My suspicion is that a task is triggered every time Consul runs a health check, even though the status of the service instance has not changed.

Steps to Reproduce

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Additional Context

Add any other context about the problem here.

devarshishah3 • Oct 22 '20 17:10

Hi @devarshishah3, can you provide the health check definitions for the 4 services?

I ran into this recently as well, where the task would be executed after every health check interval. In my case, it turned out that the configured health check was an HTTP check hitting a URL that returned a slightly different response body each time. The body changed by only a few characters, but that was enough for service monitoring to consider the health status different (source code). Does this sound like the scenario you are observing?

findkim • Oct 22 '20 17:10

Yes, this sounds similar. But from a health status point of view, that slight change in the body does not affect the overall status, which remains passing.

devarshishah3 • Oct 22 '20 18:10

Health check definitions for all 4 services are pretty similar:

"check": {
  "name": "web1 check",
  "service_id": "web1",
  "http": "https://google.com",
  "method": "GET",
  "interval": "10s",
  "timeout": "1s"
}

devarshishah3 • Oct 22 '20 18:10

Good to know we're seeing the same issue. Yes, I would agree that this is overly sensitive. Until this is addressed, a stop-gap would be to change the endpoint "https://google.com" to a more reasonable URL that does not change its body on subsequent requests, especially for placeholder health check endpoints that are not related to the health of the service.
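
For example, a placeholder check along these lines, pointing at an endpoint whose response body stays constant between requests, would avoid the spurious triggers (the URL below is only an illustration, not a recommendation for a specific endpoint):

"check": {
  "name": "web1 check",
  "service_id": "web1",
  "http": "http://localhost:8080/static-status",
  "method": "GET",
  "interval": "10s",
  "timeout": "1s"
}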

findkim • Oct 22 '20 18:10

Hey @devarshishah3, I wanted to update you with my thoughts on this issue.

I spent some time triaging and looking into potential solutions. It seems like a new API for the service template function in hcat (source) would be an option: it could skip changes to the check response body in order to avoid triggering tasks unnecessarily. A scenario I do want to be mindful of is the legitimate use case of templating the value of an HTTP check response body. Adding a new template function, instead of patching the current functionality, would avoid introducing changes into Consul Template that may not be desirable for people who rely on the HTTP check response body.
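
To illustrate the idea, here is a minimal hypothetical sketch (not hcat or CTS code, and not a proposed API): the decision to re-run a task would be based on a fingerprint of the service fields a task actually consumes, leaving the raw check response body out entirely.

// Hypothetical sketch: fingerprint only the service fields a task depends on,
// deliberately excluding the raw check response body, so cosmetic changes in
// the check output do not register as a change.
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// serviceSnapshot is the subset of Consul health data to compare.
// The check Output (response body) is intentionally not included.
type serviceSnapshot struct {
	ID      string   `json:"id"`
	Name    string   `json:"name"`
	Address string   `json:"address"`
	Port    int      `json:"port"`
	Status  string   `json:"status"` // aggregated check status, e.g. "passing"
	Tags    []string `json:"tags"`
}

// fingerprint returns a stable hash of the snapshot fields above.
func fingerprint(svcs []serviceSnapshot) ([32]byte, error) {
	b, err := json.Marshal(svcs)
	if err != nil {
		return [32]byte{}, err
	}
	return sha256.Sum256(b), nil
}

func main() {
	prev := []serviceSnapshot{{ID: "web0", Name: "web", Address: "172.31.10.35", Port: 80, Status: "passing", Tags: []string{"web"}}}
	// Upstream, only the check response body changed; the fields compared here
	// are identical, so the fingerprints match and no task would run.
	next := []serviceSnapshot{{ID: "web0", Name: "web", Address: "172.31.10.35", Port: 80, Status: "passing", Tags: []string{"web"}}}

	fp1, _ := fingerprint(prev)
	fp2, _ := fingerprint(next)
	fmt.Println("trigger task:", fp1 != fp2) // false: no meaningful change
}

Anything that should still trigger a task (address, port, tags, overall status) stays in the fingerprint; only the volatile check output is dropped.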

Since there is a workaround for your case, I'm thinking I'll leave this open to see if others are running into this and check out what their HTTP health check handling looks like before moving forward with a larger change.

findkim • Oct 27 '20 17:10

I'm hitting the same issue. I'm targeting a Vault cluster in my HTTP check, and the response body differs depending on which node receives the request. I need the service to go into a degraded state if it can't reach Vault, and the check does that job perfectly, but I can see CTS constantly spinning and returning empty plans. I'm using the panos provider in my modules, and it's constantly hitting the devices with empty commits.

danieleva • Aug 18 '21 15:08

I thought the underlying issue was that the task is triggered when the health check returns a different body even though the service health status has not degraded. From a module's perspective it should be a no-op, but we are still queuing up a plan/apply. In the case of TFE/TFC integration, this could be a problem, since the operator would be notified even when it is a no-op. This seems like a two-tiered issue.

devarshishah3 • Aug 19 '21 23:08