
pagerduty_service_event_rule is unreliable, 404 Not Found during apply

Open idsvandermolen opened this issue 3 years ago • 6 comments

Terraform Version

Run terraform -v to show the version. If you are not running the latest version of Terraform, please upgrade because your issue may have already been fixed.

Terraform v1.0.11
on darwin_amd64
+ provider registry.terraform.io/hashicorp/null v3.1.0
+ provider registry.terraform.io/pagerduty/pagerduty v2.1.1

Affected Resource(s)

Please list the resources as a list, for example:

  • pagerduty_service
  • pagerduty_service_event_rule

If this issue appears to affect multiple resources, it may be an issue with Terraform's core, so please mention this.

Code

We have a service module with main.tf like this:

#
# Set up a service with some defaults like:
# - a relation to an escalation profile
# - a service_event_rule for critical events
# - a service_event_rule for warning events
# - a relation to a business service
#

resource "pagerduty_service" "service" {
  name                    = var.name
  description             = var.description
  auto_resolve_timeout    = "null"
  acknowledgement_timeout = 1800
  escalation_policy       = var.escalation_policy_id
  alert_creation          = "create_alerts_and_incidents"

  incident_urgency_rule {
    type    = "constant"
    urgency = "severity_based"
  }
}

resource "pagerduty_service_event_rule" "critical" {
  service  = pagerduty_service.service.id
  position = 0

  conditions {
    operator = "and"
    subconditions {
      operator = "equals"
      parameter {
        path  = "severity"
        value = "critical"
      }
    }
  }

  actions {
    priority {
      value = data.pagerduty_priority.p2.id
    }
  }
}

resource "pagerduty_service_event_rule" "warning" {
  service  = pagerduty_service.service.id
  position = 1

  conditions {
    operator = "and"
    subconditions {
      operator = "equals"
      parameter {
        path  = "severity"
        value = "warning"
      }
    }
  }

  actions {
    priority {
      value = data.pagerduty_priority.p4.id
    }
    suppress {
      value = true
    }
  }
}

resource "pagerduty_service_dependency" "business-service" {
  dependency {
    dependent_service {
      id   = var.business_service_id
      type = "business_service"
    }
    supporting_service {
      id   = pagerduty_service.service.id
      type = "service"
    }
  }
}

And then call the module like this:

module "translations-service" {
  source               = "./modules/service"
  name                 = "Translations"
  description          = "Translations MicroService"
  escalation_policy_id = pagerduty_escalation_policy.dummy.id
  business_service_id  = pagerduty_business_service.capabilities.id
}
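
For context, the module's inputs are plain string variables. A minimal sketch of modules/service/variables.tf and the priority data sources referenced above (types, descriptions and the P2/P4 priority names are assumptions; the actual file may differ):

variable "name" {
  type        = string
  description = "Name of the PagerDuty service"
}

variable "description" {
  type        = string
  description = "Description of the PagerDuty service"
}

variable "escalation_policy_id" {
  type        = string
  description = "ID of the escalation policy attached to the service"
}

variable "business_service_id" {
  type        = string
  description = "ID of the business service that this service supports"
}

# Priority data sources used by the event rules
# (assumes priorities named P2 and P4 exist in the account)
data "pagerduty_priority" "p2" {
  name = "P2"
}

data "pagerduty_priority" "p4" {
  name = "P4"
}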

Expected Behavior

When deploying larger changes we expect them to succeed.

Actual Behavior

During the first apply we see messages about resources still being created, then it fails with a 404 Not Found:

module.translation-keys-service.pagerduty_service_event_rule.critical: Still creating... [2m0s elapsed]
module.translations-service.pagerduty_service_event_rule.critical: Still creating... [2m0s elapsed]

Error: GET API call to https://api.eu.pagerduty.com/services/PTKNIE0/rules/9bd3764b-8aa3-430c-a96c-6e709a4fbedc failed 404 Not Found. Code: 0, Errors: <nil>, Message: Rule Not Found

  with module.translations-service.pagerduty_service_event_rule.critical,
  on modules/service/main.tf line 23, in resource "pagerduty_service_event_rule" "critical":
  23: resource "pagerduty_service_event_rule" "critical" {


Error: GET API call to https://api.eu.pagerduty.com/services/PB5JK5D/rules/43a5417d-efb3-4224-8e4a-44167c35ee41 failed 404 Not Found. Code: 0, Errors: <nil>, Message: Rule Not Found

  with module.translation-keys-service.pagerduty_service_event_rule.critical,
  on modules/service/main.tf line 23, in resource "pagerduty_service_event_rule" "critical":
  23: resource "pagerduty_service_event_rule" "critical" {


Error: Error updating service event rule 377c56ed-bf5b-4eea-b1ac-35c19f3deebd position 0 needs to be 1

  with module.translations-service.pagerduty_service_event_rule.warning,
  on modules/service/main.tf line 45, in resource "pagerduty_service_event_rule" "warning":
  45: resource "pagerduty_service_event_rule" "warning" {


Error: Error updating service event rule db05de09-a9fb-4292-8561-3b0f5a03e940 position 0 needs to be 1

  with module.translation-keys-service.pagerduty_service_event_rule.warning,
  on modules/service/main.tf line 45, in resource "pagerduty_service_event_rule" "warning":
  45: resource "pagerduty_service_event_rule" "warning" {

Error: Process completed with exit code 1.

If you then run terraform plan, the refresh fails with the same 404 Not Found error. The workaround is to delete these "not found" resources from the Terraform state with terraform state rm <resource> and try again.
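
For illustration, the cleanup looks roughly like this, using the resource addresses from the errors above (your addresses will differ):

terraform state rm 'module.translations-service.pagerduty_service_event_rule.critical'
terraform state rm 'module.translation-keys-service.pagerduty_service_event_rule.critical'
terraform apply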

Note: we create the pagerduty_service and the accompanying service_event_rules in the same terraform apply. There might be a race condition where the Terraform PagerDuty provider does not handle the service => service_event_rule dependency correctly.

Steps to Reproduce

Please list the steps required to reproduce the issue, for example:

  1. terraform apply

Important Factoids

Is there anything atypical about your accounts that we should know about? For example: Running in EC2 Classic? Custom version of OpenStack? Tight ACLs?

References

Are there any other GitHub issues (open or closed) or Pull Requests that should be linked here? For example:

  • might be related to https://github.com/PagerDuty/terraform-provider-pagerduty/issues/426
  • also a pending PR https://github.com/PagerDuty/terraform-provider-pagerduty/pull/380
  • and https://github.com/PagerDuty/terraform-provider-pagerduty/pull/274

idsvandermolen avatar Dec 03 '21 09:12 idsvandermolen

I noticed the same thing, and the position was not really stable either. One workaround I found was to make each service event rule depend on the previous one so that they are created in the correct order.

jjm avatar Dec 07 '21 16:12 jjm

I noticed the same thing, and the position was not really stable either. One workaround I found was to make each service event rule depend on the previous one so that they are created in the correct order.

Thanks, that would help make sure you don't have to apply multiple times to get the order of the rules correct. However, it doesn't solve the race condition where Terraform thinks a resource has been deployed and adds it to the state file, while the PagerDuty API doesn't know about the resource and returns a 404.

idsvandermolen avatar Dec 08 '21 08:12 idsvandermolen

@idsvandermolen very true, although a recent release may have fixed the race condition, as I'm no longer seeing these errors and the PR you linked to has been merged.

jjm avatar Dec 09 '21 08:12 jjm

@jjm Ran into this issue as well, and am still seeing it with the latest provider release (v2.2.0).

Running terraform apply with TF_LOG=debug shows that the API acknowledges resource creation, but further API calls result in Resource Not Found errors.

In my opinion, this is more a PagerDuty API bug than a Terraform provider one. I've opened a case with PagerDuty support, providing the full debug log and request IDs so that they can investigate this issue.
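
For reference, one way to capture that log, using Terraform's standard TF_LOG and TF_LOG_PATH environment variables:

TF_LOG=debug TF_LOG_PATH=./terraform-debug.log terraform apply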

jbfavre avatar Dec 10 '21 11:12 jbfavre

:wave: Got an answer from PagerDuty support:

We've looked into this and determined that the bug results from Terraform attempting to process the event_rule calls concurrently. We are looking into changes to either the API protocol or the Terraform integration code still.

BUT, we do have a work around that should work: simply disabling parallel processing of commands in Terraform.

This is controlled by the variable parallelism, which defaults to the value of 10. You can disable concurrent processing by setting it to 1.

parallelism = 1

(If you'd like you can see more at "Walking the Graph" in Terraform's documentation here.)
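
Note that in practice this maps to the -parallelism CLI flag rather than a configuration variable, e.g.:

terraform apply -parallelism=1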

I'm currently setting up the workaround. Will keep this issue updated

jbfavre avatar Jan 05 '22 16:01 jbfavre

Hello! I have connected with the Engineering team and can confirm that they recommended slowing down the requests for creating rules, as @jbfavre mentioned above.

An approach I just tested with the code above is to add a depends_on field to the warning rule. That way the creation of that rule waits for the critical rule to be created before beginning its request. The benefit of this approach over setting parallelism is that you're only slowing down rule creation and not the whole terraform apply process.

Here's an example of using the depends_on field with the code above.

resource "pagerduty_service_event_rule" "warning" {
  service  = pagerduty_service.service.id
  position = 1
  depends_on = [
    # ensure the critical rule exists before this rule is created
    pagerduty_service_event_rule.critical
  ]
...

stmcallister avatar Jan 12 '22 18:01 stmcallister

Service event rules are being deprecated and replaced by Event Orchestration.

gsreynolds avatar Jun 28 '23 14:06 gsreynolds