terraform-provider-multispace
`context deadline exceeded` while triggered run is still queued
When a run is enqueued for a long time because the available workers are tied up, the multispace run errors with `context deadline exceeded`. I've noticed this specifically in destroy runs. A custom timeout has been set, but it doesn't seem to have any effect for destroy runs (the same issue happens on create, but there the error occurs after the configured timeout, as expected).
Terraform Version
- Terraform 1.0.8, 1.0.9
- multispace 0.1.0
Affected Resource(s)
- `multispace_run`
Terraform Configuration Files
resource "tfe_workspace" "app" {
for_each = local.apps
name = "app-${each.key}-${var.aws_region}-${var.environment}"
description = "Terraform configuration for app-${each.key}"
organization = var.tfe_organization_name
auto_apply = true
queue_all_runs = false
terraform_version = var.terraform_version
working_directory = "environments/${var.aws_region}/${var.environment}/apps/${each.key}"
trigger_prefixes = ["modules", "shared/app"]
tag_names = ["app", var.environment]
}
resource "tfe_variable" "environment" {
for_each = tfe_workspace.app
key = "environment"
category = "terraform"
value = var.environment
workspace_id = each.value.id
}
resource "multispace_run" "run" {
for_each = tfe_workspace.app
organization = var.tfe_organization_name
workspace = each.value.name
timeouts {
create = "1h"
delete = "1h"
}
depends_on = [
# wait for all vars to be set before triggering run
tfe_variable.environment,
]
}
Debug Output
https://gist.github.com/pedroslopez/fffcbb4f1786246ddea8d84dacfebac5 (gist from a different workspace where I was able to reproduce the issue).
Expected Behavior
On destroy, the `multispace_run` should have waited up to the configured delete timeout while the related run was still queued; ideally, it should keep waiting for as long as the run is still queued.
Actual Behavior
After 15 minutes, the run failed with `context deadline exceeded`. The run triggered by `multispace_run` eventually ran once workers became available, but by then the deadline error had already occurred.
Steps to Reproduce
This can easily be reproduced in a free Terraform Cloud organization where there are not enough workers to process the triggered run. Just have the `multispace_run` trigger a destroy run and see that it only waits up to 15 minutes before failing with `context deadline exceeded`.
Important Factoids
Pretty standard Terraform Cloud for Business organization, but we only have 3 workers, so we run into this issue whenever multiple workspaces whose resources take a long time to clean up are destroyed at once.
Hmmm, I would've thought that the Terraform SDK handles that `timeouts` block for me, since we're just using the context provided directly by the SDK. I'll have to do some digging!
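
For context, this is roughly how that wiring looks in a provider built on terraform-plugin-sdk/v2. This is a minimal sketch with illustrative names, not the actual multispace code: the SDK derives the deadline of the context it hands to the CRUD functions from the resource's declared timeouts.

```go
package provider

import (
	"context"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Sketch: a resource that declares configurable create/delete timeouts.
// The SDK applies these as the deadline on the ctx passed to the
// *Context CRUD functions below.
func resourceRunSketch() *schema.Resource {
	return &schema.Resource{
		CreateContext: resourceRunSketchCreate,
		DeleteContext: resourceRunSketchDelete,
		Timeouts: &schema.ResourceTimeout{
			Create: schema.DefaultTimeout(15 * time.Minute),
			Delete: schema.DefaultTimeout(15 * time.Minute),
		},
	}
}

func resourceRunSketchCreate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// Poll the run here; ctx carries the create timeout as its deadline.
	return nil
}

func resourceRunSketchDelete(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// Poll the destroy run here; ctx carries the delete timeout as its deadline.
	return nil
}
```

Any polling loop inside those functions that respects ctx will abort with `context deadline exceeded` when the deadline fires, which matches the error reported above.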
Oh! So this is interesting. After seeing these errors pop up, we added the `timeouts` block as shown in the sample config above, but applying those changes didn't actually set the timeouts in the state file. New runs created after we set up the timeouts do have them in the state file. I'm assuming the timeouts need to be in the state file so that the right value is used when the resources are destroyed.
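
That would be consistent with how the SDK resolves timeouts on destroy: there is no configuration left to read, so the effective value has to come from what was recorded in state. A minimal sketch of the lookup, again assuming terraform-plugin-sdk/v2 and an illustrative function name:

```go
package provider

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Sketch: inside a delete function, d.Timeout resolves the delete timeout
// for this instance. On destroy the value comes from what was recorded in
// state, so an instance created before the timeouts block was added would
// keep the old default.
func resourceRunSketchDestroy(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	timeout := d.Timeout(schema.TimeoutDelete)
	_ = timeout // a real implementation would poll the run within this window
	return nil
}
```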
The same can be reproduced with a simple usage of the `multispace_run` resource: setting a timeout after creation, or updating the timeout value, has no effect (and `terraform plan` shows no changes). I'm not sure if this is specific to this provider or something at a deeper level, though.
I do see that on resource update (https://github.com/mitchellh/terraform-provider-multispace/blob/main/internal/provider/resource_run.go#L106-L109) `nil` is simply returned and nothing else is done. Maybe something needs to be done there to update the timeouts properly?
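
If so, the update path is presumably a no-op along these lines (a sketch of the pattern, with an illustrative name, not the exact linked code):

```go
package provider

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Sketch of a no-op update: returning nil reports success without writing
// anything back to state, which would explain why a changed timeouts block
// is never persisted for existing instances.
func resourceRunSketchUpdate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	return nil
}
```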
I don't know either. I think, at least in part, this might be worth asking about in the Terraform core repo as well. I'll do some research here too, but it might be useful to have the two threads going in case there is a core (or core SDK) issue.