terraform-plugin-sdk
terraform-plugin-sdk copied to clipboard
WaitForState() leaves Refresh() running in the background on timeout
SDK version
v1.13.1-1
Relevant provider source code
Taken from: https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_iam_role.go#L152
var createResp *iam.CreateRoleOutput
err := resource.Retry(30*time.Second, func() *resource.RetryError {
var err error
createResp, err = iamconn.CreateRole(request) <-- Has internally a retry loop, can block more then 30 seconds
// IAM users (referenced in Principal field of assume policy)
// can take ~30 seconds to propagate in AWS
if isAWSErr(err, "MalformedPolicyDocument", "Invalid principal in policy") {
return resource.RetryableError(err)
}
return resource.NonRetryableError(err)
})
if isResourceTimeoutError(err) { <-- Goroutine started in Retry (WaitForState) can still be running
createResp, err = iamconn.CreateRole(request) <-- Issues another blocking CreateRole
}
Debug Output
[DEBUG] [aws-sdk-go] DEBUG: Request iam/CreateRole Details: [DEBUG] [aws-sdk-go] DEBUG: Send Request iam/CreateRole failed, attempt 0/25, error RequestError: send request failed [DEBUG] [aws-sdk-go] DEBUG: Retrying Request iam/CreateRole, attempt 1 [DEBUG] [aws-sdk-go] DEBUG: Request iam/CreateRole Details: [WARN] WaitForState timeout after 30s [WARN] WaitForState starting 30s refresh grace period [DEBUG] [aws-sdk-go] DEBUG: Send Request iam/CreateRole failed, attempt 1/25, error RequestError: send request failed [DEBUG] [aws-sdk-go] DEBUG: Retrying Request iam/CreateRole, attempt 2 [DEBUG] [aws-sdk-go] DEBUG: Request iam/CreateRole Details: [ERROR] WaitForState exceeded refresh grace period [DEBUG] [aws-sdk-go] DEBUG: Request iam/CreateRole Details:
Expected Behavior
resource.Retry() blocks until callback finishes - even on timeout
Actual Behavior
Callback is still running when timeout happens. For example, AWS provider issues another, parallel, CreateRole(). In many cases this results double creation attempt, and eventually a failure in the plugin.
Error: Error creating IAM Role hello-world-ssm_role: EntityAlreadyExists: Role with name hello-world-ssm_role already exists. status code: 409, request id: removed
on main.tf line 18, in resource "aws_iam_role" "ssm_role": 18: resource "aws_iam_role" "ssm_role"
References
We have been running a crude patch ( #529 ) in production for a few weeks with good results.
I believe I'm seeing this issue as well in the context of the AWS provider (v3.75.2, which I understand uses terraform-plugin-sdk v2.10.1) creating an EC2 instance:
07:01:16.405
2022-07-22T14:00:44.338Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:00:44 [DEBUG] Waiting for state to become: [success]: timestamp=2022-07-22T14:00:44.337Z
07:01:16.522
module.ec2_dev.aws_instance.default[0]: Still creating... [10s elapsed]
07:01:16.582
module.ec2_dev.aws_instance.default[0]: Still creating... [20s elapsed]
07:01:16.644
module.ec2_dev.aws_instance.default[0]: Still creating... [30s elapsed]
07:01:23.998
module.ec2_dev.aws_instance.default[0]: Still creating... [40s elapsed]
07:01:33.999
module.ec2_dev.aws_instance.default[0]: Still creating... [50s elapsed]
07:01:44.000
module.ec2_dev.aws_instance.default[0]: Still creating... [1m0s elapsed]
07:01:54.000
module.ec2_dev.aws_instance.default[0]: Still creating... [1m10s elapsed]
07:02:04.001
module.ec2_dev.aws_instance.default[0]: Still creating... [1m20s elapsed]
07:02:14.002
module.ec2_dev.aws_instance.default[0]: Still creating... [1m30s elapsed]
07:02:24.003
module.ec2_dev.aws_instance.default[0]: Still creating... [1m40s elapsed]
07:02:34.004
module.ec2_dev.aws_instance.default[0]: Still creating... [1m50s elapsed]
07:02:44.004
module.ec2_dev.aws_instance.default[0]: Still creating... [2m0s elapsed]
07:02:44.339
2022-07-22T14:02:44.339Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:02:44 [WARN] WaitForState timeout after 2m0s: timestamp=2022-07-22T14:02:44.338Z
07:02:44.417
2022-07-22T14:02:44.339Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:02:44 [WARN] WaitForState starting 30s refresh grace period: timestamp=2022-07-22T14:02:44.339Z
07:02:54.006
module.ec2_dev.aws_instance.default[0]: Still creating... [2m10s elapsed]
07:03:04.006
module.ec2_dev.aws_instance.default[0]: Still creating... [2m20s elapsed]
07:03:14.008
module.ec2_dev.aws_instance.default[0]: Still creating... [2m30s elapsed]
07:03:14.339
2022-07-22T14:03:14.339Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:03:14 [ERROR] WaitForState exceeded refresh grace period: timestamp=2022-07-22T14:03:14.339Z
07:03:15.789
2022-07-22T14:03:15.789Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:03:15 [INFO] Instance ID: ELIDED: timestamp=2022-07-22T14:03:15.789Z
07:03:15.850
2022-07-22T14:03:15.789Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:03:15 [DEBUG] Waiting for instance (ELIDED) to become running: timestamp=2022-07-22T14:03:15.789Z
07:03:15.966
2022-07-22T14:03:15.789Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:03:15 [DEBUG] Waiting for state to become: [running]: timestamp=2022-07-22T14:03:15.789Z
07:03:24.010
module.ec2_dev.aws_instance.default[0]: Still creating... [2m40s elapsed]
07:03:27.194
module.ec2_dev.aws_instance.default[0]: Creation complete after 2m45s [id=ELIDED]
From this Terraform run we saw that two EC2 instances were created in our AWS console. 😮 I think that indicates that an EC2 instance creation request was still running when the WaitForState timeout occurred and then a second EC2 instance creation request was submitted, and then both of those requests ultimately succeeded.
I'm hoping that https://github.com/hashicorp/terraform-plugin-sdk/pull/828 (which I understand copies https://github.com/hashicorp/terraform-plugin-sdk/pull/529 but fixes the unit tests) can be reviewed soon.