terraform-plugin-sdk icon indicating copy to clipboard operation
terraform-plugin-sdk copied to clipboard

WaitForState() leaves Refresh() running in the background on timeout

Open enool opened this issue 5 years ago • 1 comments

SDK version

v1.13.1-1

Relevant provider source code

Taken from: https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_iam_role.go#L152

var createResp *iam.CreateRoleOutput
err := resource.Retry(30*time.Second, func() *resource.RetryError {
    var err error
    createResp, err = iamconn.CreateRole(request)                <-- Has internally a retry loop, can block more then 30 seconds
    // IAM users (referenced in Principal field of assume policy)
    // can take ~30 seconds to propagate in AWS
    if isAWSErr(err, "MalformedPolicyDocument", "Invalid principal in policy") {
        return resource.RetryableError(err)
    }
    return resource.NonRetryableError(err)
})
if isResourceTimeoutError(err) {                                <-- Goroutine started in Retry (WaitForState) can still be running
    createResp, err = iamconn.CreateRole(request)               <-- Issues another blocking CreateRole
}

Debug Output

[DEBUG] [aws-sdk-go] DEBUG: Request iam/CreateRole Details: [DEBUG] [aws-sdk-go] DEBUG: Send Request iam/CreateRole failed, attempt 0/25, error RequestError: send request failed [DEBUG] [aws-sdk-go] DEBUG: Retrying Request iam/CreateRole, attempt 1 [DEBUG] [aws-sdk-go] DEBUG: Request iam/CreateRole Details: [WARN] WaitForState timeout after 30s [WARN] WaitForState starting 30s refresh grace period [DEBUG] [aws-sdk-go] DEBUG: Send Request iam/CreateRole failed, attempt 1/25, error RequestError: send request failed [DEBUG] [aws-sdk-go] DEBUG: Retrying Request iam/CreateRole, attempt 2 [DEBUG] [aws-sdk-go] DEBUG: Request iam/CreateRole Details: [ERROR] WaitForState exceeded refresh grace period [DEBUG] [aws-sdk-go] DEBUG: Request iam/CreateRole Details:

Expected Behavior

resource.Retry() blocks until callback finishes - even on timeout

Actual Behavior

Callback is still running when timeout happens. For example, AWS provider issues another, parallel, CreateRole(). In many cases this results double creation attempt, and eventually a failure in the plugin.

Error: Error creating IAM Role hello-world-ssm_role: EntityAlreadyExists: Role with name hello-world-ssm_role already exists. status code: 409, request id: removed

on main.tf line 18, in resource "aws_iam_role" "ssm_role": 18: resource "aws_iam_role" "ssm_role"

References

We have been running a crude patch ( #529 ) in production for a few weeks with good results.

enool avatar Aug 11 '20 08:08 enool

I believe I'm seeing this issue as well in the context of the AWS provider (v3.75.2, which I understand uses terraform-plugin-sdk v2.10.1) creating an EC2 instance:

07:01:16.405
2022-07-22T14:00:44.338Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:00:44 [DEBUG] Waiting for state to become: [success]: timestamp=2022-07-22T14:00:44.337Z
07:01:16.522
module.ec2_dev.aws_instance.default[0]: Still creating... [10s elapsed]
07:01:16.582
module.ec2_dev.aws_instance.default[0]: Still creating... [20s elapsed]
07:01:16.644
module.ec2_dev.aws_instance.default[0]: Still creating... [30s elapsed]
07:01:23.998
module.ec2_dev.aws_instance.default[0]: Still creating... [40s elapsed]
07:01:33.999
module.ec2_dev.aws_instance.default[0]: Still creating... [50s elapsed]
07:01:44.000
module.ec2_dev.aws_instance.default[0]: Still creating... [1m0s elapsed]
07:01:54.000
module.ec2_dev.aws_instance.default[0]: Still creating... [1m10s elapsed]
07:02:04.001
module.ec2_dev.aws_instance.default[0]: Still creating... [1m20s elapsed]
07:02:14.002
module.ec2_dev.aws_instance.default[0]: Still creating... [1m30s elapsed]
07:02:24.003
module.ec2_dev.aws_instance.default[0]: Still creating... [1m40s elapsed]
07:02:34.004
module.ec2_dev.aws_instance.default[0]: Still creating... [1m50s elapsed]
07:02:44.004
module.ec2_dev.aws_instance.default[0]: Still creating... [2m0s elapsed]
07:02:44.339
2022-07-22T14:02:44.339Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:02:44 [WARN] WaitForState timeout after 2m0s: timestamp=2022-07-22T14:02:44.338Z
07:02:44.417
2022-07-22T14:02:44.339Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:02:44 [WARN] WaitForState starting 30s refresh grace period: timestamp=2022-07-22T14:02:44.339Z
07:02:54.006
module.ec2_dev.aws_instance.default[0]: Still creating... [2m10s elapsed]
07:03:04.006
module.ec2_dev.aws_instance.default[0]: Still creating... [2m20s elapsed]
07:03:14.008
module.ec2_dev.aws_instance.default[0]: Still creating... [2m30s elapsed]
07:03:14.339
2022-07-22T14:03:14.339Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:03:14 [ERROR] WaitForState exceeded refresh grace period: timestamp=2022-07-22T14:03:14.339Z
07:03:15.789
2022-07-22T14:03:15.789Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:03:15 [INFO] Instance ID: ELIDED: timestamp=2022-07-22T14:03:15.789Z
07:03:15.850
2022-07-22T14:03:15.789Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:03:15 [DEBUG] Waiting for instance (ELIDED) to become running: timestamp=2022-07-22T14:03:15.789Z
07:03:15.966
2022-07-22T14:03:15.789Z [INFO] provider.terraform-provider-aws_v3.75.2_x5: 2022/07/22 14:03:15 [DEBUG] Waiting for state to become: [running]: timestamp=2022-07-22T14:03:15.789Z
07:03:24.010
module.ec2_dev.aws_instance.default[0]: Still creating... [2m40s elapsed]
07:03:27.194
module.ec2_dev.aws_instance.default[0]: Creation complete after 2m45s [id=ELIDED]

From this Terraform run we saw that two EC2 instances were created in our AWS console. 😮 I think that indicates that an EC2 instance creation request was still running when the WaitForState timeout occurred and then a second EC2 instance creation request was submitted, and then both of those requests ultimately succeeded.

I'm hoping that https://github.com/hashicorp/terraform-plugin-sdk/pull/828 (which I understand copies https://github.com/hashicorp/terraform-plugin-sdk/pull/529 but fixes the unit tests) can be reviewed soon.

kconley-sq avatar Jul 25 '22 21:07 kconley-sq