terraform-provider-aws

[Bug]: Throttling errors after migrating services to `aws-sdk-go-v2`

Open mlynch1985 opened this issue 2 years ago • 22 comments

Terraform Core Version

1.6.4

AWS Provider Version

5.28.0

Affected Resource(s)

aws_controltower_control

Expected Behavior

The Terraform plan should complete the refresh process successfully without error and allow for the apply stage to execute.

Actual Behavior

The refresh was interrupted due to the throttling errors preventing the plan/apply from completing.

Relevant Error/Panic Output Snippet

Error: reading ControlTower Control (arn:aws:organizations::000000000000:ou/o-abcdefghijk/ou-abcd-efghijklmno,arn:aws:controltower:us-east-1::control/BKEEVLXJOIZI): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested
│
│   with module.ct_managed_controls.aws_controltower_control.vpc["BKEEVLXJOIZI/ou-abcd-efghijklmno"],
│   on modules\ct_managed_controls\main.tf line 122, in resource "aws_controltower_control" "api_gateway":
│  122: resource "aws_controltower_control" "api_gateway" {

Terraform Configuration Files

terraform {
  required_version = ">= 1.6.0, < 2.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region  = "us-east-1"
}

data "aws_region" "current" {}
data "aws_organizations_organization" "this" {}

data "aws_organizations_organizational_units" "level_one" {
  parent_id = data.aws_organizations_organization.this.roots[0].id
}

data "aws_organizations_organizational_units" "level_two" {
  for_each  = local.level_one_ous
  parent_id = each.value.id
}

#  ...  #

locals {
  level_one_ous = { for ou in data.aws_organizations_organizational_units.level_one.children : ou.name => ou }

  level_two_ous = merge([
    for parent_name, ou in data.aws_organizations_organizational_units.level_two :
    { for child in ou.children : "${parent_name}/${child.name}" => child }
  ]...)

  #  ...  #

  all_ous = merge(local.level_one_ous, local.level_two_ous, local.level_three_ous, local.level_four_ous, local.level_five_ous)
}

locals {
  api_gateway = {
    # [SH.APIGateway.1] API Gateway REST and WebSocket API execution logging should be enabled
    "OOTDCUSIKIZZ" = {
      "${local.all_ous["Deployments"].id}"    = local.all_ous["Deployments"].arn,
      "${local.all_ous["Infrastructure"].id}" = local.all_ous["Infrastructure"].arn,
      "${local.all_ous["Sandbox"].id}"        = local.all_ous["Sandbox"].arn,
      "${local.all_ous["Workloads"].id}"      = local.all_ous["Workloads"].arn
    }

    #  ...  #
  }

  #  ...  #
}

resource "aws_controltower_control" "api_gateway" {
  for_each = merge([for control, ou_map in local.api_gateway :
    { for ou_id, ou_arn in ou_map : "${control}/${ou_id}" => { "control" = control, "ou_arn" = ou_arn } }
  ]...)

  control_identifier = "arn:aws:controltower:${data.aws_region.current.name}::control/${each.value.control}"
  target_identifier  = each.value.ou_arn
}

Steps to Reproduce

Set up AWS Control Tower and copy the above code into main.tf. You will need to create the OU structure and enable the Control Tower control-to-OU associations first, because the throttling seems to occur only on refreshes after the initial apply.

Debug Output

No response

Panic Output

No response

Important Factoids

After upgrading to AWS provider v5.28.0 and attempting to execute a plan/apply containing 10+ instances of the "aws_controltower_control" resource, we received throttling errors. Adding a version constraint that downgrades the AWS provider to <5.28.0 resolves the issue. Alternatively, we can pass the -refresh=false switch to complete the apply successfully.
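A minimal sketch of the downgrade workaround described above. The lower bound of 5.0 is an assumption carried over from the original `~> 5.0` constraint; only the `< 5.28.0` upper bound comes from the report.

```hcl
terraform {
  required_version = ">= 1.6.0, < 2.0.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Workaround: stay below 5.28.0, the first version where the errors were reported.
      # Assumption: any earlier 5.x release is acceptable for this configuration.
      version = ">= 5.0, < 5.28.0"
    }
  }
}
```

If a newer provider version is already recorded in the dependency lock file, running `terraform init -upgrade` is needed before the new constraint takes effect.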

References

[Enhancement]: Migrate controltower service to aws-sdk-go-v2

Would you like to implement a fix?

None

mlynch1985 avatar Nov 30 '23 22:11 mlynch1985

Community Note

Voting for Prioritization

  • Please vote on this issue by adding a 👍 reaction to the original post to help the community and maintainers prioritize this request.
  • Please see our prioritization guide for information on how we prioritize.
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.

Volunteering to Work on This Issue

  • If you are interested in working on this issue, please leave a comment.
  • If this would be your first contribution, please review the contribution guide.

github-actions[bot] avatar Nov 30 '23 22:11 github-actions[bot]

This seems broader than controltower. I also have this for Athena:

│ Error: reading Athena WorkGroup (...): operation error Athena: GetWorkGroup, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested

But also

│ Error: listing tags for Athena WorkGroup (arn:aws:athena:...): operation error Athena: ListTagsForResource, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested

There seems to be a related open issue at the aws-sdk-go-v2 repo: https://github.com/aws/aws-sdk-go-v2/issues/1665

I'm on aws provider 5.29.0

RobbertDM avatar Dec 06 '23 15:12 RobbertDM

Relates https://github.com/hashicorp/terraform-provider-aws/issues/34409.

ewbankkit avatar Jan 16 '24 13:01 ewbankkit

We get a similar error with GetInlinePolicyForPermissionSet.

Error: reading SSO Permission Set Inline Policy (...): operation error SSO Admin: GetInlinePolicyForPermissionSet, failed to get rate limit token, retry quota exceeded, 3 available, 5 requested

The error exists on aws provider version 5.29.0 and above.

terhirissa avatar Jan 22 '24 07:01 terhirissa

We are also seeing this with CodePipeline Webhook resources on any version above 5.31.0. If we pin our provider version to 5.31.0 it's fine but 5.32.1 and 5.33.0 result in plan failures:

Error: reading CodePipeline Webhook (arn:aws:codepipeline:ca-central-1::webhook:example): operation error CodePipeline: ListWebhooks, failed to get rate limit token, retry quota exceeded, 3 available, 5 requested

neogibson avatar Jan 23 '24 20:01 neogibson

https://github.com/aws/aws-sdk-go-v2/blob/e6eb2ad83b1dad3e9ff7cb22914a5cb70db2c797/aws/retry/standard.go#L231-L246

ewbankkit avatar Jan 25 '24 21:01 ewbankkit

Same: We cannot use this for managing controls due to a "ThrottlingException" resulting from making the API call to "ListEnabledControls".

I just got off the phone with AWS Control Tower folks, who suggested updating the retry logic. It would be great to see a fix for this.

shawnl-kb4 avatar Jan 30 '24 20:01 shawnl-kb4

My thinking on this is to add new provider configuration attribute(s) that customize the AWS SDK for Go v2 retryer (https://github.com/aws/aws-sdk-go-v2/blob/4fce0fdec6c41822255f4c3ec17aa46a9b6e2ac3/aws/retry/standard.go#L160-L171), in particular a RateLimiter with a configurable token bucket size (different from the default of 500).

ewbankkit avatar Jan 30 '24 22:01 ewbankkit

We are also facing crippling throttling on method ListTagsForResource for aws_config_config_rule resources.

miguelaferreira avatar Jan 31 '24 12:01 miguelaferreira

@mlynch1985 (et al.) Could you please try setting retry_mode = "adaptive" in your provider configuration and see if this helps?
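For reference, a sketch of what that setting looks like in a provider block. The region is taken from the configuration in the original report and is only illustrative.

```hcl
provider "aws" {
  region = "us-east-1"

  # Switch the SDK retryer from the default mode to adaptive mode.
  retry_mode = "adaptive"
}
```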

ewbankkit avatar Feb 05 '24 12:02 ewbankkit

@ewbankkit Thanks for the suggestion, setting that on the provider did work in my case, a plan was generated without those rate limit errors. However, on one of our workspaces that consistently plans in ~3 minutes on provider version 5.31.0, this setting seems to have increased the plan time to around 9 minutes on the latest provider version 5.35.0.

neogibson avatar Feb 05 '24 18:02 neogibson

@neogibson Thanks for looking into this. My guess is that we could fine-tune some of the options to get the behavior closer to AWS SDK for Go v1. The maintainers have this on the agenda to discuss for this week's tech debt review.

ewbankkit avatar Feb 05 '24 19:02 ewbankkit

@mlynch1985 (et al.) Could you please try setting retry_mode = "adaptive" in your provider configuration and see if this helps?

I tested with this option and unfortunately the error is still present.

Error: reading ControlTower Control (arn:aws:organizations::012345678912:ou/o-abcdefghij/ou-abcd-abcdefgh,arn:aws:controltower:us-west-2::control/PBGUIXCOFNGC): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested

mlynch1985 avatar Feb 12 '24 21:02 mlynch1985

https://github.com/hashicorp/aws-sdk-go-base/pull/918, incorporated into the Terraform AWS Provider via https://github.com/hashicorp/terraform-provider-aws/pull/35817, should address the "failed to get rate limit token, retry quota exceeded" errors. As we have not been able to reproduce the throttling errors in our testing, we cannot guarantee that all error cases have been dealt with, so I will leave this issue open for comments. The fix will be available in Terraform AWS Provider v5.37.0, likely released tomorrow.

ewbankkit avatar Feb 14 '24 20:02 ewbankkit

Unfortunately we are still seeing this issue even with AWS provider version v5.37.0, with our project which manages controls and their mappings across our organization.

Terraform configuration:

╰─ terraform version
Terraform v1.6.6
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v5.37.0
+ provider registry.terraform.io/hashicorp/local v2.4.1
terraform plan

...

Planning failed. Terraform encountered an error while generating this plan.

╷
│ Error: reading ControlTower Control (arn:aws:organizations::1234567890123:ou/o-fsdiovxxxx/ou-o2bx-xxxxxxxxx,arn:aws:controltower:eu-central-1::control/AWS-GR_EC2_VOLUME_INUSE_CHECK): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 2 available, 5 requested
│ 
│   with aws_controltower_control.detective["Build_AWS-GR_EC2_VOLUME_INUSE_CHECK"],
│   on main.tf line 66, in resource "aws_controltower_control" "detective":
│   66: resource "aws_controltower_control" "detective" {
│ 
╵
╷
│ Error: reading ControlTower Control (arn:aws:organizations::1234567890123:ou/o-fsdiovxxxx/ou-o2bx-xxxxxxxxx,arn:aws:controltower:eu-central-1::control/AWS-GR_RDS_STORAGE_ENCRYPTED): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 4 available, 5 requested
│ 
│   with aws_controltower_control.detective["Workloads/Shared_AWS-GR_RDS_STORAGE_ENCRYPTED"],
│   on main.tf line 66, in resource "aws_controltower_control" "detective":
│   66: resource "aws_controltower_control" "detective" {
│ 
╵
╷
│ Error: reading ControlTower Control (arn:aws:organizations::1234567890123:ou/o-fsdiovxxxx/ou-o2bx-xxxxxxxxx,arn:aws:controltower:eu-central-1::control/AWS-GR_DETECT_CLOUDTRAIL_ENABLED_ON_MEMBER_ACCOUNTS): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 4 available, 5 requested
│ 
│   with aws_controltower_control.detective["DR_AWS-GR_DETECT_CLOUDTRAIL_ENABLED_ON_MEMBER_ACCOUNTS"],
│   on main.tf line 66, in resource "aws_controltower_control" "detective":
│   66: resource "aws_controltower_control" "detective" {
│ 

... and many more similar quota exceeded examples

An earlier comment reported that v5.31.0 didn't have this issue. It does for us and our project. We are pinned on v5.26.0 until a solution can be found.

jwh-exerp avatar Feb 16 '24 10:02 jwh-exerp

I retested today with TF v1.7.3 and AWS Provider v5.37.0 but still encountered the same errors. Reverting to v5.27.0 continues to be the workaround.

mlynch1985 avatar Feb 16 '24 14:02 mlynch1985

I retested with the 5.37.0 update today after experiencing errors with the 5.36.0 provider, then reverted to the 5.35.0 provider, as that throws no errors. We're seeing it in CloudWatch: ListTagsForResource and CloudWatch: DescribeAlarms, but only on very large states.

sixdaysandy avatar Feb 19 '24 10:02 sixdaysandy

Yeah we're also experiencing this for CodeBuild.

Edit: pinning to 5.27.0 as suggested by @mlynch1985 worked for us. Will test with setting retry_mode.

kieran-lowe avatar Feb 19 '24 12:02 kieran-lowe

FYI, I had some luck changing the provider configuration to include retry_mode = "adaptive" after the update to SDK v2. This resolved the throttling issues I was encountering w/ the Workspaces API.

dthvt avatar Feb 19 '24 20:02 dthvt

For the next pass at a solution, we will add the ability to be able to configure the token bucket capacity for the retry throttling rate limiter (e.g. https://github.com/aws/aws-sdk-go-v2/issues/1665#issuecomment-1469129967). This configured value will be used to initialize the capacity of every API client's token bucket.

ewbankkit avatar Feb 20 '24 19:02 ewbankkit

With the very soon to be released v5.38.0 of the Terraform AWS provider we have added a new provider-level configuration parameter token_bucket_rate_limiter_capacity:

provider "aws" {
  token_bucket_rate_limiter_capacity = 5000
}

which allows the capacity of the rate limiter token bucket to be set. The default is 500 tokens, so if you are experiencing throttling errors then please configure a larger value.

ewbankkit avatar Feb 22 '24 22:02 ewbankkit

I tested with the suggested value of 5000 above and still encountered the error. What is the downside of increasing this value? I don't want to set a ridiculously high number without understanding the potential risks. If it helps, I can set up a code dump so you can test with the same code as me.

mlynch1985 avatar Feb 23 '24 20:02 mlynch1985

@mlynch1985 No additional resources are consumed by increasing the value.

ewbankkit avatar Feb 26 '24 16:02 ewbankkit

@ewbankkit I had to set token_bucket_rate_limiter_capacity on my provider to 50,000 before it worked; with that change I was able to complete the plan/apply. I will close this issue now. Thank you!
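A sketch of the provider block with that larger value, assuming AWS provider v5.38.0 or later (where the parameter was introduced). The region is taken from the original configuration and is only illustrative. Note that HCL number literals take no thousands separator, so 50,000 is written as 50000.

```hcl
provider "aws" {
  region = "us-east-1"

  # Raise the retry rate limiter's token bucket capacity from the default of 500.
  # 50000 is the value that worked for this particular Control Tower workload;
  # other workloads may need a different value.
  token_bucket_rate_limiter_capacity = 50000
}
```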

mlynch1985 avatar Feb 28 '24 22:02 mlynch1985

We are also facing crippling throttling on method ListTagsForResource for aws_config_config_rule resources.

We had issues with this over the last few weeks, but today it has started to work again, which seems to coincide with updating the provider to v5.41.0.

The last provider version it worked with was v5.38.0. Since then I've been trying some of the suggested workarounds, e.g. setting retry_mode to adaptive and token_bucket_rate_limiter_capacity to a very large number, but neither helped. We'll keep an eye on it.

richgreen-moj avatar Mar 15 '24 10:03 richgreen-moj

Fixed for us in v5.42.0. It looks like this is the specific fix:

provider: Change the default AWS SDK for Go v2 API client RateLimiter to ratelimit.None so that services migrated to AWS SDK for Go v2 maintain behavioral compatibility with AWS SDK for Go v1 (https://github.com/hashicorp/terraform-provider-aws/issues/36467)
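If that change is indeed what resolves the errors, a version constraint along these lines would ensure the provider includes it. This is a sketch; the `< 6.0.0` upper bound is an assumption that mirrors the original `~> 5.0` constraint.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Require at least v5.42.0, where the default rate limiter was changed.
      # Assumption: staying within the 5.x major version is desired.
      version = ">= 5.42.0, < 6.0.0"
    }
  }
}
```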

dandelo avatar Mar 28 '24 16:03 dandelo

We are still facing some issues related to this. We noticed strange behavior where the TF plan freezes and halts while refreshing the state of several resources, then times out after 40 minutes, leaving the state locked. Has anyone faced something similar, even with the latest version? We notice this behavior on any version higher than 5.32.

AbAvramidis avatar Apr 04 '24 10:04 AbAvramidis

Thanks for fixing this!

neogibson avatar Apr 10 '24 17:04 neogibson

@AbAvramidis Do you know which services are exhibiting this behavior?

ewbankkit avatar Apr 10 '24 18:04 ewbankkit

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions[bot] avatar May 11 '24 02:05 github-actions[bot]