terraform-provider-aws
[Bug]: Throttling errors after migrating services to `aws-sdk-go-v2`
Terraform Core Version
1.6.4
AWS Provider Version
5.28.0
Affected Resource(s)
aws_controltower_control
Expected Behavior
The Terraform plan should complete the refresh process successfully without error and allow for the apply stage to execute.
Actual Behavior
The refresh was interrupted by throttling errors, which prevented the plan/apply from completing.
Relevant Error/Panic Output Snippet
Error: reading ControlTower Control (arn:aws:organizations::000000000000:ou/o-abcdefghijk/ou-abcd-efghijklmno,arn:aws:controltower:us-east-1::control/BKEEVLXJOIZI): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested
│
│ with module.ct_managed_controls.aws_controltower_control.vpc["BKEEVLXJOIZI/ou-abcd-efghijklmno"],
│ on modules\ct_managed_controls\main.tf line 122, in resource "aws_controltower_control" "api_gateway":
│ 122: resource "aws_controltower_control" "api_gateway" {
Terraform Configuration Files
terraform {
  required_version = ">= 1.6.0, < 2.0.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

data "aws_region" "current" {}
data "aws_organizations_organization" "this" {}
data "aws_organizations_organizational_units" "level_one" {
  parent_id = data.aws_organizations_organization.this.roots[0].id
}
data "aws_organizations_organizational_units" "level_two" {
  for_each  = local.level_one_ous
  parent_id = each.value.id
}
# ... #

locals {
  level_one_ous = { for ou in data.aws_organizations_organizational_units.level_one.children : ou.name => ou }
  level_two_ous = merge([
    for parent_name, ou in data.aws_organizations_organizational_units.level_two :
    { for child in ou.children : "${parent_name}/${child.name}" => child }
  ]...)
  # ... #
  all_ous = merge(local.level_one_ous, local.level_two_ous, local.level_three_ous, local.level_four_ous, local.level_five_ous)
}

locals {
  api_gateway = {
    # [SH.APIGateway.1] API Gateway REST and WebSocket API execution logging should be enabled
    "OOTDCUSIKIZZ" = {
      "${local.all_ous["Deployments"].id}"    = local.all_ous["Deployments"].arn,
      "${local.all_ous["Infrastructure"].id}" = local.all_ous["Infrastructure"].arn,
      "${local.all_ous["Sandbox"].id}"        = local.all_ous["Sandbox"].arn,
      "${local.all_ous["Workloads"].id}"      = local.all_ous["Workloads"].arn
    }
    # ... #
  }
  # ... #
}

resource "aws_controltower_control" "api_gateway" {
  for_each = merge([for control, ou_map in local.api_gateway :
    { for ou_id, ou_arn in ou_map : "${control}/${ou_id}" => { "control" = control, "ou_arn" = ou_arn } }
  ]...)
  control_identifier = "arn:aws:controltower:${data.aws_region.current.name}::control/${each.value.control}"
  target_identifier  = each.value.ou_arn
}
Steps to Reproduce
Set up AWS Control Tower and copy the above code into main.tf. You will need to create the OU structure and enable the CT control-to-OU associations, as the throttling seems to start only after the initial apply.
Debug Output
No response
Panic Output
No response
Important Factoids
After upgrading to AWS provider v5.28.0 and attempting to execute a plan/apply containing 10+ instances of the "aws_controltower_control" resource, we received throttling errors. When adding a constraint to the provider block to downgrade the AWS provider to <5.28.0 the issue is resolved. Alternatively we can pass in the -refresh=false switch to complete the apply successfully.
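The downgrade workaround described above could be expressed as an extra version constraint, roughly like the sketch below. It assumes the rest of the configuration is unchanged; the upper bound of 5.28.0 comes from this report, and other commenters found different last-good versions for other services.

```hcl
terraform {
  required_version = ">= 1.6.0, < 2.0.0"
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Stay on the 5.x line but exclude 5.28.0 and later, where the
      # throttling errors for aws_controltower_control were first observed.
      version = "~> 5.0, < 5.28.0"
    }
  }
}
```

After changing the constraint, `terraform init -upgrade` is typically needed so that the lock file selects a provider version that satisfies it.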
References
[Enhancement]: Migrate controltower service to aws-sdk-go-v2
Would you like to implement a fix?
None
This seems broader than controltower. I also have this for Athena:
│ Error: reading Athena WorkGroup (...): operation error Athena: GetWorkGroup, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested
But also
│ Error: listing tags for Athena WorkGroup (arn:aws:athena:...): operation error Athena: ListTagsForResource, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested
There seems to be a related open issue at the aws-sdk-go-v2 repo:
https://github.com/aws/aws-sdk-go-v2/issues/1665
I'm on aws provider 5.29.0
Relates https://github.com/hashicorp/terraform-provider-aws/issues/34409.
We get a similar error with GetInlinePolicyForPermissionSet.
Error: reading SSO Permission Set Inline Policy (...): operation error SSO Admin: GetInlinePolicyForPermissionSet, failed to get rate limit token, retry quota exceeded, 3 available, 5 requested
The error exists on aws provider version 5.29.0 and above.
We are also seeing this with CodePipeline Webhook resources on any version above 5.31.0. If we pin our provider version to 5.31.0 it's fine but 5.32.1 and 5.33.0 result in plan failures:
Error: reading CodePipeline Webhook (arn:aws:codepipeline:ca-central-1::webhook:example): operation error CodePipeline: ListWebhooks, failed to get rate limit token, retry quota exceeded, 3 available, 5 requested
https://github.com/aws/aws-sdk-go-v2/blob/e6eb2ad83b1dad3e9ff7cb22914a5cb70db2c797/aws/retry/standard.go#L231-L246
Same: We cannot use this for managing controls due to a "ThrottlingException" resulting from making the API call to "ListEnabledControls".
I just got off the phone with AWS Control Tower folks, who suggested updating the retry logic. It would be great to see a fix for this.
My thinking on this is to add new provider configuration attribute(s) that will customize the AWS SDK for Go v2 retryer
https://github.com/aws/aws-sdk-go-v2/blob/4fce0fdec6c41822255f4c3ec17aa46a9b6e2ac3/aws/retry/standard.go#L160-L171
in particular a RateLimiter with a configurable (different from the default of 500) token bucket size.
We are also facing crippling throttling on method ListTagsForResource for aws_config_config_rule resources.
@mlynch1985 (et al.) Could you please try setting retry_mode = "adaptive" in your provider configuration and see if this helps?
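For readers who want to try this suggestion, a minimal provider block might look like the sketch below. The region is taken from the original report, and the `max_retries` value is an illustrative assumption, not a recommendation from the maintainers. Results reported in this thread were mixed: the setting removed the errors for some users, at the cost of longer plan times, and did not help others.

```hcl
provider "aws" {
  region = "us-east-1"

  # Switch the AWS SDK for Go v2 retryer from the default "standard" mode
  # to "adaptive", which adds client-side rate limiting on throttling.
  retry_mode = "adaptive"

  # Assumption for illustration only: an explicit retry ceiling.
  # Omit this line to keep the provider default.
  max_retries = 25
}
```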
@ewbankkit Thanks for the suggestion, setting that on the provider did work in my case, a plan was generated without those rate limit errors. However, on one of our workspaces that consistently plans in ~3 minutes on provider version 5.31.0, this setting seems to have increased the plan time to around 9 minutes on the latest provider version 5.35.0.
@neogibson Thanks for looking into this. My guess is that we could fine-tune some of the options to get the behavior closer to AWS SDK for Go v1. The maintainers have this on the agenda to discuss at this week's tech debt review.
@mlynch1985 (et al.) Could you please try setting retry_mode = "adaptive" in your provider configuration and see if this helps?
I tested with this option and unfortunately the error is still present.
Error: reading ControlTower Control (arn:aws:organizations::012345678912:ou/o-abcdefghij/ou-abcd-abcdefgh,arn:aws:controltower:us-west-2::control/PBGUIXCOFNGC): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested
https://github.com/hashicorp/aws-sdk-go-base/pull/918, incorporated into the Terraform AWS Provider via https://github.com/hashicorp/terraform-provider-aws/pull/35817 should address the failed to get rate limit token, retry quota exceeded errors.
As we have not been able to reproduce the throttling errors in our testing we cannot guarantee that all error cases have been dealt with, so I will leave this issue open for comments.
The fix will be available in Terraform AWS Provider v5.37.0, likely released tomorrow.
Unfortunately we are still seeing this issue even with AWS provider version v5.37.0, with our project which manages controls and their mappings across our organization.
Terraform configuration:
╰─ terraform version
Terraform v1.6.6
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v5.37.0
+ provider registry.terraform.io/hashicorp/local v2.4.1
terraform plan
...
Planning failed. Terraform encountered an error while generating this plan.
╷
│ Error: reading ControlTower Control (arn:aws:organizations::1234567890123:ou/o-fsdiovxxxx/ou-o2bx-xxxxxxxxx,arn:aws:controltower:eu-central-1::control/AWS-GR_EC2_VOLUME_INUSE_CHECK): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 2 available, 5 requested
│
│ with aws_controltower_control.detective["Build_AWS-GR_EC2_VOLUME_INUSE_CHECK"],
│ on main.tf line 66, in resource "aws_controltower_control" "detective":
│ 66: resource "aws_controltower_control" "detective" {
│
╵
╷
│ Error: reading ControlTower Control (arn:aws:organizations::1234567890123:ou/o-fsdiovxxxx/ou-o2bx-xxxxxxxxx,arn:aws:controltower:eu-central-1::control/AWS-GR_RDS_STORAGE_ENCRYPTED): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 4 available, 5 requested
│
│ with aws_controltower_control.detective["Workloads/Shared_AWS-GR_RDS_STORAGE_ENCRYPTED"],
│ on main.tf line 66, in resource "aws_controltower_control" "detective":
│ 66: resource "aws_controltower_control" "detective" {
│
╵
╷
│ Error: reading ControlTower Control (arn:aws:organizations::1234567890123:ou/o-fsdiovxxxx/ou-o2bx-xxxxxxxxx,arn:aws:controltower:eu-central-1::control/AWS-GR_DETECT_CLOUDTRAIL_ENABLED_ON_MEMBER_ACCOUNTS): operation error ControlTower: ListEnabledControls, failed to get rate limit token, retry quota exceeded, 4 available, 5 requested
│
│ with aws_controltower_control.detective["DR_AWS-GR_DETECT_CLOUDTRAIL_ENABLED_ON_MEMBER_ACCOUNTS"],
│ on main.tf line 66, in resource "aws_controltower_control" "detective":
│ 66: resource "aws_controltower_control" "detective" {
│
... and many more similar quota exceeded examples
An earlier comment reported that v5.31.0 didn't have this issue. It does for us and our project. We are pinned on v5.26.0 until a solution can be found.
I retested today with TF v1.7.3 and AWS Provider v5.37.0 but still encountered the same errors. Reverting to v5.27.0 continues to be the workaround.
I retested with the 5.37.0 update today after experiencing errors with the 5.36.0 provider, then reverted to the 5.35.0 provider, as that throws no errors.
We're seeing it in CloudWatch: ListTagsForResource & CloudWatch: DescribeAlarms but only on very large states.
Yeah we're also experiencing this for CodeBuild.
Edit: pinning to 5.27.0 as suggested by @mlynch1985 worked for us. Will test with setting retry_mode.
FYI, I had some luck changing the provider configuration to include retry_mode = "adaptive" after the update to SDK v2. This resolved the throttling issues I was encountering w/ the Workspaces API.
For the next pass at a solution, we will add the ability to configure the token bucket capacity for the retry throttling rate limiter (e.g. https://github.com/aws/aws-sdk-go-v2/issues/1665#issuecomment-1469129967). This configured value will be used to initialize the capacity of every API client's token bucket.
With the very soon to be released v5.38.0 of the Terraform AWS provider we have added a new provider-level configuration parameter token_bucket_rate_limiter_capacity:
provider "aws" {
  token_bucket_rate_limiter_capacity = 5000
}
which allows the capacity of the rate limiter token bucket to be set.
The default is 500 tokens, so if you are experiencing throttling errors then please configure a larger value.
I tested with the suggested value of 5000 and still encountered the error. What is the downside to increasing this value? I don't want to set a ridiculously high number without understanding the potential risks. If it helps, I can set up a code dump so you can test the same code as me.
@mlynch1985 No additional resources are consumed by increasing the value.
@ewbankkit I had to set my provider to 50,000 before it worked, however I was able to complete the plan/apply with this update. I will close this issue now. Thank you!
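For reference, the configuration that this comment describes might look like the following sketch. The variable name `rate_limiter_capacity` is hypothetical and only makes the value easy to adjust per workspace; 50000 is the value reported to work above, not a general recommendation.

```hcl
# Hypothetical input variable so the capacity can be tuned without
# editing the provider block.
variable "rate_limiter_capacity" {
  description = "Capacity of the AWS SDK retry rate limiter token bucket"
  type        = number
  default     = 50000
}

provider "aws" {
  region = "us-east-1"

  # Requires provider v5.38.0 or later. The provider default is 500 tokens.
  token_bucket_rate_limiter_capacity = var.rate_limiter_capacity
}
```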
We are also facing crippling throttling on method ListTagsForResource for aws_config_config_rule resources.
We had issues with this over the last few weeks, but today it has started to work again, which seems to coincide with updating the provider to v5.41.0.
The last provider version it worked with was v5.38.0. Since then I've been trying some of the suggested workarounds, e.g. setting retry_mode to adaptive and token_bucket_rate_limiter_capacity to a very large number, but neither helped. We'll keep an eye on it.
Fixed for us in v5.42.0; specifically, it looks like this fix:
provider: Change the default AWS SDK for Go v2 API client RateLimiter to ratelimit.None so that services migrated to AWS SDK for Go v2 maintain behavioral compatibility with AWS SDK for Go v1 (https://github.com/hashicorp/terraform-provider-aws/issues/36467)
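To pick up this change, the provider constraint can require at least v5.42.0, as in the sketch below. It assumes no other constraint in the configuration or its modules pins an older version.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # ">= 5.42.0, < 6.0.0": includes the release where the default
      # RateLimiter was changed to ratelimit.None.
      version = "~> 5.42"
    }
  }
}
```

With this version in place, the retry_mode and token_bucket_rate_limiter_capacity workarounds discussed earlier in the thread should no longer be needed for the "retry quota exceeded" error, though that is worth confirming with a plan against your own state.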
We are still facing some issues related to this. We noticed strange behavior where the TF plan freezes while refreshing the state of several resources, times out after 40 minutes, and leaves the state locked.
Is anyone facing something similar, even with the latest version?
We see this behavior on any version higher than 5.32.
Thanks for fixing this!
@AbAvramidis Do you know which services are exhibiting this behavior?
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.