Terragrunt apply fails (could not find AWS credentials)
Hi All,
We use Terraform and Terragrunt to manage AWS infrastructure. When I run Terragrunt locally it works fine and deploys the infrastructure without issues, but when deploying through Jenkins it errors out saying no AWS credentials were found. This only happens in some of the folders; all the other services in other folders deploy successfully. It was working fine until a week ago, but all of a sudden there is an issue. Not sure what went wrong, any suggestions please?
Previously we used to save .terraform.lock.hcl in SCM along with terragrunt.hcl, but we removed it in some folders and there was inconsistency, so we've re-initialised and saved .terraform.lock.hcl in those folders. Is that causing issues?
Exact Errors
time=2023-09-22T11:41:56Z level=error msg=Module /home/ec2-user/workspace/CI-CD Infrastructure/nft/service-discovery-services has finished with an error: Error finding AWS credentials (did you set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables?): NoCredentialProviders: no valid providers in chain. Deprecated.
For verbose messaging see aws.Config.CredentialsChainVerboseErrors prefix=[/home/ec2-user/workspace/CI-CD Infrastructure/nft/service-discovery-services]
time=2023-09-22T11:41:59Z level=error msg=Module /home/ec2-user/workspace/CI-CD Infrastructure/nft/rds-config-null-resource has finished with an error: Error finding AWS credentials (did you set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables?): NoCredentialProviders: no valid providers in chain. Deprecated.
For verbose messaging see aws.Config.CredentialsChainVerboseErrors prefix=[/home/ec2-user/workspace/CI-CD Infrastructure/nft/rds-config-null-resource]
time=2023-09-22T11:42:03Z level=error msg=Module /home/ec2-user/workspace/CI-CD Infrastructure/nft/rds-config-null-resource has finished with an error: Error finding AWS credentials (did you set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables?): NoCredentialProviders: no valid providers in chain. Deprecated.
locals {
  account_vars     = read_terragrunt_config(find_in_parent_folders("account.hcl"))
  region_vars      = read_terragrunt_config(find_in_parent_folders("region.hcl"))
  environment_vars = read_terragrunt_config(find_in_parent_folders("environment.hcl"))

  account_name      = local.account_vars.locals.account_name
  account_name_abbr = local.account_vars.locals.account_name_abbr
  account_id        = local.account_vars.locals.aws_account_id
  aws_region        = local.region_vars.locals.aws_region
  environment_name  = local.environment_vars.locals.environment

  default_tags = {
    Name        = local.environment_name
    Environment = local.environment_name
    Terraform   = true
  }
}
# Generate an AWS provider block
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.aws_region}"
  # version = "= 3.30.0"

  # Only these AWS Account IDs may be operated on by this template
  allowed_account_ids = ["${local.account_id}"]

  # default_tags {
  #   tags = {
  #     Name        = "${local.environment_name}"
  #     Environment = "${local.environment_name}"
  #     Terraform   = true
  #   }
  # }
}
EOF
}
# Configure Terragrunt to automatically store tfstate files in an S3 bucket
remote_state {
  backend = "s3"
  config = {
    encrypt = true
    bucket  = "tfstate-apps-${local.account_id}-${local.aws_region}"
    key     = "${local.environment_name}/${path_relative_to_include()}/terraform.tfstate"
    region  = local.aws_region
    # dynamodb_table = "terraform-locks"
  }
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
}
inputs = merge(
  local.account_vars.locals,
  local.region_vars.locals,
  local.environment_vars.locals,
)
Versions
- Terragrunt version: v0.38.7
- Terraform version:
- Environment details (Ubuntu 20.04, Windows 10, etc.):
Any suggestions please?
Hello, I wanted to confirm: was the Terragrunt version updated, or is it the same as before? I suspect that the AWS credentials were removed from the environment variables used in the Jenkins job.
Hi denis256,
Terragrunt and Terraform remained the same version on local machines and Jenkins.
I don't think the AWS credentials were removed; if they had been, it shouldn't execute any modules, but some modules are being executed.
I've also been encountering this. Our Jenkins job does run-all init, validate, plan on many directories in parallel, and some of them (not the same ones, and not necessarily at the same points in the process) error out saying there are no credentials. I suspect AWS's behavior has changed (rate limiting, maybe?) because the Terragrunt version hasn't. Trying to see if auto-retry for this error helps now.
Update: auto-retry tuning is dicey. I got it to sometimes work by also setting the number of retries to 5, but occasionally that wasn't enough, so I also increased the delay, and then it started failing the job after only one error. So I haven't been able to come up with a consistent way to avoid this.
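For reference, the retry knobs being tuned here live in the root terragrunt.hcl. A minimal sketch with illustrative values (note: as far as I know, setting retryable_errors replaces Terragrunt's built-in default list, so include any defaults you still want):

```hcl
# Illustrative retry tuning; these values are examples, not a known-good config.
retry_max_attempts       = 5
retry_sleep_interval_sec = 10

# This replaces the default retryable error list.
retryable_errors = [
  "(?s).*NoCredentialProviders: no valid providers in chain.*",
]
```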
What version of terraform and terragrunt are you using?
Recent versions of Terraform seem to acknowledge this issue and they've rolled out an update. I tried with the latest version as well, still the same.
The only way I could reduce the number of AWS credential errors is by executing the shared directory (about 10 services) first and then the applications directory (which has 10+ folders, each with multiple services).
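One way to encode that ordering instead of running the directories in separate invocations is a dependencies block in each application unit, so that run-all processes the shared stack first. A sketch (the paths below are hypothetical examples, not from this repo):

```hcl
# In an application unit's terragrunt.hcl; the path is a hypothetical example.
# terragrunt run-all applies the listed units before this one.
dependencies {
  paths = ["../../shared/service-discovery-services"]
}
```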
Changing auto-retry doesn't seem to work, probably because the error Terragrunt surfaces is its own and not caught from elsewhere? I have:
retry_sleep_interval_sec = 10
retryable_errors = [
  # Default list
  "(?s).*Failed to load state.*tcp.*timeout.*",
  "(?s).*Failed to load backend.*TLS handshake timeout.*",
  "(?s).*Creating metric alarm failed.*request to update this alarm is in progress.*",
  "(?s).*Error installing provider.*TLS handshake timeout.*",
  "(?s).*Error configuring the backend.*TLS handshake timeout.*",
  "(?s).*Error installing provider.*tcp.*timeout.*",
  "(?s).*Error installing provider.*tcp.*connection reset by peer.*",
  "NoSuchBucket: The specified bucket does not exist",
  "(?s).*Error creating SSM parameter: TooManyUpdates:.*",
  "(?s).*app.terraform.io.*: 429 Too Many Requests.*",
  "(?s).*ssh_exchange_identification.*Connection closed by remote host.*",
  "(?s).*Client\\.Timeout exceeded while awaiting headers.*",
  "(?s).*Could not download module.*The requested URL returned error: 429.*",
  # Tests hit erroneous NoCredentialProviders errors because of some kind of rate limiting AWS-side
  "(?s).*NoCredentialProviders: no valid providers in chain.*",
]
but it doesn't retry at all.
Contacted AWS support, who told me that they don't publish the throttling/rate limiting numbers because "they're internal" (so, they don't publish the numbers because they don't publish the numbers?) and that Terragrunt should implement a retry with exponential backoff.
The AWS support person indicated that the limit might change at any point, which I suspect means they did recently change it. Experimentally: we've got about 150 modules and we hit a few denials each time; setting TERRAGRUNT_PARALLELISM to 100 seems to prevent the failures, though I haven't got many runs to prove it. UPDATE: no, we see it at 100. I think the limit must be under 70.
@denis256 given the latest information, is there anything that should be looked at from this point?
I will do more tests, but so far I have been thinking about:
- retries when AWS API errors happen
- automatically adjusting TERRAGRUNT_PARALLELISM (if not configured) based on the number of modules
That would be a great help, denis; we've been facing this issue for a while.
Hi, I wanted to check if the issue still appears after upgrade to https://github.com/gruntwork-io/terragrunt/releases/tag/v0.54.13
Hi @denis256
unfortunately no, the issue still remains.
It is still complicated on my side to reproduce this issue. I tried to set something up in https://github.com/denis256/terragrunt-tests/tree/master/aws-rate-limit but I'm still not getting the same error as reported.
It would be helpful if you could share an example repository where this error happens.
Hi denis, sorry, I don't have any samples to share due to restrictions. I've seen your sample repo; I think having multiple modules like rate1, each with a similar/somewhat different main.tf, would reproduce this issue. I have ~145 modules. Thanks
@denis256
I got this working; my issue was resolved by updating Terragrunt and also increasing the RAM. It would be nice if it highlighted the memory error and also limited Terragrunt/Terraform memory usage.
Articles/blogs online about limiting RAM usage show that quite a few people experience this issue because of module and provider sizes. The problem is not with the module API calls but with the AWS provider processes, which are heavy because they support a lot of AWS services at once. In our case, environments like NFT that have a lot of resources to deploy require a lot of provider versions, and doing all of that at once needs a good bit of RAM, so 8 GB would crash Terraform.
Any plans in the future to throttle it without breaking it?