Issues with *-all commands and terraform plugin cache directory

Open pietro opened this issue 4 years ago • 22 comments

If I set TF_PLUGIN_CACHE_DIR to any directory, any terragrunt *-all command fails with Could not satisfy plugin requirements errors. My repro case is below:

terragrunt.hcl:

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket  = "my-terraform-test-state-test"
    key     = "${path_relative_to_include()}/terraform.tfstate"
    region  = "us-west-2"
    encrypt = true
  }
}

a/terragrunt.hcl:

terraform {
  source = "../example-tf-module"
}

include {
  path = find_in_parent_folders()
}

example-tf-module/main.tf:

data "aws_region" "current" {}

output "aws_region" {
  value = data.aws_region.current.name
}

Then I create directories b through m and copy a/terragrunt.hcl into each of them. My final directory tree is:

.
├── a
│   └── terragrunt.hcl
├── b
│   └── terragrunt.hcl
├── c
│   └── terragrunt.hcl
├── d
│   └── terragrunt.hcl
├── e
│   └── terragrunt.hcl
├── example-tf-module
│   └── main.tf
├── f
│   └── terragrunt.hcl
├── g
│   └── terragrunt.hcl
├── h
│   └── terragrunt.hcl
├── i
│   └── terragrunt.hcl
├── j
│   └── terragrunt.hcl
├── k
│   └── terragrunt.hcl
├── l
│   └── terragrunt.hcl
├── m
│   └── terragrunt.hcl
└── terragrunt.hcl

If I cd into any of the one-letter directories, terragrunt validate works fine. From the root dir, with both TF_LOG and TG_LOG set to debug, terragrunt validate-all fails for some of the modules. TF/TG log and TF/TG output from failed modules:


[terragrunt] [/Users/pietro/tmp/i] 2020/06/05 15:50:40 Running command: terraform validate
2020/06/05 15:50:40 [INFO] Terraform version: 0.12.26
2020/06/05 15:50:40 [INFO] Go runtime version: go1.13.11
2020/06/05 15:50:40 [INFO] CLI args: []string{"/usr/local/bin/terraform", "validate"}
2020/06/05 15:50:40 [DEBUG] Attempting to open CLI config file: /Users/pietro_monteiro/.terraformrc
2020/06/05 15:50:40 [DEBUG] File doesn't exist, but doesn't need to. Ignoring.
2020/06/05 15:50:40 [INFO] CLI command args: []string{"validate"}
2020/06/05 15:50:40 [DEBUG] checking for provider in "."
2020/06/05 15:50:40 [DEBUG] checking for provider in "/usr/local/bin"
2020/06/05 15:50:40 [DEBUG] checking for provider in ".terraform/plugins/darwin_amd64"
2020/06/05 15:50:40 [DEBUG] found provider "terraform-provider-aws_v2.65.0_x4"
2020/06/05 15:50:40 [DEBUG] found valid plugin: "aws", "2.65.0", "/Users/pietro/tmp/i/.terragrunt-cache/6gc2TEE1zxjCqaeQgfJsqVbaD4E/DfnLP98YzbykP1vtM_BhaFk10FU/.terraform/plugins/darwin_amd64/terraform-provider-aws_v2.65.0_x4"
2020/06/05 15:50:40 [DEBUG] checking for provisioner in "."
2020/06/05 15:50:40 [DEBUG] checking for provisioner in "/usr/local/bin"
2020/06/05 15:50:40 [DEBUG] checking for provisioner in ".terraform/plugins/darwin_amd64"
2020/06/05 15:50:40 [TRACE] terraform.NewContext: starting
2020/06/05 15:50:40 [TRACE] terraform.NewContext: resolving provider version selections

Error: Could not satisfy plugin requirements


Plugin reinitialization required. Please run "terraform init".

Plugins are external binaries that Terraform uses to access and manipulate
resources. The configuration provided requires plugins which can't be located,
don't satisfy the version constraints, or are otherwise incompatible.

Terraform automatically discovers provider requirements from your
configuration, including providers used in child modules. To see the
requirements and constraints from each module, run "terraform providers".



Error: provider.aws: new or changed plugin executable


[terragrunt] [/Users/pietro/tmp/h] 2020/06/05 15:50:41 Module /Users/pietro/tmp/h has finished with an error: Hit multiple errors:
exit status 1

Error: Could not satisfy plugin requirements


Plugin reinitialization required. Please run "terraform init".

Plugins are external binaries that Terraform uses to access and manipulate
resources. The configuration provided requires plugins which can't be located,
don't satisfy the version constraints, or are otherwise incompatible.

Terraform automatically discovers provider requirements from your
configuration, including providers used in child modules. To see the
requirements and constraints from each module, run "terraform providers".



Error: provider.aws: new or changed plugin executable


[terragrunt] [/Users/pietro/tmp/i] 2020/06/05 15:50:41 Module /Users/pietro/tmp/i has finished with an error: Hit multiple errors:
exit status 1

Using --terragrunt-parallelism 1 fixes this, but it makes my real code super slow to validate/plan/apply. My workaround is to emulate terragrunt init-all --terragrunt-parallelism 1 by using bash to run terragrunt init in each module sequentially.
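A minimal shell sketch of that sequential-init workaround, assuming every module sits exactly one directory below the root, as in the tree above. The helper name init_sequentially is hypothetical and is not a Terragrunt feature.

```shell
# Run `terragrunt init` in each module directory, one at a time.
# Assumption: modules live one level below the root (a/, b/, ...).
init_sequentially() {
  root="${1:-.}"
  for cfg in "$root"/*/terragrunt.hcl; do
    [ -f "$cfg" ] || continue      # the glob matched nothing
    dir=$(dirname "$cfg")
    echo "init: $dir"
    # The subshell keeps the caller's working directory unchanged.
    (cd "$dir" && terragrunt init) || return 1
  done
}
```

Once every module is initialized, a parallel command can follow, for example: init_sequentially . && terragrunt validate-all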

pietro avatar Jun 05 '20 20:06 pietro

This happens because terraform isn't really designed to handle multiple concurrent invocations of the binary. When all the terraform processes try to initialize the plugin cache at once, they download the same provider versions and overwrite each other's files. With that said, this should work as expected once the plugin directory is sufficiently seeded.

Here are two other workarounds for this:

  • Continuously cycle between deleting the terragrunt cache (find . -name ".terragrunt-cache" | xargs rm -r) and running terragrunt validate-all until the plugin cache is seeded.

  • Create a module for the sole purpose of seeding the plugin cache. This module should only have provider blocks with all the versions that you need to use. Then, you can run terragrunt validate just in that module to seed the cache.
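The first workaround can be sketched as a bounded retry loop. This is only an illustration: the helper name validate_until_seeded and the attempt limit are assumptions, and the loop must be run from the repository root.

```shell
# Retry validate-all, wiping the Terragrunt caches between attempts.
# The plugin cache (TF_PLUGIN_CACHE_DIR) is left alone so it fills up over time.
validate_until_seeded() {
  max_attempts="${1:-5}"           # hypothetical limit, not from the thread
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if terragrunt validate-all; then
      return 0
    fi
    # -prune stops find from descending into directories it is about to delete.
    find . -type d -name ".terragrunt-cache" -prune -exec rm -rf {} +
    attempt=$((attempt + 1))
  done
  return 1
}
```

Bounding the attempts avoids looping forever when the failure has a cause other than a half-populated cache.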

Solving this is something we've been thinking about, but we don't have any design for a solution right now.

yorinasub17 avatar Jun 08 '20 23:06 yorinasub17

Terragrunt *-all commands run an implicit terraform init if no .terragrunt-cache directory exists. @yorinasub17 could an additional parameter --terragrunt-init-parallelism be implemented, so that terragrunt would not run terraform init in parallel, avoiding this issue?

Strangely, I had a pipeline with TF_PLUGIN_CACHE_DIR and the terragrunt plan-all command on freshly spawned VMs, and it worked well for a long time but stopped working because of this issue a few weeks ago.

askoriy avatar Nov 20 '20 08:11 askoriy

If there is a way to implement --terragrunt-init-parallelism without overcomplicating the pipeline, then that could work. With that said, it could be confusing to have multiple parallelism flags in that fashion.

Side note: I personally would rather invest in a proper dependency management solution. E.g., would be great if you could run terragrunt dep-retrieve which would populate the plugin cache, and also some kind of module cache so reusable modules for the same versions are also shared. It would be more expensive to implement/design, but has high value.

yorinasub17 avatar Nov 25 '20 06:11 yorinasub17

As a quick and maybe quite clean workaround, I followed another approach: I quickly wrote a bash wrapper script around terragrunt which basically does only the following:

  1. create a directory for caching of terraform plugins and export it as environment variable TF_PLUGIN_CACHE_DIR
  2. read a .tf file in which all needed plugins are specified
  3. run terraform init and clean up .terraform* files afterwards
  4. finally run terragrunt

I think this could be implemented natively in terragrunt (or am I wrong?), and it would solve the drawback of multiple parallel provider downloads quite neatly by just creating and populating a caching directory beforehand.
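The wrapper steps above can be sketched roughly as follows. The function name seed_plugin_cache, its arguments, and the default cache location are all assumptions made for illustration; the original script was not posted.

```shell
# Seed the provider cache from one .tf file that lists every required provider.
seed_plugin_cache() {
  providers_tf="$1"                                   # file with required_providers
  cache_dir="${2:-$HOME/.terraform.d/plugin-cache}"   # assumed default location
  # Step 1: the cache directory must exist before terraform will use it.
  mkdir -p "$cache_dir" || return 1
  export TF_PLUGIN_CACHE_DIR="$cache_dir"

  # Step 2: work on a copy of the providers file in a scratch directory.
  work_dir=$(mktemp -d) || return 1
  cp "$providers_tf" "$work_dir/providers.tf" || return 1

  # Step 3: init fills the cache; -backend=false skips remote state setup.
  status=0
  (cd "$work_dir" && terraform init -backend=false -input=false) || status=$?
  rm -rf "$work_dir"                                  # drops the .terraform* files
  return "$status"
}
```

Step 4 is then an ordinary run, for example: seed_plugin_cache providers.tf && terragrunt run-all plan. The sketch only covers providers listed in the given file.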

A possible implementation plan for these steps:

  • To use this, a parameter --terragrunt-caching could be established which would "activate" all of this
  • A parameter --terragrunt-cache-dir could let one specify the directory in which the cache will be stored. This cache dir could be purged before the run so one could always start with an empty cache. Also, this directory should be exported to the environment so that terraform becomes aware of it
  • A parameter --terragrunt-cache-plugins could get a list of plugins to cache (for example as comma-separated string hashicorp/aws, hashicorp/template, ...). With the terragrunt native generate logic, one could generate a terraform file which only defines terraform { required_providers { ... stuff. Alternatively, the parameter --terragrunt-cache-plugins could be set directly to a terraform file.
  • Then just run terraform init in a temporary directory (for example /tmp/terragrunt-init-dir) or maybe even in the directory specified in the --terragrunt-cache-dir directory. Afterwards clean up the additionally generated files .terraform*
  • Continue as usual

Does this seem realistic? I think this shouldn't be too hard and is quite clean. If yes, the number of needed downloads could be reduced drastically when having dozens of modules if one knows beforehand which plugins are needed.

Zyntogz avatar Dec 18 '20 15:12 Zyntogz

@Zyntogz your trick works because you have your terraform code locally. But if remote terraform modules are used (source = github.com/...), then the *.tf files will not be populated until terragrunt init is executed.

askoriy avatar Dec 18 '20 15:12 askoriy

I worked around this by creating a cache-directory per plan. This does mean that each plan will have to download providers at least once, but on subsequent runs the cache can be used to fetch the providers. I configure my CI to cache the entire .terraform-plugin-cache directory. I added the following to my top-level terragrunt.hcl:

locals {
  terraform_cache_dir = format("%s/%s", get_env("TF_PLUGIN_CACHE_DIR", "~/.terraform-plugin-cache"), path_relative_to_include())
}

terraform {
  before_hook "provider_cache" {
    commands = ["init", "validate", "plan", "apply"]
    execute  = ["mkdir", "-pv", local.terraform_cache_dir]
  }

  extra_arguments "provider_cache" {
    commands  = ["init", "validate", "plan", "apply"]
    arguments = []

    env_vars = {
      TF_PLUGIN_CACHE_DIR = local.terraform_cache_dir
    }
  }
}

mhulscher avatar Feb 11 '21 09:02 mhulscher

We have some projects with many terragrunt.hcl files (e.g. infrastructure-live repository), and terragrunt *-all executions have started depleting the available disk space in our GitLab CI shared runners. As we don't keep the Terragrunt cache between jobs, a quick workaround for us has been to clean up the generated Terragrunt cache as each module is processed:

# In the general terragrunt.hcl configuration file.

terraform {
  after_hook "after_delete_terragrunt_cache" {
    commands     = ["validate", "plan", "apply"]
    execute      = ["rm", "-rf", ".terragrunt-cache"]
    working_dir  = "${get_terragrunt_dir()}"
    run_on_error = true
  }
}

Having a centralized TF_PLUGIN_CACHE_DIR directory didn't work for us when using Terragrunt parallelism: concurrent module executions often find partially downloaded providers and fail.

adamantike avatar Aug 19 '21 14:08 adamantike

@adamantike How are you managing your "output.tfplan"? After adding this section, the plan output is deleted every time and the execution ends with:

 Error: Failed to load "output.tfplan" as a plan file
│
│ Error: stat output.tfplan: no such file or directory

headincl0ud avatar Jun 19 '22 18:06 headincl0ud

So we've also encountered this issue when using terragrunt run-all commands together with plugin-dir.

Here is how we've fixed it:

terraform {
    source = "path/to/tf/module"

    extra_arguments "terraform_args" {
        commands  = ["init"]
        arguments = [
            "-plugin-dir=/path/to/terraform/plugin-cache/"
        ]
    }
}

Hope it'll help you.

Xat59 avatar Jul 28 '22 10:07 Xat59

@headincl0ud, you can either:

  • Run the command specifying -out ... to a path that is not within the .terragrunt-cache folder, so the rm command doesn't delete the generated plans, or
  • Replace execute = ["rm", "-rf", ".terragrunt-cache"] with a command that deletes .terragrunt-cache content excluding *.tfplan files (e.g. using find).
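The second option could look something like the sketch below. The helper name clean_cache_keep_plans is hypothetical, and it assumes saved plans end in .tfplan and that find supports -delete and -empty (GNU and BSD find both do).

```shell
# Empty a .terragrunt-cache directory but keep any saved plan files.
clean_cache_keep_plans() {
  cache_dir="${1:-.terragrunt-cache}"
  [ -d "$cache_dir" ] || return 0
  # Remove every file and symlink except *.tfplan.
  find "$cache_dir" ! -type d ! -name '*.tfplan' -delete
  # Remove directories left empty; -delete visits children before parents.
  find "$cache_dir" -type d -empty -delete
}
```

Directories that still hold a plan file are kept, so a later apply can find the plan at the same path.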

@Xat59, take into account that the approach of centralizing the plugin directory is susceptible to the issue explained in this comment. With parallelism set, and a project with many terragrunt.hcl files, the chance that init executions fail by reading plugins partially downloaded by other parallel executions increases.

adamantike avatar Aug 16 '22 01:08 adamantike

I am using terragrunt together with atlantis and terragrunt-config-generator and hit the same issue. Plugins are already present in the TF_PLUGIN_CACHE_DIR but still in most cases more than 50% of plans fail with errors like this:

Error: Required plugins are not installed

The installed provider plugins are not consistent with the packages
selected in the dependency lock file:
   - registry.terraform.io/hashicorp/google-beta: the cached package for registry.terraform.io/hashicorp/google-beta 4.67.0 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file 

What confuses me is the version it looks up in the cache. Here it is google-beta 4.67.0, but in the .terraform.lock.hcl it is fixed to 4.48.0:

provider "registry.terraform.io/hashicorp/google-beta" {
  version     = "4.48.0"
  constraints = "4.48.0"

Is this a separate issue I am encountering here?

norman-zon avatar May 31 '23 09:05 norman-zon

I fixed something like this in #2542. Are you using Terragrunt >=v0.45.12?

geekofalltrades avatar Jun 01 '23 05:06 geekofalltrades

I was on v0.44.5. Upgrading to v0.45.18 fixed the issue. Thank you very much!

norman-zon avatar Jun 01 '23 10:06 norman-zon

@geekofalltrades I'm still seeing this with Terragrunt v0.48.0 and Terraform v1.5.2.

Seeding the plugin cache does not seem to help either; I hit this issue after running terragrunt run-all plan. For some reason, it seems to start re-downloading the same provider that is already installed in the plugin cache.

│ Error: Required plugins are not installed
│ 
│ The installed provider plugins are not consistent with the packages
│ selected in the dependency lock file:
│   - registry.terraform.io/hashicorp/aws: the cached package for registry.terraform.io/hashicorp/aws 5.6.2 (in .terraform/providers) does not match any of the checksums recorded in the dependency lock file
│ 
│ Terraform uses external plugins to integrate with a variety of different
│ infrastructure services. To download the plugins required for this
│ configuration, run:
│   terraform init

I cannot find any way to use the plugin cache without terragrunt breaking completely, so I guess I'll just have to commit to keeping tens of GB of copies of the same provider library.

albgus avatar Jul 06 '23 12:07 albgus

run-all plan is parallelized, and the cache still doesn't support parallel writes. You could try deleting the current cache and running again with --terragrunt-parallelism 1 (or whatever the flag is). We solve this by having a separate no-op module that requires the union of all the providers we use and just running init on it to warm the cache.

geekofalltrades avatar Jul 06 '23 17:07 geekofalltrades

Hi @geekofalltrades,

Terragrunt itself does not install providers; Terraform is responsible for that. As stated in the official Terraform documentation, safe operation is not guaranteed if init runs in parallel:

Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.

Thus, we cannot influence it in any way.

levkohimins avatar Aug 31 '23 01:08 levkohimins

Resolved in v0.50.11 release.

levkohimins avatar Sep 01 '23 18:09 levkohimins

I am still seeing this issue with 0.50.11

Fomiller avatar Dec 13 '23 20:12 Fomiller

Hi @Fomiller,

The reason may be that Terragrunt does not correctly detect that the cache is in use: https://github.com/gruntwork-io/terragrunt/blob/eec362e708edf2ba94c76e67caffe0b68110821d/terraform/config.go#L10-L18

Since Terragrunt detects correctly in my test environment, please provide an example to reproduce the issue.

levkohimins avatar Dec 13 '23 20:12 levkohimins

With the following:

  • Terraform version: 1.5.0
  • Terragrunt version: 0.50.11
  • Terragrunt parallelism: 3
  • TF_PLUGIN_CACHE_DIR = /tmp/.terraform.d/plugin-cache/

When running terragrunt run-all apply --terragrunt-non-interactive, I receive the following error:

Error: Failed to install provider
Error while installing hashicorp/aws v5.30.0: open
 /tmp/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/5.30.0/linux_amd64/terraform-provider-aws_v5.30.0_x5:
text file busy

My providers, declared in my root terragrunt.hcl file, look like this:

terraform {
    required_version = ">=1.3.0"
    required_providers {
        aws = {
            source  = "hashicorp/aws"
            version = ">= 5.0.0"
        }
        template = {
            source  = "hashicorp/template"
            version = "2.2.0"
        }
        random = {
            source  = "hashicorp/random"
            version = "~> 2.3.0"
        }
        null = {
            source  = "hashicorp/null"
            version = "3.2.1"
        }
    }
}

The overall directory structure is very similar to the one in the original issue.

Fomiller avatar Dec 14 '23 01:12 Fomiller

Thank you @Fomiller! I will try to reproduce the issue locally and get back to you.

levkohimins avatar Dec 18 '23 21:12 levkohimins

@Fomiller, I'm sorry to be late with the reply.

The only way at the moment is to run two commands:

  1. run-all init runs terraform init sequentially for all modules, just like with --terragrunt-parallelism 1
  2. Any other command that can/should be executed in parallel.
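A minimal sketch of that two-command sequence as a shell helper. The name run_all_two_phase is hypothetical, and passing --terragrunt-parallelism 1 to the init phase is an added precaution for the case where Terragrunt does not detect the plugin cache by itself.

```shell
# Phase 1: serial init for every module. Phase 2: the real command, in parallel.
run_all_two_phase() {
  if [ "$#" -eq 0 ]; then
    echo "usage: run_all_two_phase <command> [terragrunt flags...]" >&2
    return 2
  fi
  # Explicitly serial, so no two terraform init processes write the cache at once.
  terragrunt run-all init --terragrunt-parallelism 1 || return 1
  # Providers are now in place, so this phase can use any parallelism.
  terragrunt run-all "$@"
}
```

For example: run_all_two_phase apply --terragrunt-non-interactive --terragrunt-parallelism 3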

We are working on a better solution in #2920.

levkohimins avatar Feb 13 '24 23:02 levkohimins