
Terragrunt cache bloats the disk usage really fast.

fieldawarepiotr opened this issue 7 years ago • 22 comments

Each instance of a terragrunt module creates its own cache, which bloats disk usage.

$ ls -la
-rw-r--r-- 1 ed kvm  3640 Aug  2 10:18 README.md
-rw-r--r-- 1 ed kvm  1313 Sep  4 16:26 terraform.tfvars
drwx------ 3 ed kvm  4096 Sep  3 10:11 .terragrunt-cache
$ du -sh .terragrunt-cache/
289M    .terragrunt-cache/

Is it possible to use a shared cache that re-uses already downloaded modules (and their versions), so I don't have to download all of the dependencies for each module instantiation? Collectively it is 10GB of cache.

fieldawarepiotr avatar Sep 04 '18 15:09 fieldawarepiotr

There are a few aspects to this:

  1. We want to make debugging easy. We used to download code into a tmp dir or home folder, but that made it tough to find which folders Terragrunt was using. Having everything in the local .terragrunt-cache folder makes it easier to see and figure out what Terragrunt is doing.
  2. However, downloading the full repo every time eats up lots of disk space. Is it possible to download the repo into a common location (e.g., ~/.terragrunt-cache) and symlink it to the local .terragrunt-cache? Does this work properly if you're running apply-all and lots of downloads are happening concurrently? Does this work properly if you are using different versions of a repo? How do we version the repo so you can share code from the same versions but don't mix up code from different versions?
  3. Terraform then downloads the provider binaries. You can reduce the disk space usage here by enabling the Terraform provider cache (see the sketch after this list).
  4. Terraform then downloads the modules your .tf files reference. I have no clue if we can do anything to optimize this.
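
For point 3, a minimal sketch of enabling the shared Terraform provider plugin cache (the path is just an example):

# Point Terraform at a shared plugin cache so each module's .terraform dir
# links to cached providers instead of downloading its own copy.
export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"
mkdir -p "$TF_PLUGIN_CACHE_DIR"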

Suggestions on how to improve this are welcome!

brikis98 avatar Sep 05 '18 11:09 brikis98

It would be nice to have an option to automatically remove the cache after execution, e.g. after an apply command.

DenisBY avatar Jul 12 '19 11:07 DenisBY

I just deleted 120GB of .terragrunt-cache. I'm working on multiple environments (17 to be exact, 2 mostly destroyed and left on standby, ~750 modules used across them), all aligned with the same module versions. Keeping a .terragrunt-cache per module is the wrong architectural design; terragrunt shouldn't force me to download the same repos over and over again. By default it should use a common directory with symlinks, as @brikis98 said. Having a flag to create a local .terragrunt-cache could be an option for debugging (though I've never needed to debug it).

I can't imagine supporting and working on infrastructure with 100 clients or more. I would have to either delete the local .terragrunt-cache after every apply, forcing me to redownload hundreds of repos, or upgrade my SSD to at least 1TB (on macOS that's neither easy nor cheap). terragrunt in this form does not scale.

3h4x avatar Oct 13 '19 07:10 3h4x

@3h4x Ideas on how to improve this are welcome, but we need something that explicitly explains how it solves the issues in https://github.com/gruntwork-io/terragrunt/issues/561#issuecomment-418692976.

brikis98 avatar Oct 13 '19 10:10 brikis98

@brikis98 I have a few ideas, but I'm not sure how complicated or feasible they are. One is the symlinks already mentioned; the second is a proxy that replaces the source with an ad-hoc localhost cache repo, which is kind of a nasty hack. Unfortunately I don't think I will be able to help sort out this issue.

3h4x avatar Oct 14 '19 07:10 3h4x

Proxy sounds a bit too hacky. Symlinks are more promising, but not without a lot of complexities and gotchas. We're certainly open to PRs that can think through those issues, but for now, periodically clearing the cache as documented here is hopefully a good-enough workaround.

brikis98 avatar Oct 14 '19 11:10 brikis98

I'm also interested in this problem. In my case, we have a throttled connection to the git server, and a fresh pull every time gets slow very quickly.

I may have time to invest in this for an MR if we have a favorable approach.

mateimicu avatar Dec 18 '19 09:12 mateimicu

I may have time to invest in this for an MR if we have a favorable approach.

A PR with a proposal (e.g., just written in a README) that thinks through all the corner cases I mentioned above is welcome!

brikis98 avatar Dec 19 '19 05:12 brikis98

To help with the problem: if you are using Terraform 0.12, you can add depth=1 as a parameter to your source path to have Terraform do only a shallow clone of the git repo. Especially when combined with the plugin cache mentioned earlier, this really cut down my disk space usage.

e.g.:

terraform {
  source = "git::https://github.com/lgallard/terraform-aws-cognito-user-pool.git//?ref=0.4.0&depth=1"
}

It's notable that the plugin cache uses hard links, at least in some cases, so some tools (including du) inflate how much space appears to be used up. Notice that the inode numbers at the start of this ls output are identical:

$ ls -i ~/.terraform.d/plugin_cache/linux_amd64/terraform-provider-aws_v2.60.0_x4 
55312406 /home/jfharden/.terraform.d/plugin_cache/linux_amd64/terraform-provider-aws_v2.60.0_x4
$ ls -i .terragrunt-cache/TmsRQq5jb8Fikqhb4v0N_CPOD8Y/wvSG5F9NOzb3ZsP4sykMWVf-V1c/.terraform/plugins/linux_amd64/terraform-provider-aws_v2.60.0_x4 
55312406 .terragrunt-cache/TmsRQq5jb8Fikqhb4v0N_CPOD8Y/wvSG5F9NOzb3ZsP4sykMWVf-V1c/.terraform/plugins/linux_amd64/terraform-provider-aws_v2.60.0_x4

jfharden avatar May 04 '20 08:05 jfharden

I'm pretty convinced symlinking is going to cause all kinds of trouble, especially with the generators creating provider files etc. inside the module directory, but here is one possible solution, which does have some caveats:

The repos could be cloned into a cache directory, something like ~/.terragrunt-cache/modules/github.com/owner/repo.git/<gitref>/, and then you could hard link instead of symlink. Orchestrating this yourself would be painful, but if you were to rsync the directory you could use the --link-dest option, which deals with all the intricacies (see the sketch after the caveats below). This way you cut the amount of disk space used dramatically if the same module has been cloned more than once, or if the same repo contains multiple modules.

The caveats here are:

  1. It means the OS must have rsync installed (if using rsync to orchestrate)
  2. It doesn't work across filesystem boundaries
  3. I am unsure about Windows support. I know it's possible to get rsync working on Windows, but I suspect it's not as trivial as on Linux/macOS (where it's usually preinstalled, or a simple apt/yum/brew command away).
  4. Hard links only work on some filesystems (but on all the major default ones that I know of: ext3, ext4, ZFS, NTFS, HFS+).
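
A rough sketch of that rsync-based hard-link approach, with made-up cache and working-copy paths purely for illustration:

# Shared clone of the module repo at a specific ref (hypothetical layout)
CACHE="$HOME/.terragrunt-cache/modules/github.com/owner/repo.git/v1.2.0"

# Working copy for one particular Terragrunt module
WORK=".terragrunt-cache/abc123/repo"
mkdir -p "$WORK"

# --link-dest hard links every file that is identical to the copy in $CACHE
# instead of copying it, so the working copy costs almost no extra disk space.
rsync --archive --link-dest="$CACHE" "$CACHE/" "$WORK/"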

jfharden avatar May 04 '20 08:05 jfharden

What we ended up doing internally is to create a really small wrapper over terragrunt.

This tool fetches all sources, clones them in the format ~/path-to-cache-dir-/source-name/<gitref>, then applies via the wrapper, using --terragrunt-source to specify the source.

This is highly tailored to our use case/directory structure and module source :(
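
Roughly, the idea looks like this (the repo URL, cache path, and ref are made up for illustration; --terragrunt-source is the actual Terragrunt flag):

# Clone the module repo once into a shared cache, keyed by ref
git clone --branch v0.4.0 --depth 1 \
    https://github.com/example-org/terraform-modules.git \
    ~/.terragrunt-source-cache/terraform-modules/v0.4.0

# Point Terragrunt at the local checkout instead of letting it re-download the source
terragrunt apply --terragrunt-source ~/.terragrunt-source-cache/terraform-modules/v0.4.0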

mateimicu avatar May 04 '20 19:05 mateimicu

I wonder if leveraging git's reference feature may help us here? Or at least it's worth exploring... somehow fetch it once and force all the other clones to be reference clones.

git clone --reference

https://randyfay.com/content/reference-cache-repositories-speed-clones-git-clone-reference
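
A small sketch of what that could look like (the repo URL and cache path are illustrative):

# One-time: keep a shared bare mirror of the upstream repo
git clone --mirror https://github.com/example-org/terraform-modules.git \
    ~/.git-reference-cache/terraform-modules.git

# Later clones borrow objects from the local mirror, so very little is re-downloaded.
# Caveat: deleting the reference repo breaks clones made this way; pass
# --dissociate as well if you want the borrowed objects copied instead.
git clone --reference ~/.git-reference-cache/terraform-modules.git \
    https://github.com/example-org/terraform-modules.git ./terraform-modules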

geota avatar Jul 09 '20 05:07 geota

@jfharden, is depth working for you? For me it only works the first time (when it also creates the terragrunt cache dir); afterwards it fails due to unrelated git histories. There are a few issues open about this on the Terraform side, but it seems like nothing is moving there.

https://github.com/hashicorp/terraform/issues/10703

Would it be possible for Terragrunt to delete the local module when it detects a change?

trallnag avatar Jan 22 '21 11:01 trallnag

In the hopes of saving any other poor soul the frustration... I got my Terragrunt cache size from dozens and dozens of GB down to ~350MB by adding these to my environment:

export TERRAGRUNT_DOWNLOAD=$PROJECT_DIR/.terragrunt-cache
export TF_PLUGIN_CACHE_DIR=$TERRAGRUNT_DOWNLOAD/.plugins

Note the manual says that plugin installation from concurrent modules is undefined, but I've not had a problem yet...

nevelis avatar Feb 08 '22 03:02 nevelis

Is TF_PLUGIN_CACHE_DIR concurrency-safe?

smitthakkar96 avatar Mar 07 '22 12:03 smitthakkar96

Is TF_PLUGIN_CACHE_DIR concurrency-safe?

Nope

Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.

ryanpodonnell1 avatar May 03 '22 16:05 ryanpodonnell1

Is TF_PLUGIN_CACHE_DIR concurrency-safe?

Nope

Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.

But you can prepopulate it with a provider mirror.
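
For example, roughly like this (the mirror path is my own choice; terraform providers mirror and the provider_installation CLI config are standard Terraform features):

# Download every provider required by the current configuration into a local
# mirror directory, once, before any concurrent init/apply runs
terraform providers mirror /opt/terraform-provider-mirror

# Tell Terraform to install providers from that mirror only
cat > ~/.terraformrc <<'EOF'
provider_installation {
  filesystem_mirror {
    path = "/opt/terraform-provider-mirror"
  }
}
EOF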

lorengordon avatar May 03 '22 16:05 lorengordon

I'd love a fix for this. Running terragrunt in something like an Azure Container Job isn't possible as there isn't enough storage.

thisispaulsmith avatar Jan 18 '24 09:01 thisispaulsmith

Prepopulating the provider cache as suggested by @lorengordon works like a charm; here is an example of how we do it in our Atlantis setup.

Provider List

❯ tree hack/providers/ -a
hack/providers/
├── aws-4.59.0
│   ├── .terraform.lock.hcl
│   └── versions.tf
├── datadog-3.11.0
│   ├── .terraform.lock.hcl
│   └── versions.tf
├── datadog-3.16.0
│   ├── .terraform.lock.hcl
│   └── versions.tf
├── datadog-3.19.1
│   ├── .terraform.lock.hcl
│   └── versions.tf
├── null-3.0.0
│   ├── .terraform.lock.hcl
│   └── versions.tf
├── opsgenie-0.6.18
│   ├── .terraform.lock.hcl
│   └── versions.tf
├── opsgenie-0.6.20
│   ├── .terraform.lock.hcl
│   └── versions.tf
└── statuspage-1.0.0
    ├── .terraform.lock.hcl
    └── versions.tf

Script to pre-populate cache and generate terragrunt config (atlantis pre-workflow hook)

#!/bin/bash -e

TERRAGRUNT_ATLANTIS_CONFIG="/terragrunt/terragrunt-atlantis-config"

$TERRAGRUNT_ATLANTIS_CONFIG generate --output atlantis.yaml --parallel --create-workspace --automerge --filter='teams/*' --ignore-parent-terragrunt=false

function prepopulateCache() {
    trap "error" ERR

    for dir in hack/providers/*/
    do
        dir=${dir%*/}
        pushd "${dir}"
        terraform providers mirror /atlantis-data/.terraform-cache
        popd
    done
}

function error() {
    popd
    # Try redownloading the provider after wiping the cache
    echo "Wiping the cache and redownloading the provider"
    rm -rf /atlantis-data/.terraform-cache
    prepopulateCache
}

prepopulateCache

smitthakkar96 avatar Jan 18 '24 20:01 smitthakkar96

Prepopulating the provider cache as suggested by @lorengordon works like a charm; here is an example of how we do it in our Atlantis setup.

@smitthakkar96 is my understanding correct that the prepopulation of the cache de-risks the concurrency concerns when using TF_PLUGIN_CACHE_DIR? The assumption being that concurrent terraform init won't mess around with TF_PLUGIN_CACHE_DIR because all providers are already present?

Please let me know if I'm understanding this correctly. Thanks!

phil-relayfi avatar Jan 31 '24 04:01 phil-relayfi

We're looking into disk space and bandwidth usage in https://github.com/gruntwork-io/terragrunt/issues/2920.

brikis98 avatar Jan 31 '24 15:01 brikis98

The assumption being that concurrent terraform init won't mess around with TF_PLUGIN_CACHE_DIR because all providers are already present?

Yes, that is correct, but make sure all provider versions are in the cache and the hashes in the lock file match the provider hashes.
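
If the hashes drift, they can be re-recorded against the mirror, for example (the mirror path and platforms are illustrative):

# Re-record lock-file hashes from the local mirror for each platform you run on,
# so terraform init accepts the pre-populated providers as-is
terraform providers lock \
    -fs-mirror=/opt/terraform-provider-mirror \
    -platform=linux_amd64 \
    -platform=darwin_arm64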

smitthakkar96 avatar Jan 31 '24 15:01 smitthakkar96

Resolved in v0.56.4 release. Make sure to read Provider Caching.
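
A quick sketch of turning it on (the environment-variable form assumes Terragrunt's usual flag-to-env-var mapping):

# Enable the (still experimental) Terragrunt provider cache server for this run
terragrunt run-all apply --terragrunt-provider-cache

# Or, assuming the usual mapping of flags to environment variables:
export TERRAGRUNT_PROVIDER_CACHE=1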

lev-ok avatar Apr 10 '24 16:04 lev-ok

Resolved in v0.56.4 release. Make sure to read Provider Caching.

Nice!

Will try it out later. I noticed it's still labeled as an experimental feature. Anyone encountered any issues?

swissbuechi avatar Jun 21 '24 22:06 swissbuechi

Resolved in v0.56.4 release. Make sure to read Provider Caching.

Nice!

Will try it out later. I noticed it's still labeled as an experimental feature. Anyone encountered any issues?

Perhaps not exactly an issue, but we noticed that memory usage can skyrocket with --terragrunt-provider-cache, especially when using lots of modules. On less complex Terragrunt stacks it stayed pretty much the same as before. We run Terragrunt in CI/CD with GitLab runners on Kubernetes, and initial testing caused tons of OOMs.

bcha avatar Jul 12 '24 07:07 bcha

Perhaps not exactly an issue, but we noticed that memory usage can skyrocket with --terragrunt-provider-cache, especially when using lots of modules. On less complex Terragrunt stacks it stayed pretty much the same as before. We run Terragrunt in CI/CD with GitLab runners on Kubernetes, and initial testing caused tons of OOMs.

Most likely the increased memory usage is due to the simultaneous launch of several Terraform processes: once all providers are cached, Terragrunt launches all the Terraform runs at the same time.

lev-ok avatar Jul 15 '24 16:07 lev-ok

Perhaps not exactly an issue, but we noticed that memory usage can skyrocket with --terragrunt-provider-cache, especially when using lots of modules. On less complex Terragrunt stacks it stayed pretty much the same as before. We run Terragrunt in CI/CD with GitLab runners on Kubernetes, and initial testing caused tons of OOMs.

Most likely the increased memory usage is due to the simultaneous launch of several Terraform processes: once all providers are cached, Terragrunt launches all the Terraform runs at the same time.

Yeah that makes sense. Anyway it's a small price to pay, but something worth noting.

bcha avatar Jul 16 '24 05:07 bcha