
Standalone provider cache server

Open erpel opened this issue 1 year ago • 11 comments

Summary

Starting from the interest in #3231, this RFC makes the case for a terragrunt subcommand that starts a standalone provider cache server. Creating an RFC was suggested in that discussion.

Having a single cache server instance serving many parallel terragrunt invocations would improve efficiency on Terraform Automation & Collaboration Software (TACOS) platforms like Atlantis.

Motivation

Organisations running self-hosted TACOS platforms or similar services like Atlantis for merge request workflow integration often find themselves with systems that run terragrunt plan/apply and other commands frequently and in parallel. Native provider caching has proven lacking in such situations, which led to terragrunt adding its own provider caching functionality.

For these systems, having a single cache storage location is the most efficient way to use the cache. Past issues have shown that spinning up many cache servers in parallel pointing at the same directories can lead to locking issues, among other problems.

Running a single permanent cache server on the automation system host allows for the most efficient use of cache. All terragrunt processes launched could connect to the same cache server serving providers from a unified cache location without locking issues.

Proposal

Introduce a new subcommand like terragrunt cache-server. This command does nothing but start a cache server; it does not return unless a fatal error is encountered or a stop signal is received. Server parameters can be set using established terragrunt configuration methods like command line arguments or environment variables. These parameters include the cache dir location, registry configuration, listening host/port, and the authentication token.
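A sketch of what such an invocation could look like. The subcommand name and every flag below are hypothetical and subject to this RFC discussion, not an existing Terragrunt API:

```shell
# Hypothetical CLI sketch; none of these flags exist yet.
terragrunt cache-server \
  --provider-cache-dir /var/cache/terragrunt/providers \
  --host 127.0.0.1 \
  --port 5758 \
  --token "$CACHE_SERVER_TOKEN"
# Runs until a fatal error occurs or SIGINT/SIGTERM is received.
```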

Users of a system like Atlantis could add the cache server process to the host/pod running Atlantis and extend the Atlantis configuration to invoke terragrunt in pipelines with the settings needed to connect to the standalone cache server. These would include enabling caching and providing a server URI and authentication token. Adding these options would differ depending on the actual TACOS used.
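For illustration, the client side could reuse Terragrunt's existing provider cache environment variables. The variable names below match the current provider cache feature as I understand it; pointing them at an externally managed server rather than a self-spawned one is the new, hypothetical part:

```shell
# Assumed sketch: existing provider-cache settings, repurposed so that
# terragrunt connects to the standalone server instead of starting its own.
export TERRAGRUNT_PROVIDER_CACHE=1
export TERRAGRUNT_PROVIDER_CACHE_HOSTNAME=127.0.0.1
export TERRAGRUNT_PROVIDER_CACHE_PORT=5758
export TERRAGRUNT_PROVIDER_CACHE_TOKEN="$CACHE_SERVER_TOKEN"
terragrunt plan
```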

Technical Details

  • terragrunt
    • New command added. Expected to mostly hook in to existing functionality.
    • Option to instruct terragrunt to connect to an existing server when needed instead of starting a new one

Press Release

Standalone provider cache server for efficient TACOS hosting

Terragrunt introduces the ability to run a standalone cache server giving TACOS operators more control to ensure efficient reuse of downloaded providers.

A single cache server process supports a high degree of parallelism, enabling operators to scale workflow automation efficiently with minimal overhead.

The standalone cache server is available as of [RELEASE]. To learn more about integrating it with your self-hosted TACOS, check the documentation.

Drawbacks

Operating the cache server as an additional component increases the overhead for teams providing TACOS. This includes, but is not limited to, keeping it up to date and monitoring its availability.

Terragrunt might need to improve its handling of situations where a cache server is supposed to be used but can't be reached; the added complexity will complicate troubleshooting in some scenarios.

Sharing a cache server with untrusted entities could become possible through this, which might bring security issues like cache poisoning into the setup.

A long-running cache server shared across many terragrunt invocations may increase the requirements on the cache server implementation itself, compared to running a server for a short time with limited scope.

Alternatives

  • Using the cache server as is per terragrunt invocation
  • Falling back on OpenTofu/terraform built in caching options that may improve in the future
  • Implementing some form of provider caching outside of the realm of terragrunt

Migration Strategy

None required

Unresolved Questions

Are there other use cases for this outside of hosting systems that run terragrunt as a service integrated into team workflows (TACOS)?

Do other systems bring additional requirements to be able to integrate a standalone server?

Would running a central server on a network location be a useful scenario? This might introduce many additional security considerations compared to running via localhost only.

References

  • #3234
  • #3292
  • #3076
  • #3325

Proof of Concept Pull Request

No response

Support Level

  • [ ] I have Terragrunt Enterprise Support
  • [ ] I am a paying Gruntwork customer

Customer Name

No response

erpel avatar Aug 26 '24 11:08 erpel

Deleted a comment that was likely an attempt to get folks to download malware. Reported the user to GitHub.

yhakbar avatar Aug 26 '24 14:08 yhakbar

How do you imagine this external cache server is hosted, @erpel ? As a separate container/server with a file system mount to allow for the cached providers to be accessed?

Can you explain why multiple Terragrunt processes are running instead of a run-all invocation? If I'm understanding right, that would result in one Terragrunt process spinning up one goroutine for the server, and multiple goroutines for the underlying OpenTofu/Terraform executions, right?

yhakbar avatar Aug 26 '24 14:08 yhakbar

Thanks for your interest.

In our situation, I'd like to add the cache server as a separate container in the Atlantis pod, configure it to listen on localhost, and use an EFS file system mounted in both the main container and the cache container at the same path. The question about the file system made me realize that a cache server on a different host makes no sense, as both sides, cache server and client, require file system access.

Our setup with Atlantis has one instance covering several repositories, and some are "monorepos" with many teams working on them. We're not using any of the run-all commands at the moment; our structure is not laid out in a way that makes that immediately useful. Even with that remedied, unrelated MRs would still run as separate invocations, so terragrunt is likely to always be active multiple times in parallel.

erpel avatar Aug 26 '24 15:08 erpel

We are also looking for this feature but for a different use-case.

When we are running multiple terragrunt commands like this:

terragrunt apply -target 'aws_s3_bucket_policy.bucket_policy[0]'
terragrunt apply -target 'aws_s3_bucket_policy.bucket_policy[1]'

With each command, we wait for the provider cache server to start and stop.

On projects with many providers, starting the provider cache can be slow (~30 seconds). Having to wait for the provider cache to start between commands is quite cumbersome when it could be re-used across them.

Using the provider cache can become impractical if you have a lot of these commands to run.

gnuletik avatar Oct 03 '24 13:10 gnuletik

I +1 all of the above use cases - they all apply to our implementation.

I'll add that even without a TF collaboration framework, as a single developer working on a dozen terragrunt modules, planning and testing multiple times, I would enable a local cache server on my laptop. Avoiding the 2 sec spin-up time x 100 plans I would do that day; it's annoying.

alonalmog82 avatar Oct 30 '24 16:10 alonalmog82

Throwing thoughts out there. Could this just utilise existing object storage, rather than being an entirely new service to host? Like configuration which sets the equivalent of something like:

plugin_cache_dir = "s3://my-cache-bucket/plugin-cache"

The administrator can then choose whether this is an S3 or MinIO bucket, or some K8s volume management service (Longhorn? Not too familiar with K8s)

p5 avatar Nov 01 '24 00:11 p5

P5, using an external object store defeats the purpose of using a local cache. Additionally, this would require setting up an additional object store and force us to handle authentication with it.

alonalmog82 avatar Nov 02 '24 18:11 alonalmog82

So one of the listed alternatives is:

Using the cache server as is per terragrunt invocation

Question: are multiple HTTPS cache servers running in parallel thread safe? I understand it is not super efficient to have several identical HTTPS proxy servers running, but for now I'm more interested in whether this at least solves the locking problem.

mwos-sl avatar Nov 19 '24 14:11 mwos-sl

Responding to some of the comments shared so far:

@gnuletik , have you tried disabling the server when you're about to run multiple commands on the same unit (docs if this new terminology is confusing)? The provider should be re-used from the .terragrunt-cache directory, meaning you don't even need the provider cache. I know it's not as convenient as having this feature available, but I wanted you to know that you have that option, as we're not prioritizing this right now.

@mwos-sl , it should be safe, as filesystem locking is used to control concurrent access to shared providers rather than anything within the Terragrunt process. Our CI actually runs many different Terragrunt invocations in parallel with multiple provider cache servers running concurrently, so I believe it should be pretty safe. We don't explicitly test for validation of this behavior, though, so I would encourage you to submit an issue or pull request to add more testing if you find that it doesn't work for your use-case.
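As a toy illustration of the advisory file locking described here (using flock(1) directly, not Terragrunt's actual implementation), concurrent writers serialized on a shared lock file never interleave their critical sections:

```shell
#!/bin/sh
# Toy demo of advisory file locking with flock(1): two concurrent writers
# take turns on the same lock file, so their critical sections never overlap.
out=$(mktemp)
lock=$(mktemp)
for i in 1 2; do
  (
    flock 9                       # block until the lock file is held
    echo "writer $i: start" >> "$out"
    sleep 0.2                     # simulate work inside the critical section
    echo "writer $i: end" >> "$out"
  ) 9>"$lock" &
done
wait
cat "$out"
```

Whichever writer acquires the lock first finishes its start/end pair before the other begins, which is the property the shared provider directory relies on.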

In general: It seems like there's appetite for this feature, and we want to support the community in using Terragrunt how they need it to work.

To that end, I would like to ask that the following be done for this RFC:

  1. Update the RFC with additional details for the CLI command that would be introduced for this new capability. Take a look, specifically, at this documentation for guidance on how we're trying to standardize the API for the Terragrunt CLI.

    Working out exactly what the command would be called, what the flags would be, what the required values would be are all important to think through now before anything is implemented.

  2. Update the RFC with additional details for the architecture of the provider cache server. Considerations like security, availability, monitoring and UI/UX are all important here.

    It should be clear exactly what a user is going to need to know for the care and feeding of the cache server, and how they're going to manage it.

  3. A PoC of a standalone provider cache server. Given that the maintainers cannot dedicate resources to this right now, please attempt an implementation and submit the in-draft pull request as a PoC that this can be done, and done well.

    If we have that PoC to prove out the concept, we'd be happy to work with the person submitting the PR to get tests, documentation, security, monitoring, etc up to the point where the community can submit an initial implementation of this functionality.

This work doesn't have to all be done by one person. If you are interested in this functionality, please offer to chip in on some of this work so that it doesn't all fall on @erpel . Thanks for your understanding and cooperation.

yhakbar avatar Nov 19 '24 15:11 yhakbar

Thanks for the feedback @yhakbar!

Yes, disabling the provider cache can be a workaround in order to run multiple terragrunt commands in a row. I'll use that next time I need it.

I'm wondering: if commands like terragrunt apply don't need the provider cache to run properly, why not skip the provider cache startup when running those commands?

gnuletik avatar Nov 20 '24 15:11 gnuletik

We have an issue to investigate it here: #3325, but it's not necessarily the case that any terragrunt apply won't benefit from the provider cache server.

For example, say you have a bunch of units in your repository, and each of them fetches a lot of providers.

In this example, you might be including a root terragrunt.hcl configuration that will result in fetching aws, azure, gcp, helm, etc providers.

When doing that, each unit is going to end up downloading all the providers whenever it needs to do anything. So, if you just did a terragrunt run-all apply on everything in your repository, you would probably like to re-use the already downloaded providers when you run terragrunt apply on a particular unit, as the cost of spinning up the provider cache server is probably lower than fetching all those providers.

The scenario you're experiencing is one where you have multiple applies that need to happen on the same unit, so you are definitely going to have the provider cached on subsequent applies, which might make it worth the cost of the initial provider fetch.

One solution, which occurred to me while writing this, is to evaluate whether a terragrunt apply is being run instead of a terragrunt run-all apply for a unit, and to only set TF_PLUGIN_CACHE_DIR as a consequence. That would result in accessing the shared cache used by the provider cache server without spinning up the server itself.
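A hedged sketch of that idea in shell; the cache path below is an assumed default, and the sketch presumes the provider cache server feature itself is left disabled:

```shell
# Sketch: for a single-unit apply, skip the provider cache server and point
# OpenTofu at the shared plugin directory directly via its standard variable.
# The path is an assumption; adjust to your actual cache location.
export TF_PLUGIN_CACHE_DIR="$HOME/.cache/terragrunt/providers"
mkdir -p "$TF_PLUGIN_CACHE_DIR"
terragrunt apply   # providers read from/written to the dir, no server spun up
```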

As I commented in the linked issue, that's also problematic because the unit might have dependencies that need to fetch the same providers. Users would still be at risk of a race condition between multiple OpenTofu runs without a provider cache server to mediate the provider fetch.

e.g. Unit baz depends on unit foo and bar. All of the units require the aws provider. When running terragrunt apply on baz, both foo and bar try to fetch the aws provider at the same time and explode.

yhakbar avatar Nov 20 '24 16:11 yhakbar

If what you want a standalone cache server for is to mitigate Terraform's lack of a concurrent-safe plugin cache, you can fix that with OverlayFS as explained here. Hope it helps!
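For context, the OverlayFS trick works by giving each run a private writable layer over a shared read-only plugin cache, so concurrent runs never write to the shared copy. A rough sketch (requires root; all paths are illustrative):

```shell
# Shared read-only provider cache as the lower layer; each run gets private
# upper/work dirs, so concurrent runs cannot corrupt the shared cache.
run=/tmp/tf-cache-$$
mkdir -p "$run/upper" "$run/work" "$run/merged"
sudo mount -t overlay overlay \
  -o "lowerdir=/var/cache/tf-plugins,upperdir=$run/upper,workdir=$run/work" \
  "$run/merged"
export TF_PLUGIN_CACHE_DIR="$run/merged"
```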

ricardbejarano avatar Apr 16 '25 15:04 ricardbejarano

Note: with the release of OpenTofu 1.10, there is a locking mechanism for providers, which helps a lot. However:

  1. We've seen some failures acquiring the lock
  2. for modules without a .terragrunt-cache directory and without a .terraform.lock.hcl file, the cache server allows providers to be "downloaded" from local disk, which is obviously WAY FASTER than downloading them from the internet. So the cache server is still a great addition.

But, as explained in this thread: with the cache server enabled and many plan/apply runs (while writing new units or modules, or debugging stuff), starting the cache server every time is inefficient.

=> So having a long running cache server would be great, as it would cover this ticket and a few associated ones.

jgournet avatar Jun 26 '25 22:06 jgournet

We'll be introducing experimental support for automatically configuring the OpenTofu provider cache directory. If you want to try it out, you can use this alpha release: https://github.com/gruntwork-io/terragrunt/releases/tag/alpha-20250626

Could you share what problems you've experienced while trying to acquire the lock in your testing, @jgournet ? Also, why not have the lock files committed? That way you don't have to "download" them from anywhere, you can just ensure that you have access to the cache directory (e.g. ~/.cache/terragrunt/providers on macOS).

yhakbar avatar Jun 26 '25 22:06 yhakbar

Error found:

13:26:33.227 STDERR [_label] tofu: │ Error while installing hashicorp/aws v6.0.0: unable to acquire file lock on
13:26:33.227 STDERR [_label] tofu: │ "../../../../.tf-plugins/registry.opentofu.org/hashicorp/aws/6.0.0/linux_amd64.lock":
13:26:33.228 STDERR [_label] tofu: │ resource temporarily unavailable

As for not committing the lock files: we tried in the past, but we found it was an additional burden, so we ended up removing them from source control.
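As an aside, lock files need not be maintained by hand. Assuming the OpenTofu CLI carries over Terraform's `providers lock` subcommand (it inherited it from Terraform 1.5.x), they can be regenerated per unit for the platforms in use and then committed:

```shell
# Regenerate .terraform.lock.hcl for the platforms your team and CI use,
# then commit the file alongside the unit's configuration.
cd path/to/unit    # illustrative path
tofu providers lock \
  -platform=linux_amd64 \
  -platform=darwin_arm64
```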

jgournet avatar Jun 26 '25 23:06 jgournet

Interesting. Are you able to reliably reproduce the error? Are there conditions under which you think I'd be able to reproduce the issue?

yhakbar avatar Jun 26 '25 23:06 yhakbar

Interesting. Are you able to reliably reproduce the error? Are there conditions under which you think I'd be able to reproduce the issue?

Arg, good questions! And no: it only happened to one of us, once, when doing a massive "init" across many units.

This led us to try the cache server again, with very good results - except for the fairly long startup time, which led me to this ticket. So at the moment, we're trying to find ways to enable the cache server only when working with many units (or hoping this ticket gets implemented :) )

I've just read the alpha release you mentioned above, and now understand your questions: the error above happened once, with:

export TF_PLUGIN_CACHE_DIR=${HOME}/.tfplugins

knowing we have a "before_hook" that does mkdir -p $TF_PLUGIN_CACHE_DIR

I guess I should probably follow up with a ticket in tofu to get this addressed - then again, we have seen this only once so far.

jgournet avatar Jun 26 '25 23:06 jgournet

Can you quantify "massive number of units"? I'd like to know at what scale users might be advised to use the provider cache server, or whether I can help diagnose any locking issues in OpenTofu.

yhakbar avatar Jun 26 '25 23:06 yhakbar

Can you quantify "massive number of units"? I'd like to know at what scale users might be advised to use the provider cache server, or whether I can help diagnose any locking issues in OpenTofu.

you're right, massive is very subjective: it's actually not that big

find . -name terragrunt.hcl | wc -l
122

(it feels big for us, and for our laptops)

then again: it's a one-time issue so far ... pretty sure people with way bigger architectures will report issues if they find any?

jgournet avatar Jun 26 '25 23:06 jgournet

And how many providers are you working with, on average across these units? Just one or two?

yhakbar avatar Jun 26 '25 23:06 yhakbar

from 2/3 providers, up to 7-8 per unit

jgournet avatar Jun 26 '25 23:06 jgournet

@yhakbar does Terragrunt generate the lockfiles automatically when using OpenTofu's builtin provider cache?

cam72cam avatar Jul 30 '25 12:07 cam72cam

@cam72cam , Terragrunt shouldn't do any manual lockfile manipulation when working with OpenTofu's builtin provider cache. It should simply init in the root module it sets up, then copy the lockfile back out of the .terragrunt-cache directory if necessary.

yhakbar avatar Jul 30 '25 12:07 yhakbar

Ok, that would make sense of these failures. OpenTofu will re-download the provider if it's not already in the lockfile, regardless of what's in the global provider cache.

By enabling this cache on projects without lockfiles, it effectively serializes provider installation.

cam72cam avatar Jul 30 '25 13:07 cam72cam

OpenTofu 1.10.5 has been released which should resolve this issue.

If you are not using a lockfile with TF_PLUGIN_CACHE_DIR, please take a look at https://github.com/opentofu/opentofu/pull/3078

cam72cam avatar Aug 01 '25 18:08 cam72cam

@jgournet have you had a chance to continue experimenting after the release of OpenTofu 1.10.5?

We're looking to stabilize the Auto Provider Cache Dir feature by the end of the quarter if possible, and feedback from the community that there aren't scaling limitations, etc. that require manual intervention would be great. Once stable, all Terragrunt users (that are also OpenTofu 1.10.5+ users) will automatically have the provider cache directory set by default, so we want to make sure that we get positive signal from the community before flipping that switch.

We'll leave this issue open to continue to explore the possibility of a standalone provider cache server for users with network attached filesystems, etc. that can't leverage the OpenTofu Provider Cache Directory.

yhakbar avatar Aug 12 '25 18:08 yhakbar

@yhakbar : I'll give it a try - thanks for the work on this !

jgournet avatar Aug 12 '25 22:08 jgournet

To be clear, we are all thanking @cam72cam for the hard work he and the OpenTofu core team have been doing to enable this functionality.

yhakbar avatar Aug 12 '25 22:08 yhakbar

Trying to plan ~139 terragrunt files, with the biggest group having about ~40 units:

[attached image: plan run results]

at least, it does not crash anymore, but that's not a great experience either :)

jgournet avatar Aug 12 '25 23:08 jgournet

As for not committing the lock files: we tried in the past, but that we found it was an additional burden; so we ended up removing them from source code

If you do not have lock files, the provider cache ~~server~~ directory does not provide any benefits as of today.

cam72cam avatar Aug 14 '25 12:08 cam72cam