terraform icon indicating copy to clipboard operation
terraform copied to clipboard

Allow multiple Terraform instances to write to `plugin_cache_dir` concurrently

Open amkartashov opened this issue 1 year ago • 10 comments

Terraform Version

1.3.2

Terraform Configuration Files

plugin_cache_dir   = "$HOME/.terraform.d/plugin-cache"

Debug Output

no debug

Expected Behavior

Multiple terraform processes should work fine with the same plugin_cache_dir

Actual Behavior

Initializing provider plugins...

  • Reusing previous version of hashicorp/aws from the dependency lock file
  • Using hashicorp/aws v4.34.0 from the shared cache directory
  • Reusing previous version of hashicorp/aws from the dependency lock file
  • Using hashicorp/aws v4.34.0 from the shared cache directory
  • Using hashicorp/aws v4.34.0 from the shared cache directory
  • Using hashicorp/aws v4.34.0 from the shared cache directory
  • Installing datadog/datadog v3.16.0... ╷ │ Error: Failed to install provider from shared cache │ │ Error while importing hashicorp/aws v4.34.0 from the shared cache │ directory: the provider cache at .terraform/providers has a copy of │ registry.terraform.io/hashicorp/aws 4.34.0 that doesn't match any of the │ checksums recorded in the dependency lock file.

Steps to Reproduce

I'm using terragrunt, which does initialization of multiple tf stacks at once.

Additional Context

Race condition between two terraform init happens when they are trying to install the same provider same version. First tf calls installFromHTTPURL and it downloads to a temporary file with random name, but then it calls installFromLocalArchive and this unpacks directly to global plugins cache directory - this is there race condition occurs.

  1. Targed dir set to globalCacheDir https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/installer.go#L470
  2. Dir method InstallPackage is called here https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/installer.go#L482
  3. InstallPackage calls installFromHTTPURL here https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/dir_modify.go#L34
  4. installFromHTTPURL downloads archive to temporary file, no possibility for race condition, good: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L56
  5. installFromHTTPURL calls installFromLocalArchive: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L97
  6. installFromLocalArchive starts decompressing directly to final path: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L128

So if another terraform init happens to see half-unpacked plugin in the middle of step 6 - it will use not ready file.

References

https://github.com/gruntwork-io/terragrunt/issues/1875

amkartashov avatar Oct 07 '22 12:10 amkartashov

Thanks for the report. This behaviour is expected, and described in the plugin_cache_dir documentation:

Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.

I can't find an enhancement request describing the desired outcome of being able to use multiple instances of terraform init with a global plugin cache, so I'm retitling and using this one. Another maintainer may be able to link to an existing issue and close this.

alisdair avatar Oct 07 '22 13:10 alisdair

@alisdair thanks!

amkartashov avatar Oct 07 '22 15:10 amkartashov

If we do want to change something to make this concurrency-safe then I think a key requirement for us to navigate is ensuring we don't break anyone who currently has their cache directory on a network filesystem where e.g. filesystem-level locking may not be available or may not be reliable.

We didn't explicitly document that the plugin cache directory must be on a filesystem that is only visible to the current kernel, and I've seen questions online in the past which imply that people are already doing that and so I think that existing behavior is de-facto covered by the v1.x Compatibility Promises because when our documentation was ambiguous about something we typically favor keeping existing usage patterns working unless there's a strong reason to break them.

(I don't intend this to mean that we absolutely cannot add an additional restriction here, but if we wish to do that then I think we'll need to justify that the benefit outweighs the cost, and probably also provide a reasonable migration path or backward-compatibility mechanism for those who are relying on our current lack of global locking for the cache directory.)


Hopefully we can instead devise a solution which relies on the atomicity of certain filesystem operations rather than on explicit locking.

For example: perhaps Terraform could initially populate a directory named so that other Terraform processes won't find it, and then move it into its final location using an atomic move/rename system call. If the move/rename fails due to a conflicting directory of the same name, then the process which saw the error can scan the directory that already exists and see if it matches what it was trying to create and treat that as a success if so.

I recall that we originally didn't attempt this because it wasn't clear that we'd be able to provide the same guarantee across all platforms Terraform targets. In particular, I recall learning that Windows has different guarantees about the atomicity of rename operations than Unix-derived kernels typically do. For that reason I expect that a big part of the design effort for this issue will be to determine whether we can rely on some sort of atomic cache add on at least the primary supported OSes: Linux, macOS, and Windows.

The strategy does not necessarily need to be the same on all three platforms because the package directories in the cache are platform-specific anyway, so a process running on one platform should not observe a cache write from a process running on another platform. However, hybrid platforms like Microsoft's WSL might present unique problems if they end up imposing the filesystem guarantees from one platform to code that believes it's running on another platform.

apparentlymart avatar Oct 10 '22 19:10 apparentlymart

@apparentlymart may be it'll work w/o locking? As installFromHTTPURL downloads to a temporary file, why installFromLocalArchive can't do the same? Unpack to a temporary location and then move to final path. I believe this won't break backwards compatibility. Other processes won't see half-unpacked file.

amkartashov avatar Oct 11 '22 10:10 amkartashov

As far as I'm aware, none of the installation methods are truly atomic today: even if we first extract into a temporary directory and then copy into the final location, there is no way to atomically copy a whole directory tree and so there will still be a window of time where another concurrent process can observe a directory that exists but doesn't yet have all of the expected files inside it, or has partial content for one file. In both of those situations the observing process will calculate a checksum from the partial data and so reject the incomplete directory.

I think the goal/requirement here is that at any instant there is either a fully complete and correct package in the cache or no package at all. If we can make that true then we can achieve safe concurrent use without any need for locks.

apparentlymart avatar Oct 11 '22 14:10 apparentlymart

Linking https://github.com/hashicorp/terraform/issues/25849 which has more history on this request

joe-a-t avatar Oct 18 '22 14:10 joe-a-t

On Linux (only), renameat() supports a RENAME_NOREPLACE flag which can help with atomicity surprises. Windows has MoveFileEx where not overwriting existing files seems to be the default.

As the plugin cache is only a cache, failure to write any entry should not block continuing with the original operation.

sftim avatar Nov 21 '22 17:11 sftim

@jbardin over in https://github.com/hashicorp/terraform/issues/32915 I encountered the "text file busy" issue and you pointed to here for an upcoming fix. It happened to me again and I was able to investigate:

....
- Installing hashicorp/aws v4.64.0...
╷
│ Error: Failed to install provider
│ 
│ Error while installing hashicorp/aws v4.64.0: open
│ /home/mkeisler/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5:
│ text file busy
╵....
❯ lsof /home/mkeisler/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5
COMMAND      PID     USER  FD   TYPE DEVICE  SIZE/OFF    NODE NAME
terraform 405002 mkeisler txt    REG  259,5 354156544 6486037 /home/mkeisler/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5
....
❯ ps -fp 405002
UID          PID    PPID  C STIME TTY          TIME CMD
mkeisler  405002    2809  0 Apr26 ?        00:00:34 .terraform/providers/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5

So the short of it is that the provider executable itself is just sitting there spinning. I am using terragrunt, fyi. The parent ID is my /lib/systemd/systemd --user process (I'm on ubuntu 22).

Edit: verified that this happens using straight terraform without using terragrunt. Verified with terraform v1.4.4

grimm26 avatar Apr 27 '23 19:04 grimm26

We can work around locking issue; What we can do is to copy files from temporary location to the target directory and create a special file (e.g. active.true) after copy is done - if the file is present, then we can use this directory as cache source.

It is possible that more than 1 concurrent jobs will create duplicate directories with provider cache; in case we have duplicates, we can just pick the first occurrence, and delete remaining ones.

Also, we will need a process to periodically scan for incomplete cache directories and remove them.

ns-vpanfilov avatar Jun 16 '23 16:06 ns-vpanfilov

Locks (as files or dirs) do not work.

  1. Just remember the state of .terraform/ dir - which files exist and which do not
  2. Init everything async
  3. If something fails - remove new stuff from .terraform/ and try once again.
  4. Repeat step 3 a few times before giving up.

Based on the pain of lock implementation in https://github.com/antonbabenko/pre-commit-terraform/pull/620

Also, it just works >5 times quicker, when you init 50+ dirs at once, compared to realization with lock mechanizm

MaxymVlasov avatar Feb 13 '24 00:02 MaxymVlasov

A possibly naive question, but why does terraform init with plugin_cache_dir try to overwrite an existing provider installation rather than identifying that the provider is already present (i.e. stat, checksum/validate signature of existing binary)? If version/platform weren't already in the path, sure, but shouldn't the existing forced-overwrite behaviour be gated behind a -force-update-provider flag (or similar)?

(We face the issue where an active terraform (plan|apply) will fail a concurrent terraform init when the two modules share a plugin_cache_dir and provider)

isometry avatar Mar 19 '24 14:03 isometry

A possibly naive question, but why does terraform init with plugin_cache_dir try to overwrite an existing provider installation rather than identifying that the provider is already present (i.e. stat, checksum/validate signature of existing binary)? If version/platform weren't already in the path, sure, but shouldn't the existing forced-overwrite behaviour be gated behind a -force-update-provider flag (or similar)?

(We face the issue where an active terraform (plan|apply) will fail a concurrent terraform init when the two modules share a plugin_cache_dir and provider)

I have the same question. Seems like an odd behavior to overwrite a binary that was already "cached". Kind of defeats the purpose of the cache in the first place.

ranesagar avatar Mar 27 '24 19:03 ranesagar

Not to say that re-downloading existing (vs checking their checksum and comparing with desired one) is probably costing terraform (or github) a non-insignificant amount of traffic)! For reference, terraform providers lock from terragrunt (on a non-trivial scenario) takes 15m to process 21 providers (for 5 platforms). It downloads some of them 2x prob due to parallel processing If we wanted to do this for the whole tree, I would say some 2h just to refresh cache and/or update providers lock - it seems to take same time regardless of whether they are in cache or not. Would really love to see this fixed in some manner :)

joaocc avatar Mar 27 '24 20:03 joaocc

If we wanted to do this for the whole tree, I would say some 2h just to refresh cache and/or update providers lock - it seems to take same time regardless of whether they are in cache or not. Would really love to see this fixed in some manner :)

That's already fixed - https://github.com/hashicorp/terraform/pull/34632, it will be GA when 1.8.0 is released, now it exists only in pre-releases for 2 months.

MaxymVlasov avatar Mar 28 '24 12:03 MaxymVlasov