terraform
terraform copied to clipboard
Allow multiple Terraform instances to write to `plugin_cache_dir` concurrently
Terraform Version
1.3.2
Terraform Configuration Files
plugin_cache_dir = "$HOME/.terraform.d/plugin-cache"
Debug Output
no debug
Expected Behavior
Multiple terraform processes should work fine with the same plugin_cache_dir
Actual Behavior
Initializing provider plugins...
- Reusing previous version of hashicorp/aws from the dependency lock file
- Using hashicorp/aws v4.34.0 from the shared cache directory
- Reusing previous version of hashicorp/aws from the dependency lock file
- Using hashicorp/aws v4.34.0 from the shared cache directory
- Using hashicorp/aws v4.34.0 from the shared cache directory
- Using hashicorp/aws v4.34.0 from the shared cache directory
- Installing datadog/datadog v3.16.0... ╷ │ Error: Failed to install provider from shared cache │ │ Error while importing hashicorp/aws v4.34.0 from the shared cache │ directory: the provider cache at .terraform/providers has a copy of │ registry.terraform.io/hashicorp/aws 4.34.0 that doesn't match any of the │ checksums recorded in the dependency lock file.
Steps to Reproduce
I'm using terragrunt, which does initialization of multiple tf stacks at once.
Additional Context
Race condition between two terraform init happens when they are trying to install the same provider same version. First tf calls installFromHTTPURL
and it downloads to a temporary file with random name, but then it calls installFromLocalArchive
and this unpacks directly to global plugins cache directory - this is there race condition occurs.
- Targed dir set to
globalCacheDir
https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/installer.go#L470 - Dir method
InstallPackage
is called here https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/installer.go#L482 -
InstallPackage
callsinstallFromHTTPURL
here https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/dir_modify.go#L34 -
installFromHTTPURL
downloads archive to temporary file, no possibility for race condition, good: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L56 -
installFromHTTPURL
callsinstallFromLocalArchive
: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L97 -
installFromLocalArchive
starts decompressing directly to final path: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L128
So if another terraform init
happens to see half-unpacked plugin in the middle of step 6 - it will use not ready file.
References
https://github.com/gruntwork-io/terragrunt/issues/1875
Thanks for the report. This behaviour is expected, and described in the plugin_cache_dir
documentation:
Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.
I can't find an enhancement request describing the desired outcome of being able to use multiple instances of terraform init
with a global plugin cache, so I'm retitling and using this one. Another maintainer may be able to link to an existing issue and close this.
@alisdair thanks!
If we do want to change something to make this concurrency-safe then I think a key requirement for us to navigate is ensuring we don't break anyone who currently has their cache directory on a network filesystem where e.g. filesystem-level locking may not be available or may not be reliable.
We didn't explicitly document that the plugin cache directory must be on a filesystem that is only visible to the current kernel, and I've seen questions online in the past which imply that people are already doing that and so I think that existing behavior is de-facto covered by the v1.x Compatibility Promises because when our documentation was ambiguous about something we typically favor keeping existing usage patterns working unless there's a strong reason to break them.
(I don't intend this to mean that we absolutely cannot add an additional restriction here, but if we wish to do that then I think we'll need to justify that the benefit outweighs the cost, and probably also provide a reasonable migration path or backward-compatibility mechanism for those who are relying on our current lack of global locking for the cache directory.)
Hopefully we can instead devise a solution which relies on the atomicity of certain filesystem operations rather than on explicit locking.
For example: perhaps Terraform could initially populate a directory named so that other Terraform processes won't find it, and then move it into its final location using an atomic move/rename system call. If the move/rename fails due to a conflicting directory of the same name, then the process which saw the error can scan the directory that already exists and see if it matches what it was trying to create and treat that as a success if so.
I recall that we originally didn't attempt this because it wasn't clear that we'd be able to provide the same guarantee across all platforms Terraform targets. In particular, I recall learning that Windows has different guarantees about the atomicity of rename operations than Unix-derived kernels typically do. For that reason I expect that a big part of the design effort for this issue will be to determine whether we can rely on some sort of atomic cache add on at least the primary supported OSes: Linux, macOS, and Windows.
The strategy does not necessarily need to be the same on all three platforms because the package directories in the cache are platform-specific anyway, so a process running on one platform should not observe a cache write from a process running on another platform. However, hybrid platforms like Microsoft's WSL might present unique problems if they end up imposing the filesystem guarantees from one platform to code that believes it's running on another platform.
@apparentlymart may be it'll work w/o locking? As installFromHTTPURL
downloads to a temporary file, why installFromLocalArchive
can't do the same? Unpack to a temporary location and then move to final path. I believe this won't break backwards compatibility. Other processes won't see half-unpacked file.
As far as I'm aware, none of the installation methods are truly atomic today: even if we first extract into a temporary directory and then copy into the final location, there is no way to atomically copy a whole directory tree and so there will still be a window of time where another concurrent process can observe a directory that exists but doesn't yet have all of the expected files inside it, or has partial content for one file. In both of those situations the observing process will calculate a checksum from the partial data and so reject the incomplete directory.
I think the goal/requirement here is that at any instant there is either a fully complete and correct package in the cache or no package at all. If we can make that true then we can achieve safe concurrent use without any need for locks.
Linking https://github.com/hashicorp/terraform/issues/25849 which has more history on this request
On Linux (only), renameat()
supports a RENAME_NOREPLACE
flag which can help with atomicity surprises. Windows has MoveFileEx
where not overwriting existing files seems to be the default.
As the plugin cache is only a cache, failure to write any entry should not block continuing with the original operation.
@jbardin over in https://github.com/hashicorp/terraform/issues/32915 I encountered the "text file busy" issue and you pointed to here for an upcoming fix. It happened to me again and I was able to investigate:
....
- Installing hashicorp/aws v4.64.0...
╷
│ Error: Failed to install provider
│
│ Error while installing hashicorp/aws v4.64.0: open
│ /home/mkeisler/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5:
│ text file busy
╵....
❯ lsof /home/mkeisler/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
terraform 405002 mkeisler txt REG 259,5 354156544 6486037 /home/mkeisler/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5
....
❯ ps -fp 405002
UID PID PPID C STIME TTY TIME CMD
mkeisler 405002 2809 0 Apr26 ? 00:00:34 .terraform/providers/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5
So the short of it is that the provider executable itself is just sitting there spinning. I am using terragrunt, fyi. The parent ID is my /lib/systemd/systemd --user
process (I'm on ubuntu 22).
Edit: verified that this happens using straight terraform without using terragrunt. Verified with terraform v1.4.4
We can work around locking issue; What we can do is to copy files from temporary location to the target directory and create a special file (e.g. active.true
) after copy is done - if the file is present, then we can use this directory as cache source.
It is possible that more than 1 concurrent jobs will create duplicate directories with provider cache; in case we have duplicates, we can just pick the first occurrence, and delete remaining ones.
Also, we will need a process to periodically scan for incomplete cache directories and remove them.
Locks (as files or dirs) do not work.
- Just remember the state of
.terraform/
dir - which files exist and which do not - Init everything async
- If something fails - remove new stuff from
.terraform/
and try once again. - Repeat step 3 a few times before giving up.
Based on the pain of lock implementation in https://github.com/antonbabenko/pre-commit-terraform/pull/620
Also, it just works >5 times quicker, when you init 50+ dirs at once, compared to realization with lock mechanizm
A possibly naive question, but why does terraform init
with plugin_cache_dir
try to overwrite an existing provider installation rather than identifying that the provider is already present (i.e. stat, checksum/validate signature of existing binary)? If version/platform weren't already in the path, sure, but shouldn't the existing forced-overwrite behaviour be gated behind a -force-update-provider
flag (or similar)?
(We face the issue where an active terraform (plan|apply)
will fail a concurrent terraform init
when the two modules share a plugin_cache_dir
and provider)
A possibly naive question, but why does
terraform init
withplugin_cache_dir
try to overwrite an existing provider installation rather than identifying that the provider is already present (i.e. stat, checksum/validate signature of existing binary)? If version/platform weren't already in the path, sure, but shouldn't the existing forced-overwrite behaviour be gated behind a-force-update-provider
flag (or similar)?(We face the issue where an active
terraform (plan|apply)
will fail a concurrentterraform init
when the two modules share aplugin_cache_dir
and provider)
I have the same question. Seems like an odd behavior to overwrite a binary that was already "cached". Kind of defeats the purpose of the cache in the first place.
Not to say that re-downloading existing (vs checking their checksum and comparing with desired one) is probably costing terraform (or github) a non-insignificant amount of traffic)!
For reference, terraform providers lock
from terragrunt (on a non-trivial scenario) takes 15m to process 21 providers (for 5 platforms).
It downloads some of them 2x prob due to parallel processing
If we wanted to do this for the whole tree, I would say some 2h just to refresh cache and/or update providers lock - it seems to take same time regardless of whether they are in cache or not.
Would really love to see this fixed in some manner :)
If we wanted to do this for the whole tree, I would say some 2h just to refresh cache and/or update providers lock - it seems to take same time regardless of whether they are in cache or not. Would really love to see this fixed in some manner :)
That's already fixed - https://github.com/hashicorp/terraform/pull/34632, it will be GA when 1.8.0 is released, now it exists only in pre-releases for 2 months.