terraform icon indicating copy to clipboard operation
terraform copied to clipboard

Unpack TF plugins in a more atomic way.

Open mplzik opened this issue 2 years ago • 29 comments
trafficstars

The original implementation of plugin cache handling unpacks plugin archives in a way that can result in race conditions when multiple terraform processes are accessing the same plugin cache. Since the archives are being decompressed in chunks and output files are written directly to the cache, we observed following manifestations of the issue:

  • text file busy errors if other terraform processes are already running the plugin and another process tries to replace it.
  • various plugin checksum errors triggered likely by simultaneous checksumming and rewriting the file.

This PR changes the zip archives with plugins are handled:

  1. Instead of writing directly to the target directory, installFromLocalArchive is now writing to a temporary staging directory prefixed with.temp that resides in the plugin cache dir to ensure this is on the same filesystem.
  2. After unpacking, the directory structure of the staging directory is replicated in the targetDir. The directories are created as needed and any files are moved using os.Rename. After this, the staging directory is removed.
  3. Since the temporary staging directories can be picked up by SearchLocalDirectory and make it fail in the situation when they're removed during directory traversal, the function has been modified to skip any entry that starts with dot.

Signed-off-by: Milan Plzik [email protected]

Fixes #31964

Target Release

1.5.x

Draft CHANGELOG entry

BUG FIXES

  • Concurrent write access to plugin cache directory has been fixed.

mplzik avatar Jul 06 '23 09:07 mplzik

CLA assistant check
All committers have signed the CLA.

hashicorp-cla avatar Jul 06 '23 09:07 hashicorp-cla

Thanks for this submission, I will raise it in triage.

crw avatar Jul 06 '23 20:07 crw

Oh please this would make me so happy, thanks

brandon-fryslie avatar Jul 28 '23 23:07 brandon-fryslie

Thanks for the reminder, and thanks @mplzik again for the submission.

The review from triage is that this change would require more due diligence. One issue with this implementation is that it relies on os.rename being atomic, which it is not. This may make the concurrency issue worse on different platforms.

We also noticed that the documentation should call out more clearly that the plugin cache is not concurrency-safe.

I'll leave this PR open for discussion for the moment, but will likely close it after a time as the feeling was it would be better to redesign a concurrency-safe plugin cache, which would require more discussion and prioritization.

Thanks again for this submission!

crw avatar Jul 31 '23 18:07 crw

@crw I definitely agree that this is not solution to all the problems -- hence my comment // On supported platforms, this should perform atomic replacement of the file., with supported meaning platforms where this is supported for os.Rename. Speaking of this, would hiding this change behind a (default off) flag would make this PR more acceptable? I understand that implementing a really atomic cache would be preferred, but likely also entails much bigger engineering time investment (and it hasn't been implemented for a longer time, suggesting it might not be trivial).

Let me also do a bit of an argument in favor of merging this PR from the risk assessment side.

The golang documentation is not going too far on describing os.Rename non-atomicity, but https://stackoverflow.com/questions/30385225/is-there-an-os-independent-way-to-atomically-overwrite-a-file goes a bit farther in describing the platform-specific behavior. Specifically (assuming files on the same volume):

  • On Linux, this operations should be atomic (as much as the underlying filesystems can guarantee), translating into Renameat syscall
  • On Windows, the golang implementation uses MoveFileEx, which according to information available is trying to do atomic replace if applicable, resorting to delete/copy operation if this is not supported. (see https://social.msdn.microsoft.com/Forums/windowsdesktop/en-US/449bb49d-8acc-48dc-a46f-0760ceddbfc3/movefileexmovefilereplaceexisting-ntfs-same-volume-atomic, Doug's answer)
  • On MacOS, this translates to Rename syscall which is supposed to be atomic.

I see two main differences:

  • The renaming approach uses more disk space temporarily, so some real-world setups that are on almost full disks might be tipped to exceeding available disk space. This might, however, happen also with increasing size of the providers.
  • In the windows case, the file might be deleted (and thus stop existing) for a short period of time if atomic rename is not supported. I believe this will trigger mostly the same issues that are manifesting now (although I can't provide more data to prove that -- our infrastructure is not using any Windows machines to run Terraform).

mplzik avatar Aug 04 '23 07:08 mplzik

Hi. Just some 0.02€ from userland. We (and many others using terraform directly or via wrappers - in my case terragrunt) are having a terrible and very painfull time due to the concurrency issues, stopping us from moving ahead, and causing uncertainty in processes which should be quite trivial. I would tend to agree with @mplzik in that, even if this is not 100% bulletprof, the fact that it is supposed to work much better on linux and (from what I understood) on Windows, means that a huge number of users should have their experience improved, and get back to using terraform in a productive and streamlined manner. If in the future there is an "even better way" (TM) to improve the cache in more complex scenarios (like network file systems), then it would be also very welcomed, of course. But in the meantime, the benefits would already be rolled out. Thanks

joaocc avatar Aug 24 '23 22:08 joaocc

I'm failing to understand the rationale behind blocking a small, simple change that would fix the problem for the vast majority of users in favor of holding out for an unknown, unplanned bulletproof solution that (likely) no one is working on. I'm finding it difficult to believe this could have been fixed for months already and yet it just sits here getting stale.

You can always revert the PR when you implement the really badass perfect plugin cache that you're now committed to implementing. And if you're not committed to it, why block this? Can we at least get a timeline of when you expect to release the real fix?

brandon-fryslie avatar Dec 12 '23 06:12 brandon-fryslie

First, apologies to @mplzik for the late response on this, the team appreciates your thoughtful response on this PR. The team is still concerned about edge cases with this solution. Speaking based on my own observations of how the maintainer team considers PRs, I think this is a difficult section of code to change from outside due to plugin cache management being in the critical path for every user.

@brandon-fryslie, to answer your question, the rationale is that the primary path to run Terraform is in sequence, not in parallel. The plugin cache is not concurrency safe. You are correct to assume no one is working on a more robust solution, as making the plugin cache safe for concurrent execution of Terraform is not currently a priority for the team. I'll re-raise this PR with our product manager to see if concurrency-safety for the plugin cache is something we can prioritize. Given that Terraform, as a whole, is not meant to be run in parallel, this likely has considerations beyond the plugin cache. However, it is always possible we will need to prioritize this as we add new features going forward.

Thanks for your feedback and continued interest in this feature.

crw avatar Dec 12 '23 22:12 crw

the primary path to run Terraform is in sequence, not in parallel.

Just to chime in - running Terraform in parallel (for different projects that use the same modules) is a huge part of our CI/CD workflow.

expnch avatar Dec 12 '23 23:12 expnch

@crw I can see "waiting-response" label added. What kind of response from the community is expected for this one to be merged? Is it only about this comment: https://github.com/hashicorp/terraform/pull/33479#discussion_r1462982768?

Asking because if we're close to merge, then I guess we can find community support to finish this.

wosiu avatar Nov 19 '24 13:11 wosiu

@wosiu: @liamcervante, who is a maintainer of Terraform, left a comment on the code. As far as I can tell, the author has not returned to comment on or change that section of code.

crw avatar Nov 19 '24 20:11 crw

Hello there. The TF use-case I was dealing with when proposing this PR was tied to Atlantis. Meanwhile https://github.com/runatlantis/atlantis/pull/3720 got merged, allowing us to work around this issue from different angle (and empirically, we have not noticed this issue since deploying the Atlantis change). As for the Atlantis use-case, I think this PR is no longer necessary.

If there is any interest in having this merged, I can try to give this PR a shot, but to be totally frank, my interest to contribute decreased significantly after the TF licensing changes -- hence the long delay and PR not updated.

mplzik avatar Nov 20 '24 09:11 mplzik

Thanks for the update, @mplzik, and I understand your concerns.

crw avatar Nov 20 '24 19:11 crw

Given that update, I discussed with @liamcervante. If someone else in the community wants to pick this up and address Liam's comments (or if you can convince @mplzik to do it :), that submission would be considered for review. Thanks!

crw avatar Nov 20 '24 19:11 crw

My primary concern -- once merged, would this/similar patch be applicable to OpenTofu without any legal consequences? I'm not a lawyer and thus, don't feel keen enough to dive into details of license compatibility.

mplzik avatar Nov 20 '24 20:11 mplzik

Valid concern. I am working to get an answer, however, the most meaningful answer would likely come from an outside review of the Contributor License Agreement and the Business Source License.

crw avatar Nov 21 '24 19:11 crw

Note that the PR was created before any TF licensing changes happened, which complicates the matters a bit farther. Also, thanks a lot for getting the answers. :)

mplzik avatar Nov 21 '24 19:11 mplzik

Hi! I created an updated and simplified PR here: https://github.com/hashicorp/terraform/pull/36085

I did not include the suggested change by @liamcervante because the filesystem of the temporary directory may differ from the one of the provider cache. In that case, the system call to move the files will fail.

Instead, I'm creating a temporary directory inside the plugin cache directory.

pietrodn avatar Nov 23 '24 12:11 pietrodn

@pietrodn looking at your PR, it looks it's fairly similar to mine -- if it's based on it, the licensing concerns might apply to it as well. Another issue is that the original comments from @crw (https://github.com/hashicorp/terraform/pull/33479#issuecomment-1658914098) also stated that os.Rename is behaving diffently on different platforms and thus refactor might introduce some unexpected behavior. My proposal there was to hide this behind a flag (which I don't know whether is still acceptable). There might be some discussion left before this PR is mergeable (at least that is my interpretation of the discussion).

Also, I'm not able to sufficiently test PR anymore, so we're left with anything that CI/CD might detect.

mplzik avatar Nov 23 '24 19:11 mplzik

I'm looking at how uv (a modern Python package manager) handles its global cache of Python packages.

Apparently, they use a temporary file and then move it into place with a rename with retry function, that is a regular rename syscall for POSIX, and a rename with exponential backoff for Windows (ref).

Here is the higher-level function to download a wheel into a temporary directory and call the persist() function with the exponential backoff file rename.

pietrodn avatar Nov 23 '24 20:11 pietrodn

I've implemented the rename-with-retry method in a second commit, let me know if you like that as a solution to the Windows issue.

pietrodn avatar Nov 23 '24 20:11 pietrodn

@pietrodn reading closely Microsoft documentation, this approach might reduce the chances of race conditions happening, but won't fix the issue -- IIUC the move operation on Windows might not be atomic under the hood and any terraform invocations running in parallel might still catch an incomplete file (e.g. especially during the exponential backoff phase) -- I believe this is why https://github.com/hashicorp/terraform/pull/33479#issuecomment-1658914098 suggested refactoring the cache handling instead of using this code directly (and was my second reason why I stopped updating this PR -- @crw, was there any change on the opinion of refactoring the plugin cache handling code?).

Ideally, the cache should have a lockfile (or lockfiles) to avoid any concurrent updates, but that would be a bit bigger undertaking.

mplzik avatar Nov 24 '24 08:11 mplzik

I agree, but I don't see how merging your PR or mine would make Windows users worse off. Currrently, they are already forced to either not use the plugin cache or init their modules sequentially. The change would, at least, fix the issue for the other platforms.

pietrodn avatar Nov 24 '24 09:11 pietrodn

Neither of the PRs fixes the problem 100% as it is; your PR statistically improves the chances of problems not happening on Windows as well. Not long ago this PR was in a state where (to my best understanding) this was a root of the technical hurdles to accept this PR.

To be totally frank -- I'm not sure why are you opening another PR with code that is mostly based on this one. I wouldn't object if I completely abandoned the PR and wouldn't be answering, but I am answering the PR and discussing things and trying to find out reasonable technical and legal compromises. Non-coordinated parallel efforts waste human time.

And, lastly -- totally non-technical note: Having invested some time to dig through the Terraform codebase and fix this (despite not moving forward with the PR, thinking it's not acceptable and got license-wise complicated), from my point of view this looks to me as having this PR forked and submitted as your own -- which doesn't sound like a fair play. Chances are this might be just an accident, you starting your work on this before noticing my comments, and that's totally understandable, but let's figure a proper way to fix this (and give proper credits to all people involved along the way).

mplzik avatar Nov 24 '24 10:11 mplzik

This feature is something I need for my work as it would greatly speed up some CI processes. I saw this PR being not worked on, apparently only blocked by this comment by @liamcervante.

@crw then wrote:

If someone else in the community wants to pick this up and address Liam's comments (or if you can convince @mplzik to do it :), that submission would be considered for review. Thanks!

So I picked it up and I did it. Since I can't push to your repo, I had no other choice that doing my own. While working on it I realized that using the temporary file system would not work (as it would break Os.Rename() on different file systems). Since I had already done the work and tested it, I thought it would be useful to push a rebased and updated version of your change so that it could be merged.

I never intended to breach your intellectual property or to take credit. In all honesty, the contribution I did is trivial and I don't even care about attribution (or licensing of this change in general). I just care about this problem being fixed for Linux.

In your opinion, what's the best path forward to have a simple fix merged on Terraform mainline? I can commit a few hours of my time on trying to solve the problem. If you prefer that I close my PR and we work only on this, so be it—it's not a problem. 😄

pietrodn avatar Nov 24 '24 12:11 pietrodn

Hey, first of all, we're all here to fix the annoying TF bug :-) . Apologies if I sounded too grumpy (getting rid of bad cold); that wasn't something I wanted to see in the first place. I don't have a best way forward to have this merged due to objections expressed by hashicorp folks earlier in the comments.

Again, I'd like to wait for @crw's answer, but I did a small research to make the call about what the requirements are:

  • as for concerns from https://github.com/hashicorp/terraform/pull/33479#issuecomment-1852943585, IIUC either of our PRs can satisfy that by having an explicit opt-in flag or envvar. That way, this PR wouldn't change the critical path unless the end users opt in explicitly. Ideally, a couple of if os.GetEnv("OPT_IN_ENV") { /* fixed code */ else { /* original code */ } might do the trick, although it's not nice.
  • a more reasonable would probably be to take a step back and use filesystem-level locking and use lockfiles or similar mechanism to ensure that there's only one process writing the file and any process only reads the file once it's been written completely. This would avoid any race conditions inherently. I'm afraid that despite of this being a more (academically) valid solution, there would still need to be some kind of opt-in.
  • get more clear idea from @crw on any other preferred course action from Hashicorp's side (w.r.t. https://github.com/hashicorp/terraform/pull/33479#issuecomment-1852943585).

Since this is a PR older than the license switch, I'd love to be able to contribute this to opentofu as well, if possible -- that's I mentioned my license concerns. It's not a showstopper (I don't want to block PR that's publicly visible either way), but I think users of both projects deserve to have this fixed.

mplzik avatar Nov 24 '24 14:11 mplzik

I also thought about file locks. The Terraform codebase has already an implementation of file locks (for both UNIX and Windows!), so it could be feasible to extract it and use that.

pietrodn avatar Nov 24 '24 15:11 pietrodn

Nice! One of my main concerns here was that introducing an additional library (I found https://github.com/danjacques/gofslock) might again get problematic license-wise and a DIY solution would sidetrack the fix quite a lot.

mplzik avatar Nov 24 '24 15:11 mplzik

In #36085 I wrote a proof of concept implementation for the file locks. I ran it on a Terraform module monorepo that gave trouble when initializing with a shared cache, and the file lock seems to solve the problem (the file rename alone didn't).

Perhaps the tryFileLock mechanism needs to be made more robust but it's there to take some early feedback. :)

pietrodn avatar Nov 24 '24 16:11 pietrodn

Thanks for the continued thought on this issue. Unfortunately I do not have an update to the license question, the person (lawyer) who can offer an assessment on the IP implications is on a long vacation, so we are waiting his return.

In the meantime I will bring this PR and the associated new PRs up once again in triage to see if there is any change on willingness to review. Thanks!

crw avatar Dec 02 '24 22:12 crw