terraform
terraform copied to clipboard
GCS backend: Detect and force-unlock stale locks.
This change fixes the issue of orphaned lock files, which Terraform leaves behind in some scenarios. Such lock files had to be removed manually, either by means of a force-unlock or by deleting the file in GCS.
The "updated" timestamp of the lock file is used as an indicator of staleness. This timestamp is updated once per minute as long as the lock is held. A lock file is considered stale and is force-unlocked if its timestamp hasn't been updated for several minutes.
@octo @danawillow
Thanks for the PR @ubschmidt2!
This seems like a reasonable feature to me, but it does require a little more care around the lock management than in the "fail locked" case that we have now, and for some cases the architecture of terraform itself prevents us from handling it very well at all.
The main situation to consider, is that now that we have locks available, users are putting this into automation and blocking sometimes concurrent operations using the lock. If the client holding the lock loses connectivity temporarily, this can cause a waiting instance to possible incorrectly obtain the lock (granted, losing connectivity for over 4min, then regaining it and completing the run would be a very rare case).
Optimally terraform core could detect this and abort from the unsafe situation, but that isn't really possible with the current architecture. The consul backend (in which the lock always requires active connectivity) does 2 things to try and protect the user however -- it knows when the lock was lost and can report an error to the user, and it uses a CAS operation to ensure that the final state being overwritten is the the correct version.
These things are tradeoff's of course, as if there are two instances applying at the same time, there is technically no "correct" state to write at the end, but I think at least reporting the situation to the user will alert them that there could be inconsistencies.
So, after that background from my adventures in state locking, I'll add the rest of the comments inline in the review.
Thanks, James, your review comments are very helpful. I'm working on the suggested improvements.
I've updated the pull request.
FWIW, the current Travis failure is unrelated to this PR.
# github.com/hashicorp/terraform/command
command/output_test.go:253:15: unknown field 'ContextOpts' in struct literal of type Meta
command/output_test.go:281:15: unknown field 'ContextOpts' in struct literal of type Meta
PTAL, if you have a chance.
PTAL, if you have a chance.
Hi @ubschmidt2,
Sorry about the long delay here, but we're nearing completion on the long 0.12 road and I'm going over PRs that were blocked by that work.
Can you rebase on master and re-run the tests?
Ah, sorry, James, missed your comment. Yes, I'll do that. Also, still busy here with adoption of 0.12. ;-)
Do we know if this is still an issue with the current backend? I have added it to the list of PRs for the GCS team to triage. Thanks!
I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.