tofu-controller
tofu-controller copied to clipboard
Terraform State Lock not released
We've hit an issue where the State lock was not released due to the runner pod not shutting down correctly (assumption here) due to a node refresh on an EKS Cluster. So we ended up with a Kubernetes Lease
that was still held.
$ kubectl -n flux-system get lease
NAMESPACE NAME HOLDER AGE
...
flux-system lock-tfstate-default-sg1 3d7h
flux-system lock-tfstate-default-sg2 f2ab685b-f84d-ac0b-a125-378a22877e8d 3d7h
The Terraform
resource was reporting the following:
error running Plan: rpc error: code = Unknown desc = error acquiring the state lock: Lock Info: ID: f2ab685b-f84d-ac0b-a125-378a22877e8d Path: Operation: OperationTypePlan Who: controller@sg2-tf-runner Version: 1.1.9 Created: 2022-06-15 22:27:10.244794577 +0000 UTC
Which was repeated in the tf-controller
and tf-runner
logs. We were able to get things running again by deleting the Lease
in question.
So the ask here is to support the Terraform force-unlock feature.
I think the best approach might be to add a new field to the TerraformSpec
, e.g. forceUnlock
, the value of which should be the Lock Info: ID: [value]
, in the above example this would be forceUnlock: f2ab685b-f84d-ac0b-a125-378a22877e8d
. This ID corresponds to the key spec.holderIdentity
on a Lease
.
The tf-controller
/runner
should be tasked with acting on this, allowing for unlocking other Terraform backends that support locking if they become supported by the tf-controller
.
This could then be supported in the tfctl
e.g. tfctl force-unlock f2ab685b-f84d-ac0b-a125-378a22877e8d
which would set the field on the Terraform resource and force a reconciliation. Is there a potential race condition here?
Another possible solution would be to have the tf-controller
watching the Lease
s to see if a tf-runner
completes but fails to release the Lease
, the tf-controller
could then force the release of the Lease
, this wouldn't require user intervention and could maybe be a configuration of the tf-controller
or another configuration option of the Terraform
resource.
We found two issues that were the cause of this in two separate cases:
{
"level": "error",
"ts": "2022-06-15T20:14:34.824Z",
"logger": "runner.terraform",
"msg": "error creating the plan",
"error": "exit status 1
Error: Error releasing the state lock
Error message: etcdserver: request timed out
Terraform acquires a lock when accessing your state to prevent others
running Terraform to potentially modify the state at the same time. An
error occurred while releasing this lock. This could mean that the lock
did or did not release properly. If the lock didn't release properly,
Terraform may not be able to run future commands since it'll appear as if
the lock is held.
In this scenario, please call the \"force-unlock\" command to unlock the
state manually. This is a very dangerous operation since if it is done
erroneously it could result in two people modifying state at the same time.
Only call this command if you're certain that the unlock above failed and
that no one else is holding a lock."
}
And
I0615 22:29:32.020709 7 request.go:665] Waited for 1.027123219s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/node.k8s.io/v1?timeout=32s
{\"level\":\"info\",\"ts\":\"2022-06-15T22:29:40.658Z\",\"logger\":\"runner.terraform\",\"msg\":\"creating a plan\"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd165ea]
goroutine 359 [running]:
github.com/hashicorp/terraform-exec/tfexec.(*Terraform).buildTerraformCmd(0x0, {0xc0005e0280, 0x1}, 0x162e028, {0xc000616400, 0x47d9be, 0x0})
/go/pkg/mod/github.com/tf-controller/[email protected]/tfexec/cmd.go:174 +0x2a
github.com/hashicorp/terraform-exec/tfexec.(*Terraform).planCmd(0xc0003fc120, {0x1847728, 0xc00063a3c0}, {0xc0003fc130, 0x1, 0x10})
/go/pkg/mod/github.com/tf-controller/[email protected]/tfexec/plan.go:179 +0xb6f
github.com/hashicorp/terraform-exec/tfexec.(*Terraform).Plan(0x1498420, {0x1847728, 0xc00063a3c0}, {0xc0003fc130, 0xf, 0x0})
/go/pkg/mod/github.com/tf-controller/[email protected]/tfexec/plan.go:100 +0x31
github.com/weaveworks/tf-controller/runner.(*TerraformRunnerServer).Plan(0xc000ad63f0, {0x18477d0, 0xc0005247e0}, 0xc0003c3270)
/workspace/runner/server.go:384 +0x4e6
github.com/weaveworks/tf-controller/runner._Runner_Plan_Handler({0x15dcdc0, 0xc000ad63f0}, {0x18477d0, 0xc0005247e0}, 0xc0005c15c0, 0x0)
/workspace/runner/runner_grpc.pb.go:489 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0005c8000, {0x1873270, 0xc000603380}, 0xc0004c0000, 0xc000614cc0, 0x23c99c0, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/server.go:1282 +0xccf
google.golang.org/grpc.(*Server).handleStream(0xc0005c8000, {0x1873270, 0xc000603380}, 0xc0004c0000, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/server.go:1619 +0xa2a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
/go/pkg/mod/google.golang.org/[email protected]/server.go:921 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
/go/pkg/mod/google.golang.org/[email protected]/server.go:919 +0x294
To expand on the above I'm looking at adding the following to the spec
:
spec:
backendConfig:
state:
forceUnlock: [holder-id]
By default, and following the same path as Terraform, this would require manual intervention by an operator, who would need to put this [holder-id]
in place in the the Terraform
resource. This can be done by changing the resource in the Flux Source
or by running tfctl force-unlock [terraform-resource] [holder-id]
, which would set the value and then trigger a reconcile for that resource.
Does tfctl reconcile
do the reconcile right away or is it queued up and take place X
time after you run the command? Ideally we'd like the tfctl force-unlock
to trigger a reconciliation immediately so that the Terraform
resource doesn't get reset by Flux before the reconcile happens.
I think it also makes sense to allow an automated in which case you would set spec.backendConfig.state.forceUnlock: auto
, this allows for the use case where the assumption is that nothing else will ever be using this state, so we are okay with the tf-controller
automatically unlocking the state. Obviously putting all the warnings around this in the documentation.
@chanwit @tomhuang12 @phoban01 what do you think about the above?
I have changed the above noted structure as it impacts the backendConfig
in a way that seems to complicate it more than I'd like.
So I've moved spec.backendConfig.state
out to spec.terraformState
and changed the structure of this a little to the following:
spec:
terraformState:
forceUnlock: (yes|no|auto)
lockIdentifier: [holder-id]
spec.terraformState.forceUnlock
defaults to no
Thank you @Nalum This design is excellent. It gives us auditable activities for force unlocking.
spec:
terraformState:
forceUnlock: (yes|no|auto)
lockIdentifier: [holder-id]
There are a few things to improve UX.
- I would change
terraformState
totfstate
so users would be able to reason that they're working withtfstate
at the moment. - What are the semantics of
forceUnlock=yes
andforceUnlock=auto
? How are they different?
Thanks for the feedback 👍
There are a few things to improve UX.
- I would change
terraformState
totfstate
so users would be able to reason that they're working withtfstate
at the moment.
Ah yes, this makes sense, will make this change 👍
- What are the semantics of
forceUnlock=yes
andforceUnlock=auto
? How are they different?
The purpose of auto
is to have the tf-controller
automatically unlock the state, in the case where the user is happy for this to happen, and yes
is a manual trigger, the expectation being that it is triggered from the tfctl
.
This has just made me think about some of the implementation and I need to revisit it.
Closing as #268 has been merged!