Termination grace period is not respected when the pod is killed.
Hi Team, we started noticing odd behaviour: many state locks were not released when the tf-runners were killed, even with the graceful shutdown period set to 1 hour. After some investigation, we noticed that when the pod was restarted, SIGKILL forcefully killed the Terraform process, so it never began a graceful shutdown, which sometimes left the state lock unreleased.
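For reference, the 1-hour graceful shutdown mentioned above corresponds to the standard Kubernetes grace period on the runner pod (how it is applied to the runner pod depends on the controller's configuration):

```yaml
spec:
  # The kubelet waits this long between sending SIGTERM and SIGKILL.
  terminationGracePeriodSeconds: 3600  # 1 hour
```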
We found these related issues: https://github.com/hashicorp/terraform-exec/pull/332 https://github.com/hashicorp/terraform-exec/pull/334
logs:
2022-11-29T21:42:10.266800268Z Error message: workspace already locked (lock ID:
2022-11-29T21:42:10.266795895Z
2022-11-29T21:42:10.266789968Z Error: Error acquiring the state lock
2022-11-29T21:42:10.266760286Z
2022-11-29T21:42:07.606685256Z info creating a plan
2022-11-29T21:42:07.587542369Z info mapping the Spec.VarsFrom
2022-11-29T21:42:07.587537646Z info mapping the Spec.Vars
2022-11-29T21:42:07.587512719Z info setting up the input variables
2022-11-29T21:42:07.58182462Z Terraform has been successfully initialized!
2022-11-29T21:42:07.581792524Z
2022-11-29T21:42:07.581420214Z you run "terraform init" in the future.
2022-11-29T21:42:07.581416395Z so that Terraform can guarantee to make the same selections by default when
2022-11-29T21:42:07.58141258Z selections it made above. Include this file in your version control repository
2022-11-29T21:42:07.581408747Z Terraform has created a lock file .terraform.lock.hcl to record the provider
2022-11-29T21:42:07.581401961Z
2022-11-29T21:42:00.820815414Z Initializing provider plugins...
2022-11-29T21:42:00.820797552Z
2022-11-29T21:42:00.481037098Z use this backend unless the backend configuration changes.
2022-11-29T21:42:00.481032663Z Successfully configured the backend "remote"! Terraform will automatically
2022-11-29T21:42:00.48101017Z
2022-11-29T21:42:00.198431966Z Initializing the backend...
2022-11-29T21:41:58.413545513Z Upgrading modules...
2022-11-29T21:41:58.386167971Z }
2022-11-29T21:41:58.38616475Z "terraform_outdated": true
2022-11-29T21:41:58.386161487Z "provider_selections": {},
2022-11-29T21:41:58.386157733Z "platform": "linux_amd64",
2022-11-29T21:41:58.386153382Z "terraform_version": "1.1.9",
2022-11-29T21:41:58.386133029Z {
2022-11-29T21:41:58.251028908Z info mapping the Spec.BackendConfigsFrom
2022-11-29T21:41:58.249567118Z info initializing
2022-11-29T21:41:58.245264737Z info getting envvars from os environments
2022-11-29T21:41:58.24525401Z info setting envvars
2022-11-29T21:41:58.241270762Z info creating new terraform
2022-11-29T21:41:58.237230706Z info looking for path
2022-11-29T21:41:58.229338345Z info processing configuration
2022-11-29T21:41:58.225878148Z info write config to file
2022-11-29T21:41:58.225855343Z info write backend config
2022-11-29T21:41:58.223764885Z info preparing for Upload and Extraction
2022-11-29T21:36:25.886572001Z info cleanup TmpDir
2022-11-29T21:36:25.88396556Z error error creating the plan
2022-11-29T21:36:16.127120371Z info creating a plan
events:
2022-11-29 21:36:26 Normal Pod env-76461b07-9216-4d60-8a43-c5a30ddfdd8d-tf-runner Stopping container tf-runner
2022-11-29 21:36:26 Warning Terraform env-76461b07-9216-4d60-8a43-c5a30ddfdd8d error running Plan: rpc error: code = Internal desc = signal: killed
2022-11-29 21:36:25 Warning Terraform env-76461b07-9216-4d60-8a43-c5a30ddfdd8d Plan error: rpc error: code = Internal desc = signal: killed
2022-11-29 21:35:54 Normal Pod env-76461b07-9216-4d60-8a43-c5a30ddfdd8d-tf-runner Started container tf-runner
2022-11-29 21:35:53 Normal Pod env-76461b07-9216-4d60-8a43-c5a30ddfdd8d-tf-runner Created container tf-runner
2022-11-29 21:35:53 Normal Pod env-76461b07-9216-4d60-8a43-c5a30ddfdd8d-tf-runner Container image "ghcr.io/weaveworks/tf-runner:v0.13.0-rc.1" already present on machine
2022-11-29 21:35:51 Normal Pod env-76461b07-9216-4d60-8a43-c5a30ddfdd8d-tf-runner Successfully assigned terraform-runners/env-76461b07-9216-4d60-8a43-c5a30ddfdd8d-tf-runner
2022-11-29 21:35:46 Normal Pod env-76461b07-9216-4d60-8a43-c5a30ddfdd8d-tf-runner Successfully assigned terraform-runners/env-76461b07-9216-4d60-8a43-c5a30ddfdd8d-tf-runner
2022-11-29 21:35:46 Normal Pod env-76461b07-9216-4d60-8a43-c5a30ddfdd8d-tf-runner Stopping container tf-runner
2022-11-29 21:35:46 Normal Pod env-76461b07-9216-4d60-8a43-c5a30ddfdd8d-tf-runner Started container tf-runner
2022-11-29 21:35:46 Normal Pod env-76461b07-9216-4d60-8a43-c5a30ddfdd8d-tf-runner Created container tf-runner
2022-11-29 21:35:46 Normal Pod env-76461b07-9216-4d60-8a43-c5a30ddfdd8d-tf-runner Container image "ghcr.io/weaveworks/tf-runner:v0.13.0-rc.1" already present on machine
Is there something we can do to avoid this behaviour without setting force unlock to auto?
Thanks
@Nalum could you please give some advice here?
BTW @TarasLykhenko could you please update tf-controller to v0.13.1? v0.13.0-rc.1 has many known issues that are fixed in the more recent version.
Unfortunately I'm not sure there is another way to handle this without dealing with the locked state, at least until hashicorp/terraform-exec#334 comes into play.
You don't have to use the `auto` value, but you will need to deal with the locked state either with `auto` or manually by setting the required fields:

```yaml
spec:
  tfstate:
    forceUnlock: "yes"
    lockIdentifier: [lock-id-here]
```
FYI check out my workaround from https://github.com/weaveworks/tf-controller/issues/531#issuecomment-1452292954
This is from a write-up I did analysing the problem in some depth.
When Kubernetes terminates a pod, the init process is sent SIGTERM.
tf-controller uses tini as the init process: https://github.com/weaveworks/tf-controller/blob/29c37473906d2d664e25db86f76eb5832ec5c7ee/runner-base.Dockerfile#L66

```dockerfile
ENTRYPOINT [ "/sbin/tini", "--", "tf-runner" ]
```
tini handles received signals by forwarding them to its child process (in this case tf-runner):

```c
/* There is a signal to handle here */
switch (sig.si_signo) {
    case SIGCHLD:
        /* Special-cased, as we don't forward SIGCHLD. Instead, we'll
         * fallthrough to reaping processes.
         */
        PRINT_DEBUG("Received SIGCHLD");
        break;
    default:
        PRINT_DEBUG("Passing signal: '%s'", strsignal(sig.si_signo));
        /* Forward anything else */
        if (kill(kill_process_group ? -child_pid : child_pid, sig.si_signo)) {
            if (errno == ESRCH) {
                PRINT_WARNING("Child was dead when forwarding signal");
            } else {
                PRINT_FATAL("Unexpected error when forwarding signal: '%s'", strerror(errno));
                return 1;
            }
        }
        break;
```
Note that tini can be configured to signal an entire process group, but see below for why this doesn't affect the terraform process.
tf-runner sets itself up to handle signals; when a signal is received, it is placed onto a channel.

```go
// catch the SIGTERM from the kubelet to gracefully terminate
sigterm := make(chan os.Signal, 1)
signal.Notify(sigterm, syscall.SIGTERM)
defer func() {
	signal.Stop(sigterm)
}()
```
This channel is passed through to the [TerraformRunnerServer](https://github.com/weaveworks/tf-controller/blob/main/mtls/runner_serve.go#L47).
Within the `Apply` method, there is code that [cancels a context upon receiving the signal](https://github.com/weaveworks/tf-controller/blob/29c37473906d2d664e25db86f76eb5832ec5c7ee/runner/server.go#L356-L357).
```go
ctx, cancel := context.WithCancel(ctx)
go func() {
	select {
	case <-r.Done:
		cancel()
	case <-ctx.Done():
	}
}()
```
This cancellation is significant: the assumption here is that the tf-runner process is signalled with SIGTERM, and that this results in the cancellation of the context.
This context is passed into github.com/hashicorp/terraform-exec, which delegates execution of the generated command line to Go's exec.Command.
There are several implementations of runTerraformCmd but the Linux version is what applies here.
This code does this:
```go
cmd.SysProcAttr = &syscall.SysProcAttr{
	// kill children if parent is dead
	Pdeathsig: syscall.SIGKILL,
	// set process group ID
	Setpgid: true,
}
```
This does two different things. `Setpgid: true` puts the child process into a separate process group, which means that signals sent to the parent process (tf-runner) are never automatically propagated to the child process (terraform); this is why even configuring tini to send signals to the process group would not fix the problem we're seeing.
The other thing it does, `Pdeathsig: syscall.SIGKILL`, is crucial, because it changes how process termination occurs on Linux.
The runTerraformCmd code starts the child process and then waits for it to complete.
The current assumption is that the terraform process sets up signal handling similar to tf-runner's, and handles SIGTERM by releasing the lock.
While the terraform process is executing, tf-runner receives a SIGTERM and cancels the context, which results in SIGKILL being sent to terraform.
This is bad!
Let's go back to the Pdeathsig configuration.
This sets an attribute on the spawned process (Linux only) that defines the signal to be sent to it when the thread that started it terminates.
This is Go, we don't do threads... except that goroutines are multiplexed onto a pool of OS threads managed by the Go runtime, which can create and destroy threads as necessary.
So, at any point in time, the thread that started the terraform process could be terminated, which (because of Pdeathsig) would result in the spawned process being SIGKILLed.
https://github.com/golang/go/issues/27505 recommends locking the goroutine to its OS thread around this, which would presumably prevent the thread being killed off by the runtime.
We know that switching the Pdeathsig from SIGKILL to SIGTERM resulted in a complete drop-off in fail-to-unlock issues, which implies that the problem is due to the thread dying (as to why, it could be because of the parent process terminating).
Additionally: since sending SIGTERM to the tf-runner process would result in the context being cancelled, and process.Kill() being called, there's at least a strong suggestion that SIGTERM is not being used. If it were, then changing Pdeathsig would not make any difference, because the terraform processes would be killed by process.Kill() both before and after the change.
Kubelet will use SIGKILL without SIGTERM and a grace period, in some circumstances: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#hard-eviction-thresholds
Regarding the kubelet and SIGKILL without SIGTERM: I actually observed this during troubleshooting, where a group of tf-runner pods were evicted due to node pressure caused by disk (imagefs.available). The commonality was that all of the tf-runner pods scheduled on the node with disk pressure hit state-locking issues on subsequent runs.
Looks like https://github.com/hashicorp/terraform-exec/pull/512 can resolve this issue, and tfexec uses an MPL-2.0 license, so it's fine to upgrade to it.
The only problem is that it introduces a set of breaking changes in https://github.com/hashicorp/terraform-exec/releases/tag/v0.18.0, which is the reason it's pinned in the go.mod file. I haven't looked into the consequences of removing the specific error handling from tofu-controller, so that's a remaining task before this issue can be resolved.