
exec.Cmd.Stop() causes a panic when it tries to SIGKILL; fixing it is insufficient due to a data race

Open dsharp-pivotal opened this issue 2 years ago • 4 comments

What happened:

If the command run by "utils/exec".New().Command() ignores SIGTERM and does not exit, or takes longer than 10 seconds to exit, using cmd.Stop() will result in a panic after 10 seconds.

What you expected to happen:

After SIGTERM fails to stop the process, Stop should send SIGKILL.

How to reproduce it:

I wrote a test that demonstrates the issue. It can be added to exec_test.go:

func TestStopWhenSigtermIgnored(t *testing.T) {
  // Create a process that will ignore SIGTERM
  pr, pw, err := os.Pipe()
  if err != nil {
    t.Fatalf("pipe: %v", err)
  }
  cmd := New().Command("/bin/sh", "-c", `trap "" TERM && echo "trap" && exec sleep 15`)
  cmd.SetStdin(nil)
  cmd.SetStdout(pw)
  cmd.SetStderr(pw)
  if err := cmd.Start(); err != nil {
    t.Fatalf("expected Start() not to error, got: %v", err)
  }

  // Wait for SIGTERM's disposition to be set
  r := bufio.NewReader(pr)
  var line string
  if line, err = r.ReadString('\n'); err != nil {
    t.Fatalf("ReadString: %v", err)
  }
  if line != "trap\n" {
    t.Errorf("unexpected write from shell script: %v", line)
  }

  // Stop() should escalate to SIGKILL after ~10 seconds, without racing on the process state
  cmd.Stop()
  if err := cmd.Wait(); err == nil || err.Error() != "signal: killed" {
    t.Errorf("expected Wait() to receive an error about a signal, got: %v", err)
  }
}
→ go test -v ./exec/ -test.run ^TestStopWhenSigtermIgnored$
=== RUN   TestStopWhenSigtermIgnored
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1109a40]

goroutine 19 [running]:
os.(*ProcessState).exited(...)
	/usr/local/Cellar/go/1.20.6/libexec/src/os/exec_posix.go:83
os.(*ProcessState).Exited(...)
	/usr/local/Cellar/go/1.20.6/libexec/src/os/exec.go:155
k8s.io/utils/exec.(*cmdWrapper).Stop.func1()
	/Users/dsharp/workspace/kubernetes-utils/exec/exec.go:189 +0x20
created by time.goFunc
	/usr/local/Cellar/go/1.20.6/libexec/src/time/sleep.go:176 +0x32
FAIL	k8s.io/utils/exec	10.188s
FAIL

Anything else we need to know?:

The bug seems to come from a misconception about "os/exec".Cmd.ProcessState.Exited():

// Exited reports whether the program has exited.
// On Unix systems this reports true if the program exited due to calling exit,
// but false if the program terminated due to a signal.

However:

	// ProcessState contains information about an exited process.
	// If the process was started successfully, Wait or Run will
	// populate its ProcessState when the command completes.

That is, Cmd.ProcessState is nil until Cmd.Wait() successfully returns. Exited() is really only useful for distinguishing whether the process called exit() / returned normally or was killed by a signal.


I also have a patch to address the panic:

 → git diff
diff --git a/exec/exec.go b/exec/exec.go
index d9c91e3..fd9ead2 100644
--- a/exec/exec.go
+++ b/exec/exec.go
@@ -186,7 +186,7 @@ func (cmd *cmdWrapper) Stop() {
        c.Process.Signal(syscall.SIGTERM)

        time.AfterFunc(10*time.Second, func() {
-               if !c.ProcessState.Exited() {
+               if c.ProcessState == nil {
                        c.Process.Signal(syscall.SIGKILL)
                }
        })

However, I'm concerned about a possible data race on the osexec.Cmd object between the time.AfterFunc goroutine and whichever goroutine calls cmd.Wait(). Running under the Go race detector with the patch above applied does not show a race. But if patched with the inverted condition (if c.ProcessState != nil {), the race detector is triggered (it's not clear to me why detection is sensitive to inverting the condition):

→ go test -race -count=1 -v ./exec/ -test.run ^TestStopWhenSigtermIgnored$
=== RUN   TestStopWhenSigtermIgnored
==================
WARNING: DATA RACE
Write at 0x00c0001b20a8 by goroutine 6:
  os/exec.(*Cmd).Wait()
      /usr/local/Cellar/go/1.20.6/libexec/src/os/exec/exec.go:894 +0x1b7
  k8s.io/utils/exec.(*cmdWrapper).Wait()
      /Users/dsharp/workspace/kubernetes-utils/exec/exec.go:157 +0x30
  k8s.io/utils/exec.TestStopWhenSigtermIgnored()
      /Users/dsharp/workspace/kubernetes-utils/exec/exec_test.go:160 +0x4ee
  testing.tRunner()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1629 +0x47

Previous read at 0x00c0001b20a8 by goroutine 8:
  k8s.io/utils/exec.(*cmdWrapper).Stop.func1()
      /Users/dsharp/workspace/kubernetes-utils/exec/exec.go:189 +0x44

Goroutine 6 (running) created at:
  testing.(*T).Run()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1629 +0x805
  testing.runTests.func1()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:2036 +0x8d
  testing.tRunner()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1576 +0x216
  testing.runTests()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:2034 +0x87c
  testing.(*M).Run()
      /usr/local/Cellar/go/1.20.6/libexec/src/testing/testing.go:1906 +0xb44
  main.main()
      _testmain.go:71 +0x2e9

Goroutine 8 (finished) created at:
  time.goFunc()
      /usr/local/Cellar/go/1.20.6/libexec/src/time/sleep.go:176 +0x47
==================
    exec_test.go:161: expected Wait() to receive an error about a signal, got: <nil>
    testing.go:1446: race detected during execution of test
--- FAIL: TestStopWhenSigtermIgnored (15.02s)
=== NAME
    testing.go:1446: race detected during execution of test
FAIL
FAIL	k8s.io/utils/exec	15.218s
FAIL

If sending SIGKILL is to depend on the termination state of cmd (and it seems like it should), then there needs to be some kind of synchronization point between the time.AfterFunc goroutine and the goroutine calling Wait().

It's possible that Stop() should be removed. I could only find one user of "utils/exec".Cmd.Stop() in public repositories by searching with sourcegraph, and that is in kubernetes/kubernetes/pkg/volume/flexvolume/driver-call.go. That code itself is lacking sufficient synchronization, and is in a deprecated component. https://github.com/kubernetes/kubernetes/blob/99190634ab252604a4496882912ac328542d649d/pkg/volume/flexvolume/driver-call.go#L131

Environment:

  • Kubernetes version (use kubectl version): n/a
  • OS (e.g. from /etc/os-release): macOS (dev environment)
  • Kernel (e.g. uname -a): Darwin
  • Install tools:
  • Others:

dsharp-pivotal avatar Sep 02 '23 03:09 dsharp-pivotal

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 27 '24 11:01 k8s-triage-robot

/remove-lifecycle stale

dsharp-pivotal avatar Jan 29 '24 23:01 dsharp-pivotal

I re-tested this, and it's still present. In reviewing, I found some errors in my analysis and patch, so I updated the description.

dsharp-pivotal avatar Jan 29 '24 23:01 dsharp-pivotal

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 29 '24 00:04 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar May 29 '24 01:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jun 28 '24 01:06 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 28 '24 01:06 k8s-ci-robot