prom-run icon indicating copy to clipboard operation
prom-run copied to clipboard

Hang in reap() after timeout

Open bboreham opened this issue 5 years ago • 8 comments

Return from the http endpoint:

  <h2>Prometheus Command Runner</h2>
  <p>"/data/service-conf/ansible/ansible-diff prod" output:</p>
  <pre>command timed out</pre>
  <p>Run at 2018-11-03 19:46:53.80704496 &#43;0000 UTC took 10m0.00211331s<p>

Stack traces via kill -SIGQUIT:

goroutine 7 [syscall, 3982 minutes]:
syscall.Syscall6(0x3d, 0xffffffffffffffff, 0xc4205a3c44, 0x0, 0x0, 0x0, 0x0, 0x7fbbd37a8000, 0x0, 0x0)
	/usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
syscall.wait4(0xffffffffffffffff, 0xc4205a3c44, 0x0, 0x0, 0x24, 0xc4205a3cc0, 0x3)
	/usr/local/go/src/syscall/zsyscall_linux_amd64.go:172 +0x88
syscall.Wait4(0xffffffffffffffff, 0xc4205a3c90, 0x0, 0x0, 0x3, 0x0, 0x0)
	/usr/local/go/src/syscall/syscall_linux.go:256 +0x51
main.reap()
	/home/ubuntu/.go_workspace/src/github.com/weaveworks/docker-ansible/vendor/github.com/tomwilkie/prom-run/main.go:35 +0xa9
main.main.func3(0xc420015a30, 0x7ffd23561149, 0x27, 0xc42000a130, 0x1, 0x1, 0xc420015a40, 0xc420015a90, 0xc42012a2c0, 0xc420015a80, ...)
	/home/ubuntu/.go_workspace/src/github.com/weaveworks/docker-ansible/vendor/github.com/tomwilkie/prom-run/main.go:113 +0x633
created by main.main
	/home/ubuntu/.go_workspace/src/github.com/weaveworks/docker-ansible/vendor/github.com/tomwilkie/prom-run/main.go:133 +0x333

goroutine 1351637 [chan receive, 3982 minutes]:
os/exec.(*Cmd).Wait(0xc42067c2c0, 0x0, 0x0)
	/usr/local/go/src/os/exec/exec.go:442 +0x124
os/exec.(*Cmd).Run(0xc42067c2c0, 0xc42005a150, 0x824d70)
	/usr/local/go/src/os/exec/exec.go:279 +0x48
os/exec.(*Cmd).CombinedOutput(0xc42067c2c0, 0x50a514, 0xc4205b6e80, 0xffffffffffffffff, 0xc4206aef90, 0x53df89)
	/usr/local/go/src/os/exec/exec.go:494 +0x140
main.main.func3.1(0xc42067c2c0, 0xc420482060)
	/home/ubuntu/.go_workspace/src/github.com/weaveworks/docker-ansible/vendor/github.com/tomwilkie/prom-run/main.go:88 +0x2f
created by main.main.func3
	/home/ubuntu/.go_workspace/src/github.com/weaveworks/docker-ansible/vendor/github.com/tomwilkie/prom-run/main.go:90 +0x32d

I believe the source is https://github.com/tomwilkie/prom-run/blob/1e15c08f93a0286d8945b784ccad2dbdf0000def/main.go#L35

So both goroutines calling wait are stuck. I wonder if there is a race?

bboreham avatar Nov 06 '18 14:11 bboreham

What version are you running? From https://github.com/weaveworks/docker-ansible/blob/master/vendor/manifest I see its 1e15c08f93a0286d8945b784ccad2dbdf0000def

I'd try latest; I re-wrote this a while back to fix a bunch of issues.

tomwilkie avatar Nov 06 '18 15:11 tomwilkie

I did look at the latest code, and it doesn't seem any different in this respect - it still calls Wait4() to wait indefinitely.

bboreham avatar Nov 07 '18 17:11 bboreham

But it should never call it concurrently. On Wed, 7 Nov 2018 at 17:58, Bryan Boreham [email protected] wrote:

I did look at the latest code, and it doesn't seem any different in this respect - it still calls Wait4() to wait indefinitely.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/tomwilkie/prom-run/issues/7#issuecomment-436718084, or mute the thread https://github.com/notifications/unsubscribe-auth/AAbGhXYRKDwT1xgNogXy20P60-wtCmP8ks5usx9TgaJpZM4YQkkQ .

tomwilkie avatar Nov 07 '18 17:11 tomwilkie

Look at the stack traces - exec is calling wait and reap is calling wait.

bboreham avatar Nov 07 '18 19:11 bboreham

Unless I'm mistaken, master never calls exec and wait concurrently...

tomwilkie avatar Nov 07 '18 19:11 tomwilkie

However, reflecting on the ps output I saw, I wonder if the sub-process has not exited at all.

From the docs I expect cancel() to kill it.

bboreham avatar Nov 07 '18 19:11 bboreham

The code on master calls defer cancel() - how would that work on a timeout?

bboreham avatar Nov 07 '18 19:11 bboreham

Thats just idiomatic use of context; its a noop in this case.

This is where it ends up being used: https://github.com/golang/go/blob/go1.11/src/os/exec/exec.go#L408, and the waitDone channel will be closed first.

tomwilkie avatar Nov 07 '18 19:11 tomwilkie