chromedp allocator gets stuck randomly

We've been using chromedp successfully on a variety of platforms and devices for almost a year now. It works wonderfully.

Recently we've noticed rare instances where chromedp gets stuck trying to launch chrome for some reason. Here's one example:

goroutine 41317 [select, 13499 minutes]:
github.com/chromedp/chromedp.(*ExecAllocator).Allocate(0xc001a56050, 0x54347a0, 0xc0011847e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	github.com/chromedp/[email protected]/allocate.go:189 +0x96b
github.com/chromedp/chromedp.Run(0x54347a0, 0xc0011847e0, 0xc0008e22f0, 0x2, 0x2, 0xc000ac25a0, 0x0)
	github.com/chromedp/[email protected]/chromedp.go:217 +0x209

goroutine 41311 [chan receive, 13499 minutes]:
github.com/chromedp/chromedp.(*ExecAllocator).Allocate.func2(0x54347a0, 0xc0011847e0, 0xc001852420, 0xc001852401, 0xc000044af0, 0x49, 0xc0013ce160, 0xc001a56050)
	github.com/chromedp/[email protected]/allocate.go:199 +0x4c
created by github.com/chromedp/chromedp.(*ExecAllocator).Allocate
	github.com/chromedp/[email protected]/allocate.go:198 +0xa2e

goroutine 41309 [chan receive, 13499 minutes]:
github.com/chromedp/chromedp.NewContext.func1(0x54347a0, 0xc0011847e0, 0xc0013ce160)
	github.com/chromedp/[email protected]/chromedp.go:128 +0x78
created by github.com/chromedp/chromedp.NewContext
	github.com/chromedp/[email protected]/chromedp.go:127 +0x22a

This particular instance was on macOS, but we've seen the same behavior on Windows as well.

At the very least it seems like some sort of timeout would be helpful here:

https://github.com/chromedp/chromedp/blob/aa5cc8729d9451c668fec01ca0a7e7b4cf8b74b5/allocate.go#L189-L193

Apr 10 '20 17:04 tmm1

Hm. Have you been able to reproduce this reliably in any way?

Apr 17 '20 16:04 mvdan

I just found a repro on one of my devices:

goroutine 104 [select, 1 minutes]:
github.com/chromedp/chromedp.(*ExecAllocator).Allocate(0xbce7c510, 0xba8ec008, 0xbca040a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	github.com/chromedp/[email protected]/allocate.go:190 +0x790
github.com/chromedp/chromedp.Run(0xba8ec008, 0xbca040a0, 0xbcf3f3bc, 0x2, 0x2, 0x1, 0xbce74b10)
	github.com/chromedp/[email protected]/chromedp.go:228 +0x188


goroutine 132 [chan receive, 1 minutes]:
github.com/chromedp/chromedp.NewContext.func1(0xba8ec008, 0xbca040a0, 0xbccf6a20)
	github.com/chromedp/[email protected]/chromedp.go:130 +0x50
created by github.com/chromedp/chromedp.NewContext
	github.com/chromedp/[email protected]/chromedp.go:129 +0x21c


goroutine 134 [chan receive, 1 minutes]:
github.com/chromedp/chromedp.(*ExecAllocator).Allocate.func2(0xba8ec008, 0xbca040a0, 0xbcb113f0, 0xbcb11301, 0xbc83b500, 0x53, 0xbccf6a20, 0xbce7c510)
	github.com/chromedp/[email protected]/allocate.go:200 +0x38
created by github.com/chromedp/chromedp.(*ExecAllocator).Allocate
	github.com/chromedp/[email protected]/allocate.go:199 +0x824

What data can I gather? I'm pretty sure the underlying chrome binary is SEGVing on launch, so perhaps pointing the allocator at a shell script that exits unexpectedly will reproduce in the tests.

May 15 '20 00:05 tmm1

Spent a ton of time trying to reproduce this today. I kept trying to add a test I expected would fail, but it wasn't failing even on the same setup.

Finally I discovered a subtle bug in my app's calling code:

err := chromedp.Run(ctx)
if err, ok := err.(*exec.Error); ok && err.Err == exec.ErrNotFound {
    return fmt.Errorf("chrome was not found on your system")
} else if err != nil {
    return err
}
err = chromedp.Run(ctx, ....)

Apparently the if statement reusing err overwrites the value in the !ok case, causing the else to fail and then chromedp.Run is called again and hangs forever.

The bug can be fixed in my app by renaming err in the if statement, however I think chromedp should also complain loudly if the context is reused like this? Here's a failing test for that:

diff --git a/allocate_test.go b/allocate_test.go
index 3610c0f..15b4795 100644
--- a/allocate_test.go
+++ b/allocate_test.go
@@ -104,6 +104,29 @@ func TestExecAllocatorKillBrowser(t *testing.T) {
        }
 }

+func TestExecAllocatorFailReuse(t *testing.T) {
+       t.Parallel()
+
+       // To ensure we don't actually fire any Chrome processes.
+       allocCtx, _ := NewExecAllocator(context.Background(),
+               ExecPath("testdata/chromecrash.sh"))
+
+       ctx, cancel := NewContext(allocCtx)
+       defer cancel()
+       if err := Run(ctx); err == nil {
+               t.Fatalf("expected err but got nil")
+       } else {
+               t.Logf("err1: %v", err)
+       }
+
+       // Try to use expired context again
+       if err := Run(ctx); err == nil {
+               t.Fatalf("expected err but got nil")
+       } else {
+               t.Logf("err2: %v", err)
+       }
+}
+
 func TestSkipNewContext(t *testing.T) {
        t.Parallel()

diff --git a/testdata/chromecrash.sh b/testdata/chromecrash.sh
new file mode 100755
index 0000000..7f7e33d
--- /dev/null
+++ b/testdata/chromecrash.sh
@@ -0,0 +1,2 @@
+#!/bin/sh
+exit 11

which reproduces the hang readily on my mac:

$ go test -v ./ -run TestExecAllocatorFailReuse
=== RUN   TestExecAllocatorFailReuse
=== PAUSE TestExecAllocatorFailReuse
=== CONT  TestExecAllocatorFailReuse
    TestExecAllocatorFailReuse: allocate_test.go:119: err1: chrome failed to start:
^\SIGQUIT: quit
PC=0x7fff70108882 m=0 sigcode=0

goroutine 0 [idle]:
runtime.pthread_cond_wait(0x1a612c8, 0x1a61288, 0x0)
	/usr/local/Cellar/go/1.14.3/libexec/src/runtime/sys_darwin.go:390 +0x39
runtime.semasleep(0xffffffffffffffff, 0x7ffeefbff570)
	/usr/local/Cellar/go/1.14.3/libexec/src/runtime/os_darwin.go:63 +0x85
runtime.notesleep(0x1a61088)
	/usr/local/Cellar/go/1.14.3/libexec/src/runtime/lock_sema.go:173 +0xe0
runtime.stopm()
	/usr/local/Cellar/go/1.14.3/libexec/src/runtime/proc.go:1828 +0xc0
runtime.findrunnable(0xc000044000, 0x0)
	/usr/local/Cellar/go/1.14.3/libexec/src/runtime/proc.go:2360 +0xa0d
runtime.schedule()
	/usr/local/Cellar/go/1.14.3/libexec/src/runtime/proc.go:2520 +0x2fc
runtime.park_m(0xc0001a2180)
	/usr/local/Cellar/go/1.14.3/libexec/src/runtime/proc.go:2690 +0x9d
runtime.mcall(0x10658c6)
	/usr/local/Cellar/go/1.14.3/libexec/src/runtime/asm_amd64.s:318 +0x5b

goroutine 1 [chan receive]:
testing.tRunner.func1(0xc0000c1200)
	/usr/local/Cellar/go/1.14.3/libexec/src/testing/testing.go:956 +0x238
testing.tRunner(0xc0000c1200, 0xc00006fd78)
	/usr/local/Cellar/go/1.14.3/libexec/src/testing/testing.go:995 +0x10b
testing.runTests(0xc00000e6e0, 0x1a5a180, 0x5d, 0x5d, 0x0)
	/usr/local/Cellar/go/1.14.3/libexec/src/testing/testing.go:1282 +0x2a7
testing.(*M).Run(0xc000190000, 0x0)
	/usr/local/Cellar/go/1.14.3/libexec/src/testing/testing.go:1199 +0x15f
github.com/chromedp/chromedp.TestMain(0xc000190000)
	/Users/tmm1/fancybits/chromedp/chromedp_test.go:82 +0x138
main.main()
	_testmain.go:252 +0x135

goroutine 6 [select]:
github.com/chromedp/chromedp.(*ExecAllocator).Allocate(0xc0000322d0, 0x16a2200, 0xc00019e090, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/Users/tmm1/fancybits/chromedp/allocate.go:190 +0x96a
github.com/chromedp/chromedp.Run(0x16a2200, 0xc00019e090, 0x0, 0x0, 0x0, 0x1, 0xc0000738f0)
	/Users/tmm1/fancybits/chromedp/chromedp.go:228 +0x209
github.com/chromedp/chromedp.TestExecAllocatorFailReuse(0xc0000c1320)
	/Users/tmm1/fancybits/chromedp/allocate_test.go:123 +0x1c2
testing.tRunner(0xc0000c1320, 0x15fa428)
	/usr/local/Cellar/go/1.14.3/libexec/src/testing/testing.go:991 +0xdc
created by testing.(*T).Run
	/usr/local/Cellar/go/1.14.3/libexec/src/testing/testing.go:1042 +0x357

goroutine 7 [chan receive]:
testing.runTests.func1.1(0xc0000c1200)
	/usr/local/Cellar/go/1.14.3/libexec/src/testing/testing.go:1289 +0x3b
created by testing.runTests.func1
	/usr/local/Cellar/go/1.14.3/libexec/src/testing/testing.go:1289 +0xac

goroutine 8 [chan receive]:
github.com/chromedp/chromedp.NewContext.func1(0x16a2200, 0xc00019e090, 0xc0000b04d0)
	/Users/tmm1/fancybits/chromedp/chromedp.go:130 +0x78
created by github.com/chromedp/chromedp.NewContext
	/Users/tmm1/fancybits/chromedp/chromedp.go:129 +0x22a

goroutine 9 [select]:
os/exec.(*Cmd).Start.func2(0xc0000ce420)
	/usr/local/Cellar/go/1.14.3/libexec/src/os/exec/exec.go:449 +0xc4
created by os/exec.(*Cmd).Start
	/usr/local/Cellar/go/1.14.3/libexec/src/os/exec/exec.go:448 +0x6d0

goroutine 10 [chan receive]:
github.com/chromedp/chromedp.(*ExecAllocator).Allocate.func2(0x16a2200, 0xc00019e090, 0xc0000ce420, 0xc0000ce401, 0xc0000907e0, 0x60, 0xc0000b04d0, 0xc0000322d0)
	/Users/tmm1/fancybits/chromedp/allocate.go:200 +0x4c
created by github.com/chromedp/chromedp.(*ExecAllocator).Allocate
	/Users/tmm1/fancybits/chromedp/allocate.go:199 +0xa2d

goroutine 12 [select]:
os/exec.(*Cmd).Start.func2(0xc0000ce580)
	/usr/local/Cellar/go/1.14.3/libexec/src/os/exec/exec.go:449 +0xc4
created by os/exec.(*Cmd).Start
	/usr/local/Cellar/go/1.14.3/libexec/src/os/exec/exec.go:448 +0x6d0

rax    0x104
rbx    0x2
rcx    0x7ffeefbff398
rdx    0x200
rdi    0x1a612c8
rsi    0x30100000400
rbp    0x7ffeefbff430
rsp    0x7ffeefbff398
r8     0x0
r9     0xa0
r10    0x0
r11    0x202
r12    0x1a612c8
r13    0x16
r14    0x30100000400
r15    0x7365dc0
rip    0x7fff70108882
rflags 0x203
cs     0x7
fs     0x0
gs     0x0
FAIL	github.com/chromedp/chromedp	1.787s

May 20 '20 23:05 tmm1

@mvdan Any thoughts here? I can put together a PR if we agree on approach.

May 27 '20 21:05 tmm1

Agreed, please send a PR. Two thoughts:

Think about whether the fix would be racy; for example, in the sequence of Run(ctx); cancel(); Run(ctx), since the cancellation work is asynchronous, we don't want it to possibly race with the second Run.
The test shouldn't use a script, because that's not portable. What you can do is call os.Args[0] with a special env var like CHROMEDP_TEST_COMMAND=fail, then in TestMain you do if os.Getenv("CHROMEDP_TEST_COMMAND") == "fail" { os.Exit(1) }. This is effectively reusing the test binary to run some Go code. It's nice because it's portable, and doesn't require linking another binary.

May 27 '20 21:05 mvdan

chromedp chromedp copied to clipboard

allocator gets stuck randomly

chromedp
chromedp copied to clipboard