Burrow does not exit after panic

Open reimai opened this issue 2 years ago • 0 comments

Version: 1.6.0

Issue: Burrow hangs (stops responding, but does not exit) after failing to release its ZooKeeper lock:

2023-10-02 17:33:19.940 |   {"level":"info","ts":1696257198.8336904,"msg":"re-submitting `0` credentials after reconnect","type":"coordinator","name":"zookeeper"}
2023-10-02 17:33:19.940 |  {"level":"info","ts":1696257198.8336573,"msg":"authenticated: id=74567085257124526, timeout=6000","type":"coordinator","name":"zookeeper"}
2023-10-02 17:33:19.940 |  {"level":"info","ts":1696257198.811102,"msg":"starting session","type":"coordinator","name":"zookeeper"}
2023-10-02 17:33:19.940 |  {"level":"info","ts":1696257198.8110363,"msg":"Connected to [zk-ip1]:2181","type":"coordinator","name":"zookeeper"}
2023-10-02 17:33:18.938 | stderr   	/home/runner/work/Burrow/Burrow/core/internal/notifier/coordinator.go:272 +0x1f1
2023-10-02 17:33:18.938 | stderr   created by github.com/linkedin/Burrow/core/internal/notifier.(*Coordinator).Start
2023-10-02 17:33:18.938 | stderr   	/home/runner/work/Burrow/Burrow/core/internal/notifier/coordinator.go:328 +0x505
2023-10-02 17:33:18.938 | stderr   github.com/linkedin/Burrow/core/internal/notifier.(*Coordinator).manageEvalLoop(0xc0000f0380)
2023-10-02 17:33:18.934 | stderr   goroutine 115 [running]:
2023-10-02 17:33:18.934 | stderr
2023-10-02 17:33:18.934 | stderr   panic: Unable to release zookeeper lock after session expiration

It seems the panic was somehow recovered, because the "Burrow failed at" message was not printed. The process did not die until 10 minutes later, when I sent it a SIGTERM.

A similar thing happens if I start it locally, without access to zk:

{"level":"panic","ts":1696264692.487353,"msg":"Failure to start zookeeper","type":"coordinator","name":"zookeeper","error":"lookup zk-host on [zk-ip]:53: no such host"}
panic: Failure to start zookeeper [recovered]
	panic: Failure to start zookeeper

goroutine 1 [running]:
main.handleExit()
	/home/runner/work/Burrow/Burrow/main.go:63 +0xf8
panic({0xbeb8a0, 0xc0003a60b0})
	/opt/hostedtoolcache/go/1.20.1/x64/src/runtime/panic.go:884 +0x213
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x7f680a4c45e8?, {0x0?, 0x0?, 0xc000132020?})
	/home/runner/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:198 +0x65
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00011c000, {0xc000226180, 0x1, 0x1})
	/home/runner/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:264 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc000226000?, {0xd227e4?, 0x0?}, {0xc000226180, 0x1, 0x1})
	/home/runner/go/pkg/mod/go.uber.org/[email protected]/logger.go:258 +0x59
github.com/linkedin/Burrow/core/internal/zookeeper.(*Coordinator).Start(0xc00014a240)
	/home/runner/work/Burrow/Burrow/core/internal/zookeeper/coordinator.go:87 +0x42b
github.com/linkedin/Burrow/core.Start(0xc000084540?, 0xc0001a5ef0?)
	/home/runner/work/Burrow/Burrow/core/burrow.go:158 +0x49b
main.main()
	/home/runner/work/Burrow/Burrow/main.go:114 +0x4d2

There are no logs after that, and the process stays alive. This time the panic has clearly been recovered. I would very much like Burrow to exit on network problems, so that my orchestration can restart it.
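To illustrate the behavior I am asking for, here is a minimal sketch, not Burrow's actual code. The names `exitOnPanic` and `run` are hypothetical helpers I made up for this example. The idea is that any recovered panic ends with a non-zero exit instead of leaving the process running. Since `recover` only affects the goroutine it is called in, a handler like this would have to be deferred in every goroutine that can panic, not only in `main`.

```go
package main

import (
	"fmt"
	"os"
)

// exitOnPanic is a hypothetical helper (not part of Burrow). When deferred,
// it recovers a panic in the current goroutine, reports it on stderr, and
// calls exit with a non-zero code. The exit function is passed in so the
// behavior can be exercised without terminating the process.
func exitOnPanic(exit func(int)) {
	if r := recover(); r != nil {
		fmt.Fprintln(os.Stderr, "fatal:", r)
		exit(1)
	}
}

// run executes fn with exitOnPanic deferred around it.
func run(fn func(), exit func(int)) {
	defer exitOnPanic(exit)
	fn()
}

func main() {
	// Simulates a startup failure; the process ends with exit status 1.
	run(func() {
		panic("Failure to start zookeeper")
	}, os.Exit)
}
```

Note that `os.Exit` terminates immediately without running remaining deferred functions, so any cleanup that must happen would need to run before the exit call.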

Oct 02 '23 16:10 reimai