libnetwork
deadlock in peerAddOp()
A customer ran into hung container operations. From the stack dump, here is the goroutine that hangs:
goroutine 11387 [chan send, 1570 minutes]:
github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay.(*driver).peerInit(0xc44ba78500, 0xc4353174c0, 0x19)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay/peerdb.go:302 +0xc6
github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay.(*driver).initSandboxPeerDB(0xc44ba78500, 0xc4353174c0, 0x19)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay/peerdb.go:250 +0x41
github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay.(*network).joinSandbox.func1(0xc433f80460, 0xc42670ba07)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay/ov_network.go:320 +0x6f
github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay.(*network).joinSandbox(0xc433f80460, 0xc441e283c0, 0x0, 0x0, 0x0)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay/ov_network.go:352 +0x196
github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay.(*driver).peerAddOp(0xc44ba78500, 0xc4353174c0, 0x19, 0xc4339ed500, 0x40, 0xc457c16030, 0x10, 0x10, 0xc457c1600c, 0x4, ...)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay/peerdb.go:388 +0x337
github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay.(*driver).peerOpRoutine(0xc44ba78500, 0x5637981193c0, 0xc44ba7f2c0, 0xc44bf76180)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay/peerdb.go:287 +0x361
created by github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay.Init
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/drivers/overlay/overlay.go:78 +0x1a7
peerOpRoutine() was processing a peerAddOp(), which had been triggered by a message from the peerOpCh channel. peerAddOp() called joinSandbox(), which eventually sent a peerInit message to the same peerOpCh channel. But peerOpRoutine(), the goroutine that receives from peerOpCh, was still busy inside peerAddOp(), so it could never receive the peerInit message. The goroutine blocked on itself, which is why the dump shows [chan send, 1570 minutes].
This seems to be a rare corner case: the network sandbox has not yet been initialized when peerAdd is called.
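To make the failure mode concrete, here is a minimal, self-contained Go sketch of the same pattern. It is not libnetwork code: `op`, `processor`, `run`, and the `inline` and `sandboxInit` fields are hypothetical names. One goroutine drains an unbuffered channel (like peerOpRoutine()). Handling an "add" while the sandbox is not initialized requires an "init" first (like peerAddOp() -> joinSandbox() -> peerInit()). Re-queuing that "init" onto the same channel can never succeed. A timeout is added only so the demo terminates; the real code has none. Running the nested "init" directly on the same goroutine is shown as one possible way out. This is an assumption for illustration, not the actual upstream fix.

```go
package main

import (
	"fmt"
	"time"
)

// op is a hypothetical stand-in for a peer operation (peerAdd, peerInit, ...).
type op struct{ kind string }

// processor mimics the shape of peerOpRoutine(): a single goroutine that
// receives ops from an unbuffered channel and handles them one at a time.
type processor struct {
	ch          chan op
	inline      bool     // hypothetical switch: run nested work directly instead of re-queuing it
	sandboxInit bool     // whether the (simulated) sandbox is already initialized
	handled     []string // kinds of ops handled, in order
	stuck       bool     // set when the nested send times out
}

func (p *processor) handle(o op) {
	p.handled = append(p.handled, o.kind)
	switch o.kind {
	case "init":
		p.sandboxInit = true
	case "add":
		if p.sandboxInit {
			return // common case: no nested work, so no deadlock
		}
		// Rare corner case: the sandbox is not initialized, so "init" must run first.
		if p.inline {
			p.handle(op{kind: "init"}) // same goroutine, no channel round-trip
			return
		}
		select {
		case p.ch <- op{kind: "init"}:
			// Never taken: this goroutine is the only receiver of p.ch.
		case <-time.After(100 * time.Millisecond):
			// The real code has no timeout, so it would block here forever.
			p.stuck = true
		}
	}
}

// run feeds a single "add" op through a fresh processor and waits for it.
func run(inline, sandboxInit bool) *processor {
	p := &processor{ch: make(chan op), inline: inline, sandboxInit: sandboxInit}
	done := make(chan struct{})
	go func() {
		p.handle(<-p.ch) // handles exactly one op, like one loop iteration
		close(done)
	}()
	p.ch <- op{kind: "add"}
	<-done // close(done) orders the goroutine's writes before our reads
	return p
}

func main() {
	for _, c := range []struct{ inline, sandboxInit bool }{
		{false, true}, {false, false}, {true, false},
	} {
		p := run(c.inline, c.sandboxInit)
		fmt.Printf("inline=%v sandboxInit=%v handled=%v stuck=%v\n",
			c.inline, c.sandboxInit, p.handled, p.stuck)
	}
}
```

The sketch reports `stuck=true` only in the re-queue case with an uninitialized sandbox, which matches the observation that the hang needs the sandbox to be uninitialized when peerAdd arrives.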