libnetwork
controller deadlock
Docker daemon version 20.10.11 regularly runs into a deadlock related to the controller struct, with the blocked goroutines looking like this:
goroutine 7261778 [semacquire, 144 minutes]:
sync.runtime_SemacquireMutex(0xc00da48afc, 0x556ce13bcc00, 0x1)
/usr/local/go/src/runtime/sema.go:71 +0x49
sync.(*Mutex).lockSlow(0xc00da48af8)
/usr/local/go/src/sync/mutex.go:138 +0x10f
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:81
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).SandboxDestroy(0xc00da48a00, 0xc0130f50c0, 0x40, 0xc000241ea0, 0xc0095c2000)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/controller.go:1241 +0x1bc
github.com/docker/docker/daemon.(*Daemon).allocateNetwork(0xc00000c3c0, 0xc00a95f680, 0x0, 0x0)
/go/src/github.com/docker/docker/daemon/container_operations.go:540 +0xcf
github.com/docker/docker/daemon.(*Daemon).initializeNetworking(0xc00000c3c0, 0xc00a95f680, 0x0, 0x0)
/go/src/github.com/docker/docker/daemon/container_operations.go:987 +0x90
github.com/docker/docker/daemon.(*Daemon).containerStart(0xc00000c3c0, 0xc00a95f680, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0)
/go/src/github.com/docker/docker/daemon/start.go:149 +0x325
0xc00da48afc is the mutex that is not being released. I cannot tell what is holding it, but I am attaching two goroutine dumps taken a few minutes apart:
goroutine-stacks.zip
> grep 0xc00da48afc goroutine-stacks-2022-01-04T122803Z.log | wc -l
1349
> grep 0xc00da48afc goroutine-stacks-2022-01-04T122803Z.log | head -n5
sync.runtime_SemacquireMutex(0xc00da48afc, 0x556ce13bcc00, 0x1)
sync.runtime_SemacquireMutex(0xc00da48afc, 0x556ce13bcc00, 0x1)
sync.runtime_SemacquireMutex(0xc00da48afc, 0x556ce13bcc00, 0x1)
sync.runtime_SemacquireMutex(0xc00da48afc, 0x556ce13bcc00, 0x1)
sync.runtime_SemacquireMutex(0xc00da48afc, 0x556ce13bcc00, 0x1)
> grep 0xc00da48afc goroutine-stacks-2022-01-04T123842Z.log | wc -l
1405
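For what it's worth, here is a small Go helper equivalent to the grep above: it scans a goroutine dump and counts how many goroutines are blocked on each semaphore address, which makes a heavily contended mutex like 0xc00da48afc stand out immediately. The dump file name and line format are assumptions based on the logs attached here, not anything provided by libnetwork:

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	f, err := os.Open("goroutine-stacks-2022-01-04T122803Z.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Matches lines like: sync.runtime_SemacquireMutex(0xc00da48afc, ...)
	re := regexp.MustCompile(`sync\.runtime_SemacquireMutex\((0x[0-9a-f]+)`)
	counts := map[string]int{}

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // stack lines can be long
	for sc.Scan() {
		if m := re.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
	for addr, n := range counts {
		fmt.Printf("%s\t%d goroutines blocked\n", addr, n)
	}
}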
I am not sure, but I suspect the goroutine holding the 0xc00da48afc mutex is this one:
goroutine 6226 [semacquire, 199 minutes]:
sync.runtime_SemacquireMutex(0xc000d35a34, 0x0, 0x1)
/usr/local/go/src/runtime/sema.go:71 +0x49
sync.(*Mutex).lockSlow(0xc000d35a30)
/usr/local/go/src/sync/mutex.go:138 +0x10f
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:81
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*network).ID(0xc000d35880, 0x0, 0x0)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/network.go:254 +0xad
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).processEndpointDelete(0xc00da48a00, 0xc007b30720, 0xc00477ab00)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/store.go:403 +0x97
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).watchLoop(0xc00da48a00)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/store.go:434 +0xf8
created by github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).startWatch
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/store.go:447 +0xd8
It is itself blocked on another lock...
@thaJeztah I think the lock ordering introduced in PR #2633 was possibly not applied to the processEndpointDelete method. Currently the method does this:
func (c *controller) processEndpointDelete(nmap map[string]*netWatch, ep *endpoint) {
	n := ep.getNetwork()
	if !c.isDistributedControl() && n.Scope() == datastore.SwarmScope && n.driverIsMultihost() {
		return
	}

	c.Lock()
	nw, ok := nmap[n.ID()]
but following the logic of that PR, I guess it should say something like:
func (c *controller) processEndpointDelete(nmap map[string]*netWatch, ep *endpoint) {
	n := ep.getNetwork()
	if !c.isDistributedControl() && n.Scope() == datastore.SwarmScope && n.driverIsMultihost() {
		return
	}

	networkID := n.ID()

	c.Lock()
	nw, ok := nmap[networkID]
(calling n.ID() before c.Lock(), so the network lock is never taken while the controller lock is held)
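To make the pattern concrete, here is a minimal, self-contained sketch of the same idea; the types, fields and the lookupNetWatch helper below are illustrative stand-ins, not the real libnetwork code. The point is that n.ID() takes and releases only the network's own lock before the controller lock is acquired, so this path never holds both locks at once:

package sketch // hypothetical sketch, not the real libnetwork package

import "sync"

type netWatch struct{} // placeholder for the real type

type network struct {
	sync.Mutex
	id string
}

// ID takes only the network's own lock.
func (n *network) ID() string {
	n.Lock()
	defer n.Unlock()
	return n.id
}

type controller struct {
	sync.Mutex
	nmap map[string]*netWatch
}

// lookupNetWatch is a hypothetical helper showing the fixed ordering.
func (c *controller) lookupNetWatch(n *network) (*netWatch, bool) {
	networkID := n.ID() // network lock taken and released here...

	c.Lock() // ...so it is never held together with the controller lock
	defer c.Unlock()
	nw, ok := c.nmap[networkID]
	return nw, ok
}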
@jaroslawr We also found the deadlock in processEndpointDelete, and it has a lock-ordering conflict with the getSvcRecords function:
func (n *network) getSvcRecords(ep *endpoint) []etchosts.Record {
	n.Lock()
	defer n.Unlock()

	if ep == nil {
		return nil
	}

	var recs []etchosts.Record

	epName := ep.Name()

	n.ctrlr.Lock()
	defer n.ctrlr.Unlock()
	sr, ok := n.ctrlr.svcRecords[n.id]
	if !ok || sr.svcMap == nil {
		return nil
	}
getSvcRecords takes the locks in the order network -> controller, while processEndpointDelete takes them in the reverse order, controller -> network; when the two run concurrently this lock-order inversion can deadlock.
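To show why this reversed ordering ends in a deadlock rather than just a data race, here is a minimal, runnable Go reproduction with two plain mutexes standing in for the controller and network locks; the function names merely mirror the call sites above and are not the real libnetwork code:

package main

import (
	"sync"
	"time"
)

var (
	controllerMu sync.Mutex // stands in for the controller mutex
	networkMu    sync.Mutex // stands in for the network mutex
)

// Mirrors processEndpointDelete: controller lock first, then the network
// lock (taken implicitly inside n.ID()).
func processEndpointDelete() {
	controllerMu.Lock()
	defer controllerMu.Unlock()
	time.Sleep(10 * time.Millisecond) // widen the window so both goroutines hold their first lock
	networkMu.Lock()
	defer networkMu.Unlock()
}

// Mirrors getSvcRecords: network lock first, then the controller lock.
func getSvcRecords() {
	networkMu.Lock()
	defer networkMu.Unlock()
	time.Sleep(10 * time.Millisecond)
	controllerMu.Lock()
	defer controllerMu.Unlock()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); processEndpointDelete() }()
	go func() { defer wg.Done(); getSvcRecords() }()
	wg.Wait() // never returns; the runtime reports "all goroutines are asleep - deadlock!"
}

Taking the locks in one agreed order everywhere, or copying the needed fields (such as the network ID) before acquiring the second lock as proposed above, breaks the cycle.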
Our goroutine stack log with the deadlock:
/usr/local/go/src/sync/mutex.go:138 +0xec
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:81
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*network).ID(0x40007faa80, 0x0, 0x0)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/network.go:254 +0xc8
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).processEndpointDelete(0x400062a400, 0x4000857740, 0x40059f8160)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/store.go:403 +0x9c
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).watchLoop(0x400062a400)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/store.go:434 +0xb8

sync.(*Mutex).lockSlow(0x400062a4f8)
/usr/local/go/src/sync/mutex.go:138 +0xec
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:81
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*network).getSvcRecords(0x40007faa80, 0x40059f82c0, 0x0, 0x0, 0x0)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/network.go:1487 +0x56c
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*endpoint).sbLeave(0x40059f82c0, 0x4005a98b40, 0x0, 0x0, 0x0, 0x0, 0x40004be6e8, 0x1193934)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/endpoint.go:785 +0x520
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*endpoint).Leave(0x4005aaedc0, 0x2eda9a0, 0x4005a98b40, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/endpoint.go:707 +0x13c
@tossmilestone I think this repository is no longer maintained; libnetwork is now maintained as part of moby/moby, where this problem was fixed in:
https://github.com/moby/moby/commit/7c97896747726554165480d102d9e46c54334cba
https://github.com/moby/moby/commit/0c1a125644f4626b294cdd4b5862a728b1d974a7
The fixes are not part of any official release yet; they will be part of release 22.