libnetwork

controller deadlock

Open · jaroslawr opened this issue 3 years ago · 4 comments

docker daemon version 20.10.11 regularly runs into a deadlock related to the controller struct, with the blocked goroutines looking like this:

goroutine 7261778 [semacquire, 144 minutes]:
sync.runtime_SemacquireMutex(0xc00da48afc, 0x556ce13bcc00, 0x1)
        /usr/local/go/src/runtime/sema.go:71 +0x49
sync.(*Mutex).lockSlow(0xc00da48af8)
        /usr/local/go/src/sync/mutex.go:138 +0x10f
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:81
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).SandboxDestroy(0xc00da48a00, 0xc0130f50c0, 0x40, 0xc000241ea0, 0xc0095c2000)
        /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/controller.go:1241 +0x1bc
github.com/docker/docker/daemon.(*Daemon).allocateNetwork(0xc00000c3c0, 0xc00a95f680, 0x0, 0x0)
        /go/src/github.com/docker/docker/daemon/container_operations.go:540 +0xcf
github.com/docker/docker/daemon.(*Daemon).initializeNetworking(0xc00000c3c0, 0xc00a95f680, 0x0, 0x0)
        /go/src/github.com/docker/docker/daemon/container_operations.go:987 +0x90
github.com/docker/docker/daemon.(*Daemon).containerStart(0xc00000c3c0, 0xc00a95f680, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0)
        /go/src/github.com/docker/docker/daemon/start.go:149 +0x325

0xc00da48afc is the mutex that is not being released. I cannot tell what is holding it, but I attach two goroutine dumps taken a few minutes apart: goroutine-stacks.zip

> grep 0xc00da48afc goroutine-stacks-2022-01-04T122803Z.log | wc -l
1349
> grep 0xc00da48afc goroutine-stacks-2022-01-04T122803Z.log | head -n5
sync.runtime_SemacquireMutex(0xc00da48afc, 0x556ce13bcc00, 0x1)
sync.runtime_SemacquireMutex(0xc00da48afc, 0x556ce13bcc00, 0x1)
sync.runtime_SemacquireMutex(0xc00da48afc, 0x556ce13bcc00, 0x1)
sync.runtime_SemacquireMutex(0xc00da48afc, 0x556ce13bcc00, 0x1)
sync.runtime_SemacquireMutex(0xc00da48afc, 0x556ce13bcc00, 0x1)
> grep 0xc00da48afc goroutine-stacks-2022-01-04T123842Z.log | wc -l
1405
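
Not required for this report, but in case it helps with triaging dumps like these: below is a sketch of a small helper (hypothetical, not part of docker or libnetwork; the usage string and file argument are made up) that groups blocked goroutines by the semaphore address they are waiting on, so the most contended mutex stands out:

package main

// Hypothetical helper: counts how many goroutines in a stack dump are blocked
// on each mutex semaphore address.

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"sort"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: waiters <goroutine-stacks.log>")
		os.Exit(1)
	}

	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// Matches lines like:
	// sync.runtime_SemacquireMutex(0xc00da48afc, 0x556ce13bcc00, 0x1)
	re := regexp.MustCompile(`sync\.runtime_SemacquireMutex\((0x[0-9a-f]+)`)

	counts := map[string]int{}
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // stack frames can be long lines
	for sc.Scan() {
		if m := re.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}

	// Print the most-contended mutex addresses first.
	addrs := make([]string, 0, len(counts))
	for a := range counts {
		addrs = append(addrs, a)
	}
	sort.Slice(addrs, func(i, j int) bool { return counts[addrs[i]] > counts[addrs[j]] })
	for _, a := range addrs {
		fmt.Printf("%s  %d waiters\n", a, counts[a])
	}
}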

jaroslawr · Jan 04 '22 13:01

I am not sure, but I guess the goroutine holding the 0xc00da48afc mutex is this one:

goroutine 6226 [semacquire, 199 minutes]:
sync.runtime_SemacquireMutex(0xc000d35a34, 0x0, 0x1)
        /usr/local/go/src/runtime/sema.go:71 +0x49
sync.(*Mutex).lockSlow(0xc000d35a30)
        /usr/local/go/src/sync/mutex.go:138 +0x10f
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:81
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*network).ID(0xc000d35880, 0x0, 0x0)
        /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/network.go:254 +0xad
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).processEndpointDelete(0xc00da48a00, 0xc007b30720, 0xc00477ab00)
        /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/store.go:403 +0x97
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).watchLoop(0xc00da48a00)
        /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/store.go:434 +0xf8
created by github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).startWatch
        /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/store.go:447 +0xd8

It is itself blocked on another lock...
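
For context, the frame at network.go:254 is (*network).ID, which, judging by the trace, takes the per-network mutex before returning the id. Roughly like this (a reconstruction from the stack trace, field names assumed, not a verbatim copy of network.go):

package libnetwork

import "sync"

// Reconstruction only: the accessor the watchLoop goroutine is stuck in
// cannot return until it acquires the per-network mutex.
type network struct {
	sync.Mutex
	id string
	// other fields elided
}

func (n *network) ID() string {
	n.Lock()
	defer n.Unlock()
	return n.id
}

That would mean this goroutine holds 0xc00da48afc (taken earlier in processEndpointDelete) while waiting for the network mutex, so anything that holds the network mutex while asking for the controller mutex would close the cycle.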

jaroslawr · Jan 04 '22 14:01

@thaJeztah I think the lock ordering introduced in PR #2633 was possibly not applied to the processEndpointDelete method; currently the method does this:

func (c *controller) processEndpointDelete(nmap map[string]*netWatch, ep *endpoint) {
	n := ep.getNetwork()
	if !c.isDistributedControl() && n.Scope() == datastore.SwarmScope && n.driverIsMultihost() {
		return
	}

	c.Lock()
	nw, ok := nmap[n.ID()]

but following the logic of that PR, I guess it should say something like:

func (c *controller) processEndpointDelete(nmap map[string]*netWatch, ep *endpoint) {
	n := ep.getNetwork()
	if !c.isDistributedControl() && n.Scope() == datastore.SwarmScope && n.driverIsMultihost() {
		return
	}

	networkID := n.ID()
	c.Lock()
	nw, ok := nmap[networkID]

(calling n.ID() before c.Lock())

jaroslawr · Jan 04 '22 15:01

@jaroslawr We also found the deadlock in processEndpointDelete; it has a lock-ordering conflict with the getSvcRecords function:

func (n *network) getSvcRecords(ep *endpoint) []etchosts.Record {
	n.Lock()
	defer n.Unlock()

	if ep == nil {
		return nil
	}

	var recs []etchosts.Record

	epName := ep.Name()

	n.ctrlr.Lock()
	defer n.ctrlr.Unlock()
	sr, ok := n.ctrlr.svcRecords[n.id]
	if !ok || sr.svcMap == nil {
		return nil
	}

getSvcRecords locks in the order network->controller, but processEndpointDelete locks in the reverse order controller->network, and this inverted ordering causes the deadlock.
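
To make the inverted ordering concrete, here is a minimal, self-contained sketch (toy code, not libnetwork) of the same pattern: one goroutine takes the controller mutex and then the network mutex, the other takes them in the opposite order, and both stall:

package main

import (
	"fmt"
	"sync"
	"time"
)

type controller struct{ sync.Mutex }

type network struct{ sync.Mutex }

func main() {
	c := &controller{}
	n := &network{}
	done := make(chan struct{}, 2)

	// Analogous to processEndpointDelete: controller lock first, then the
	// network lock (taken inside accessors like n.ID()).
	go func() {
		c.Lock()
		time.Sleep(10 * time.Millisecond) // widen the window so the inversion always hits
		n.Lock()                          // blocks: the other goroutine holds n
		n.Unlock()
		c.Unlock()
		done <- struct{}{}
	}()

	// Analogous to getSvcRecords: network lock first, then the controller lock.
	go func() {
		n.Lock()
		time.Sleep(10 * time.Millisecond)
		c.Lock() // blocks: the other goroutine holds c
		c.Unlock()
		n.Unlock()
		done <- struct{}{}
	}()

	for i := 0; i < 2; i++ {
		select {
		case <-done:
		case <-time.After(time.Second):
			fmt.Println("deadlock: each goroutine is waiting for the mutex the other holds")
			return
		}
	}
	fmt.Println("no deadlock")
}

With the ordering fix proposed above (reading n.ID() into a local before taking the controller lock), the processEndpointDelete side never holds both mutexes at once, so the cycle cannot form.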

Our goroutine stack log with the deadlock:

        /usr/local/go/src/sync/mutex.go:138 +0xec
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:81
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*network).ID(0x40007faa80, 0x0, 0x0)
        /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/network.go:254 +0xc8
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).processEndpointDelete(0x400062a400, 0x4000857740, 0x40059f8160)
        /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/store.go:403 +0x9c
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*controller).watchLoop(0x400062a400)
        /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/store.go:434 +0xb8

sync.(*Mutex).lockSlow(0x400062a4f8)
        /usr/local/go/src/sync/mutex.go:138 +0xec
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:81
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*network).getSvcRecords(0x40007faa80, 0x40059f82c0, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/network.go:1487 +0x56c
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*endpoint).sbLeave(0x40059f82c0, 0x4005a98b40, 0x0, 0x0, 0x0, 0x0, 0x40004be6e8, 0x1193934)
        /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/endpoint.go:785 +0x520
github.com/docker/docker/vendor/github.com/docker/libnetwork.(*endpoint).Leave(0x4005aaedc0, 0x2eda9a0, 0x4005a98b40, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/endpoint.go:707 +0x13c

tossmilestone · Apr 18 '22 08:04

@tossmilestone I think this repository is not maintained anymore; libnetwork is now maintained as part of moby/moby, where this problem was fixed in:

https://github.com/moby/moby/commit/7c97896747726554165480d102d9e46c54334cba
https://github.com/moby/moby/commit/0c1a125644f4626b294cdd4b5862a728b1d974a7

The fixes are not part of any official release yet; they will be part of release 22.

jaroslawr · Apr 18 '22 10:04