
fix crash when there are attachments with null network

Open andrey-ko opened this issue 6 years ago • 15 comments

This is a fix for a crash that I caught while playing with a swarm cluster. Here is the stack trace of the crash:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x20257db]

goroutine 5668 [running]:
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).allocateNode(0xc0012db840, 0x2bcf040, 0xc001360600, 0xc00008b040, 0x0, 0xc001a8a310, 0x1, 0x1, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:1057 +0x80b
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).allocateNodes(0xc0012db840, 0x2bcf040, 0xc001360600, 0x0, 0x0, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:489 +0x3ae
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).doNetworkInit(0xc0012db840, 0x2bcf040, 0xc001360600, 0x0, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:166 +0x7d1
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).doNetworkInit-fm(0x2bcf040, 0xc001360600, 0x0, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:105 +0x52
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).init(0xc0012db840, 0x2bcf040, 0xc001360600, 0xc0019ceba0, 0x0, 0x0, 0x0, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:174 +0x69e
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).Run.func2(0x0, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:122 +0xf9
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).Run(0xc0012db840, 0x2bcf040, 0xc0001920c0, 0x0, 0x0)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:142 +0x62e
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).becomeLeader.func6(0x2bcf040, 0xc0001920c0, 0xc0012db840)
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:1059 +0x56
created by github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).becomeLeader
    C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:1058 +0xf6c

The stack trace is very similar to the issue described here: https://github.com/moby/moby/issues/38287

In my case I had a swarm cluster with 3 managers, then I demoted 2 of them. After that, Docker started crashing 100% of the time on the manager node.

andrey-ko avatar Feb 08 '19 23:02 andrey-ko

Codecov Report

Merging #2819 into master will increase coverage by 0.01%. The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #2819      +/-   ##
==========================================
+ Coverage   61.94%   61.95%   +0.01%     
==========================================
  Files         139      139              
  Lines       22120    22122       +2     
==========================================
+ Hits        13702    13706       +4     
- Misses       6937     6939       +2     
+ Partials     1481     1477       -4

codecov[bot] avatar Feb 09 '19 00:02 codecov[bot]

ping @dperny @anshulpundir PTAL

thaJeztah avatar Feb 09 '19 14:02 thaJeztah

whoops, needs a rebase @andrey-ko 🤗

thaJeztah avatar Feb 12 '19 23:02 thaJeztah

i wonder if this is related to #2764...?

dperny avatar Feb 13 '19 20:02 dperny

i wonder if this is related to #2764...?

It doesn't look like it would cause the issue, because in "restoring" mode it would return before reaching that part: https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L1041-L1043

In the "non-restore" case, the attachment would be created (if a given network was not yet found in the list of attachments): https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L1020-L1028

I do wonder, though, if (in the "restoring" case) we should remove such attachments in (or before) the first loop: https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L995-L1007

(i.e., remove all empty attachments first)
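
For illustration, here is a minimal, self-contained sketch of that pre-filtering idea. The types are reduced stand-ins for the ones in github.com/docker/swarmkit/api, and dropEmptyAttachments is a hypothetical helper, not code from this PR:

package main

import "fmt"

// Reduced stand-ins for the swarmkit types involved; only the fields this
// example needs.
type Network struct {
    ID string
}

type NetworkAttachment struct {
    Network *Network
}

// dropEmptyAttachments keeps only attachments whose Network pointer is set,
// so a later loop can read attach.Network.ID without a nil pointer dereference.
func dropEmptyAttachments(attachments []*NetworkAttachment) []*NetworkAttachment {
    kept := attachments[:0]
    for _, attach := range attachments {
        if attach == nil || attach.Network == nil {
            continue // nothing to allocate or deallocate for this entry
        }
        kept = append(kept, attach)
    }
    return kept
}

func main() {
    attachments := []*NetworkAttachment{
        {Network: &Network{ID: "ingress"}},
        {Network: nil}, // the problematic shape from the stack trace
    }
    for _, attach := range dropEmptyAttachments(attachments) {
        fmt.Println(attach.Network.ID) // safe: nil networks were filtered out
    }
}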

thaJeztah avatar Feb 17 '19 14:02 thaJeztah

I don't want to merge this without a root cause. The network allocator is absurdly fragile, and making changes without understanding them almost always causes further breakage. I'm looking at the code now.

dperny avatar Feb 19 '19 20:02 dperny

@andrey-ko Have you reproduced this crash anywhere?

dperny avatar Feb 19 '19 20:02 dperny

Is it possible that a network was deleted around the same time that the leadership change occurred?

dperny avatar Feb 19 '19 20:02 dperny

Are we even sure it's the attachment's network that's nil, and not the attachment itself? The line includes if _, ok := nwIDs[attach.Network.ID], which means either the attachment OR its Network could be nil.
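
A quick illustration of that ambiguity, using stand-in types rather than the swarmkit ones: both nil shapes below would panic on the same expression, so the stack trace alone can't tell them apart, while an explicit check can:

package main

import "fmt"

type Network struct{ ID string }
type NetworkAttachment struct{ Network *Network }

// describe reports which pointer (if any) is nil before attach.Network.ID is
// evaluated; either nil case would otherwise panic on the exact same line.
func describe(attach *NetworkAttachment) string {
    switch {
    case attach == nil:
        return "attachment itself is nil"
    case attach.Network == nil:
        return "attachment exists, but its Network is nil"
    default:
        return "network ID: " + attach.Network.ID
    }
}

func main() {
    fmt.Println(describe(nil))
    fmt.Println(describe(&NetworkAttachment{}))
    fmt.Println(describe(&NetworkAttachment{Network: &Network{ID: "overlay1"}}))
}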

dperny avatar Feb 19 '19 20:02 dperny

I hate the allocator.

dperny avatar Feb 19 '19 20:02 dperny

Is it possible that a network was deleted around the same time that the leadership change occurred?

It definitely happened after the leadership change occurred, but I don't remember whether I deleted any network around the time this event happened.

andrey-ko avatar Feb 19 '19 20:02 andrey-ko

Have you reproduced this crash anywhere?

I'm going to try to reproduce it again, but it may take some time; I don't remember the exact sequence of events that led to this.

andrey-ko avatar Feb 19 '19 20:02 andrey-ko

One of the reasons for adding this check was that there is a similar check a little earlier in this function (https://github.com/docker/swarmkit/blob/bf7b2cb3d8a1d10fa96e836239ff6abf1459badd/manager/allocator/network.go#L1003), so I assumed it is possible to have a null network in an attachment.
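
For reference, a hedged sketch of the kind of guard being discussed (not the literal diff from this PR; nwIDs, the attachment list, and the stand-in types are assumed context for illustration):

package main

import "fmt"

type Network struct{ ID string }
type NetworkAttachment struct{ Network *Network }

func main() {
    // Assumed context: nwIDs holds the IDs of networks still being allocated
    // for the node, and attachments is the node's current attachment list.
    nwIDs := map[string]struct{}{"overlay1": {}}
    attachments := []*NetworkAttachment{
        {Network: &Network{ID: "overlay1"}},
        {Network: nil}, // the case the added check is meant to tolerate
    }

    for _, attach := range attachments {
        // The guard under discussion: skip attachments with no network,
        // mirroring the similar check earlier in the same function.
        if attach == nil || attach.Network == nil {
            continue
        }
        if _, ok := nwIDs[attach.Network.ID]; !ok {
            continue // network is not in the set being allocated
        }
        fmt.Println("would allocate attachment for network", attach.Network.ID)
    }
}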

andrey-ko avatar Feb 19 '19 21:02 andrey-ko

ping @dperny PTAL

thaJeztah avatar Mar 27 '19 10:03 thaJeztah

Hi guys, I think I'm having the same issue on my production swarm. Almost every time I deploy or remove a stack I get the same error: first my Docker leader exits, followed by the other managers. However, after waiting a certain period I am able to successfully deploy all the stacks to the swarm.

docker node ls:

ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
ksxkgs5zt8akxe0ap3zr9dh4h     UNI-VMTST002        Ready               Active                                  18.03.1-ce
r64e1bdkgd8k5mdy0hwsr4xox *   uni-vmsvr30010      Ready               Active              Reachable           19.03.2
2qdiplbssg4mbt578rt91kn5g     uni-vmsvr30011      Ready               Active              Leader              18.03.1-ce
ibayugacan7nvfnqcutvgipeu     uni-vmsvr30012      Ready               Active              Reachable           19.03.1

Logs:

Sep 10 20:38:28 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:28.329163232+08:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:28 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:28.329197180+08:00" level=info msg="waiting 1.375068579s before registering session" module=node/agent node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347410569+08:00" level=info msg="3677eedf18e41c22 is starting a new election at term 4758" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347462517+08:00" level=info msg="3677eedf18e41c22 became candidate at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347482630+08:00" level=info msg="3677eedf18e41c22 received MsgVoteResp from 3677eedf18e41c22 at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347497730+08:00" level=info msg="3677eedf18e41c22 [logterm: 4758, index: 17917234] sent MsgVote request to d72e99397b72e11 at term 4759" module=raft node.id=r64e1bdk
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347510672+08:00" level=info msg="3677eedf18e41c22 [logterm: 4758, index: 17917234] sent MsgVote request to 543885594517f92f at term 4759" module=raft node.id=r64e1bd
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347524057+08:00" level=info msg="raft.node: 3677eedf18e41c22 lost leader d72e99397b72e11 at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.350930604+08:00" level=info msg="3677eedf18e41c22 received MsgVoteResp from 543885594517f92f at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.350958464+08:00" level=info msg="3677eedf18e41c22 [quorum:2] has received 2 MsgVoteResp votes and 0 vote rejections" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.350981629+08:00" level=info msg="3677eedf18e41c22 became leader at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.350996379+08:00" level=info msg="raft.node: 3677eedf18e41c22 elected leader 3677eedf18e41c22 at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.352796360+08:00" level=error msg="error creating cluster object" error="name conflicts with an existing object" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.352915264+08:00" level=info msg="leadership changed from not yet part of a raft cluster to r64e1bdkgd8k5mdy0hwsr4xox" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.353048497+08:00" level=info msg="dispatcher starting" module=dispatcher node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.373961675+08:00" level=error msg="failed to delete task" error="object does not exist" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.384913365+08:00" level=error msg="failed to delete task" error="object does not exist" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.385125825+08:00" level=error msg="failed to delete task" error="object does not exist" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442209977+08:00" level=error msg="IPAM failure while releasing IP address 10.0.11.2/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442256535+08:00" level=error msg="IPAM failure while releasing IP address 10.0.21.2/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442269586+08:00" level=error msg="IPAM failure while releasing IP address 10.0.13.4/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442279521+08:00" level=error msg="IPAM failure while releasing IP address 10.0.56.2/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442292200+08:00" level=error msg="error deallocating attachment for network kkcbtuhbcmnx1pw7xg3cfd98m on node 2qdiplbssg4mbt578rt91kn5g" error="could not find networ
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442308186+08:00" level=error msg="error deallocating attachment for network omala7bwk6ezqp8ay4d9n7pp7 on node 2qdiplbssg4mbt578rt91kn5g" error="could not find networ
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442321474+08:00" level=error msg="IPAM failure while releasing IP address 10.0.3.3/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442334972+08:00" level=error msg="IPAM failure while releasing IP address 10.0.16.8/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: panic: runtime error: invalid memory address or nil pointer dereference
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x56513d90a1eb]
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: goroutine 16421 [running]:
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).allocateNode(0xc003ecb600, 0x56513eb51fe0, 0xc00479f1c0, 0xc005405e00, 0x0, 0xc00420ed80, 0x7, 0x8, 0
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:1055 +0x75b
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).allocateNodes(0xc003ecb600, 0x56513eb51fe0, 0xc00479f1c0, 0x0, 0x0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:489 +0x1dd
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).doNetworkInit(0xc003ecb600, 0x56513eb51fe0, 0xc00479f1c0, 0x0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:166 +0x4ab
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).init(0xc003ecb600, 0x56513eb51fe0, 0xc00479f1c0, 0xc00129f200, 0x0, 0x0, 0xc005c89ad0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:174 +0x496
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).Run.func2(0x0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:122 +0xbc
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).Run(0xc003ecb600, 0x56513eb51fe0, 0xc000fbd1c0, 0x0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:142 +0x391
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).becomeLeader.func6(0x56513eb51fe0, 0xc000fbd1c0, 0xc003ecb600)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:1065 +0x5f
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: created by github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).becomeLeader
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]:         /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:1064 +0xf8e
Sep 10 20:38:31 uni-vmsvr30010 systemd[1]: docker.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 10 20:38:31 uni-vmsvr30010 systemd[1]: docker.service: Unit entered failed state.
Sep 10 20:38:31 uni-vmsvr30010 systemd[1]: docker.service: Failed with result 'exit-code'.
Sep 10 20:38:33 uni-vmsvr30010 systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Sep 10 20:38:33 uni-vmsvr30010 systemd[1]: Stopped Docker Application Container Engine.
Sep 10 20:38:33 uni-vmsvr30010 systemd[1]: Starting Docker Application Container Engine...
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.556082894+08:00" level=info msg="Starting up"
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.556988572+08:00" level=warning msg="[!] DON'T BIND ON ANY IP ADDRESS WITHOUT setting --tlsverify IF YOU DON'T KNOW WHAT YOU'RE DOING [!]"
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558032385+08:00" level=info msg="parsed scheme: \"unix\"" module=grpc
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558055991+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558078720+08:00" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0  <nil>}] }" module=grpc
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558090946+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558156800+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc00081cf90, CONNECTING" module=grpc

zhonghuiwen avatar Sep 10 '19 14:09 zhonghuiwen