swarmkit
fix crash when there are attachments with null network
Fix for a crash that I caught while playing with a swarm cluster. Here is the stack trace of the crash:
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x20257db]
goroutine 5668 [running]:
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).allocateNode(0xc0012db840, 0x2bcf040, 0xc001360600, 0xc00008b040, 0x0, 0xc001a8a310, 0x1, 0x1, 0x0)
C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:1057 +0x80b
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).allocateNodes(0xc0012db840, 0x2bcf040, 0xc001360600, 0x0, 0x0, 0x0)
C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:489 +0x3ae
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).doNetworkInit(0xc0012db840, 0x2bcf040, 0xc001360600, 0x0, 0x0)
C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:166 +0x7d1
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).doNetworkInit-fm(0x2bcf040, 0xc001360600, 0x0, 0x0)
C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:105 +0x52
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).init(0xc0012db840, 0x2bcf040, 0xc001360600, 0xc0019ceba0, 0x0, 0x0, 0x0, 0x0)
C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:174 +0x69e
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).Run.func2(0x0, 0x0)
C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:122 +0xf9
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).Run(0xc0012db840, 0x2bcf040, 0xc0001920c0, 0x0, 0x0)
C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:142 +0x62e
github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).becomeLeader.func6(0x2bcf040, 0xc0001920c0, 0xc0012db840)
C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:1059 +0x56
created by github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).becomeLeader
C:/Users/docker/go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:1058 +0xf6c
The stack trace is very similar to the issue described here: https://github.com/moby/moby/issues/38287
In my case I had a swarm cluster with 3 managers, then I demoted 2 of them. After this, docker started crashing 100% of the time on the remaining manager node.
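For context, a minimal sketch of the kind of nil guard that avoids this panic (not the actual diff): the types and the `safeAttachmentIDs` helper are simplified stand-ins for the real swarmkit objects and the attachment loop in `allocateNode`.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the swarmkit API objects involved
// in the crash; the real types live in swarmkit's api package.
type Network struct{ ID string }
type NetworkAttachment struct{ Network *Network }
type Node struct{ Attachments []*NetworkAttachment }

// safeAttachmentIDs mirrors the guarded loop: it skips attachments whose
// Network is nil instead of dereferencing attach.Network.ID and panicking.
func safeAttachmentIDs(node *Node, nwIDs map[string]struct{}) []string {
	var found []string
	for _, attach := range node.Attachments {
		if attach == nil || attach.Network == nil {
			continue // attachment with no network: nothing to allocate
		}
		if _, ok := nwIDs[attach.Network.ID]; ok {
			found = append(found, attach.Network.ID)
		}
	}
	return found
}

func main() {
	node := &Node{Attachments: []*NetworkAttachment{
		{Network: &Network{ID: "ingress"}},
		{Network: nil}, // this entry is what triggers the panic without the guard
	}}
	fmt.Println(safeAttachmentIDs(node, map[string]struct{}{"ingress": {}}))
}
```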
Codecov Report
Merging #2819 into master will increase coverage by 0.01%. The diff coverage is 0%.
@@ Coverage Diff @@
## master #2819 +/- ##
==========================================
+ Coverage 61.94% 61.95% +0.01%
==========================================
Files 139 139
Lines 22120 22122 +2
==========================================
+ Hits 13702 13706 +4
- Misses 6937 6939 +2
+ Partials 1481 1477 -4
ping @dperny @anshulpundir PTAL
whoops, needs a rebase @andrey-ko 🤗
i wonder if this is related to #2764...?
> i wonder if this is related to #2764...?
Doesn't look like it would cause the issue, because in "restoring" mode, it would return before reaching that part: https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L1041-L1043
In the "non-restore" case, the attachment would be created (if a given network was not yet found in the list of attachments): https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L1020-L1028
I do wonder though if (in the "restoring" case) we should remove such attachments in (or before) the first loop: https://github.com/docker/swarmkit/blob/5be0152d20e2b514ebdfc56b04d727454a7021bb/manager/allocator/network.go#L995-L1007
(i.e., remove all empty attachments first)
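A rough sketch of that idea, using the same simplified stand-in types as the earlier sketch; `pruneEmptyAttachments` is a hypothetical helper, not code from this PR.

```go
package main

import "fmt"

// Simplified stand-ins for the swarmkit objects, as in the sketch above.
type Network struct{ ID string }
type NetworkAttachment struct{ Network *Network }

// pruneEmptyAttachments drops attachments with a nil Network before the
// allocator loops over them, so later code can safely dereference
// attach.Network without extra checks.
func pruneEmptyAttachments(attachments []*NetworkAttachment) []*NetworkAttachment {
	kept := attachments[:0]
	for _, attach := range attachments {
		if attach != nil && attach.Network != nil {
			kept = append(kept, attach)
		}
	}
	return kept
}

func main() {
	attachments := []*NetworkAttachment{
		{Network: &Network{ID: "overlay-1"}},
		nil,
		{Network: nil},
	}
	fmt.Println(len(pruneEmptyAttachments(attachments))) // 1
}
```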
I don't want to merge this without a root cause. The network allocator is absurdly fragile, and making changes without understanding them almost always causes further breakage. I'm looking at the code now.
@andrey-ko Have you reproduced this crash anywhere?
Is it possible that a network was deleted around the same time that the leadership change occurred?
Are we even sure it's the attachment network that's nil, and not the attachment itself? The line includes `if _, ok := nwIDs[attach.Network.ID]`, which means either the attachment OR its Network could be nil.
I hate the allocator.
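For what it's worth, the two cases can be told apart with explicit checks before the map lookup. This is only a sketch with simplified types and hypothetical log messages, not the allocator's actual code.

```go
package main

import "log"

// Simplified stand-ins for the swarmkit objects, as in the sketches above.
type Network struct{ ID string }
type NetworkAttachment struct{ Network *Network }

// inspectAttachment distinguishes the two possible nil dereferences hidden in
// nwIDs[attach.Network.ID]: a nil attachment versus a nil attach.Network.
func inspectAttachment(nodeID string, attach *NetworkAttachment) (networkID string, ok bool) {
	switch {
	case attach == nil:
		log.Printf("node %s: nil attachment", nodeID)
		return "", false
	case attach.Network == nil:
		log.Printf("node %s: attachment with nil network", nodeID)
		return "", false
	default:
		return attach.Network.ID, true
	}
}

func main() {
	// An attachment with an unset Network field, the suspected crash case.
	if _, ok := inspectAttachment("node-1", &NetworkAttachment{}); !ok {
		log.Print("skipping attachment with no usable network")
	}
}
```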
> Is it possible that a network was deleted around the same time that the leadership change occurred?
It definitely happened after the leadership change occurred, but I don't remember if I deleted any network around the same time.
> Have you reproduced this crash anywhere?
I'm going to try to reproduce it again, but it may take some time; I don't remember the exact sequence of events that led to this.
One of the reasons for adding this check was that there is a similar check a little earlier in this function: https://github.com/docker/swarmkit/blob/bf7b2cb3d8a1d10fa96e836239ff6abf1459badd/manager/allocator/network.go#L1003, so I assumed it is possible to have a null network in an attachment.
ping @dperny PTAL
Hi guys, I think I'm having the same issue on my production swarm. Almost every time I deploy/remove a stack I get the same error: first my docker leader exits, followed by the other managers. However, after waiting a certain period I am able to successfully deploy all the stacks to the swarm.
docker node ls:
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
ksxkgs5zt8akxe0ap3zr9dh4h UNI-VMTST002 Ready Active 18.03.1-ce
r64e1bdkgd8k5mdy0hwsr4xox * uni-vmsvr30010 Ready Active Reachable 19.03.2
2qdiplbssg4mbt578rt91kn5g uni-vmsvr30011 Ready Active Leader 18.03.1-ce
ibayugacan7nvfnqcutvgipeu uni-vmsvr30012 Ready Active Reachable 19.03.1
Logs:
Sep 10 20:38:28 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:28.329163232+08:00" level=info msg="manager selected by agent for new session: { }" module=node/agent node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:28 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:28.329197180+08:00" level=info msg="waiting 1.375068579s before registering session" module=node/agent node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347410569+08:00" level=info msg="3677eedf18e41c22 is starting a new election at term 4758" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347462517+08:00" level=info msg="3677eedf18e41c22 became candidate at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347482630+08:00" level=info msg="3677eedf18e41c22 received MsgVoteResp from 3677eedf18e41c22 at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347497730+08:00" level=info msg="3677eedf18e41c22 [logterm: 4758, index: 17917234] sent MsgVote request to d72e99397b72e11 at term 4759" module=raft node.id=r64e1bdk
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347510672+08:00" level=info msg="3677eedf18e41c22 [logterm: 4758, index: 17917234] sent MsgVote request to 543885594517f92f at term 4759" module=raft node.id=r64e1bd
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.347524057+08:00" level=info msg="raft.node: 3677eedf18e41c22 lost leader d72e99397b72e11 at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.350930604+08:00" level=info msg="3677eedf18e41c22 received MsgVoteResp from 543885594517f92f at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.350958464+08:00" level=info msg="3677eedf18e41c22 [quorum:2] has received 2 MsgVoteResp votes and 0 vote rejections" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.350981629+08:00" level=info msg="3677eedf18e41c22 became leader at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.350996379+08:00" level=info msg="raft.node: 3677eedf18e41c22 elected leader 3677eedf18e41c22 at term 4759" module=raft node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.352796360+08:00" level=error msg="error creating cluster object" error="name conflicts with an existing object" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.352915264+08:00" level=info msg="leadership changed from not yet part of a raft cluster to r64e1bdkgd8k5mdy0hwsr4xox" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.353048497+08:00" level=info msg="dispatcher starting" module=dispatcher node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.373961675+08:00" level=error msg="failed to delete task" error="object does not exist" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.384913365+08:00" level=error msg="failed to delete task" error="object does not exist" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.385125825+08:00" level=error msg="failed to delete task" error="object does not exist" module=node node.id=r64e1bdkgd8k5mdy0hwsr4xox
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442209977+08:00" level=error msg="IPAM failure while releasing IP address 10.0.11.2/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442256535+08:00" level=error msg="IPAM failure while releasing IP address 10.0.21.2/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442269586+08:00" level=error msg="IPAM failure while releasing IP address 10.0.13.4/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442279521+08:00" level=error msg="IPAM failure while releasing IP address 10.0.56.2/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442292200+08:00" level=error msg="error deallocating attachment for network kkcbtuhbcmnx1pw7xg3cfd98m on node 2qdiplbssg4mbt578rt91kn5g" error="could not find networ
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442308186+08:00" level=error msg="error deallocating attachment for network omala7bwk6ezqp8ay4d9n7pp7 on node 2qdiplbssg4mbt578rt91kn5g" error="could not find networ
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442321474+08:00" level=error msg="IPAM failure while releasing IP address 10.0.3.3/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: time="2019-09-10T20:38:31.442334972+08:00" level=error msg="IPAM failure while releasing IP address 10.0.16.8/24" error="invalid pool id: "
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: panic: runtime error: invalid memory address or nil pointer dereference
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x56513d90a1eb]
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: goroutine 16421 [running]:
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).allocateNode(0xc003ecb600, 0x56513eb51fe0, 0xc00479f1c0, 0xc005405e00, 0x0, 0xc00420ed80, 0x7, 0x8, 0
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:1055 +0x75b
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).allocateNodes(0xc003ecb600, 0x56513eb51fe0, 0xc00479f1c0, 0x0, 0x0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:489 +0x1dd
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).doNetworkInit(0xc003ecb600, 0x56513eb51fe0, 0xc00479f1c0, 0x0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/network.go:166 +0x4ab
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).init(0xc003ecb600, 0x56513eb51fe0, 0xc00479f1c0, 0xc00129f200, 0x0, 0x0, 0xc005c89ad0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:174 +0x496
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).Run.func2(0x0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:122 +0xbc
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator.(*Allocator).Run(0xc003ecb600, 0x56513eb51fe0, 0xc000fbd1c0, 0x0, 0x0)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/allocator/allocator.go:142 +0x391
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).becomeLeader.func6(0x56513eb51fe0, 0xc000fbd1c0, 0xc003ecb600)
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:1065 +0x5f
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: created by github.com/docker/docker/vendor/github.com/docker/swarmkit/manager.(*Manager).becomeLeader
Sep 10 20:38:31 uni-vmsvr30010 dockerd[25519]: /go/src/github.com/docker/docker/vendor/github.com/docker/swarmkit/manager/manager.go:1064 +0xf8e
Sep 10 20:38:31 uni-vmsvr30010 systemd[1]: docker.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 10 20:38:31 uni-vmsvr30010 systemd[1]: docker.service: Unit entered failed state.
Sep 10 20:38:31 uni-vmsvr30010 systemd[1]: docker.service: Failed with result 'exit-code'.
Sep 10 20:38:33 uni-vmsvr30010 systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Sep 10 20:38:33 uni-vmsvr30010 systemd[1]: Stopped Docker Application Container Engine.
Sep 10 20:38:33 uni-vmsvr30010 systemd[1]: Starting Docker Application Container Engine...
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.556082894+08:00" level=info msg="Starting up"
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.556988572+08:00" level=warning msg="[!] DON'T BIND ON ANY IP ADDRESS WITHOUT setting --tlsverify IF YOU DON'T KNOW WHAT YOU'RE DOING [!]"
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558032385+08:00" level=info msg="parsed scheme: \"unix\"" module=grpc
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558055991+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558078720+08:00" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 <nil>}] }" module=grpc
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558090946+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Sep 10 20:38:33 uni-vmsvr30010 dockerd[2852]: time="2019-09-10T20:38:33.558156800+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc00081cf90, CONNECTING" module=grpc