swarmkit
swarmkit copied to clipboard
Tests failures on master
Hi there,
Just an FYI, I simply tried running the tests on master but I get recurrent failures. Interestingly, I don't get these failures if I go way back in the commit SHA (say, commit 2c47a1eacc5d30deb8c5fab4f43fdb224392f8eb, where I get no errors).
The first one is network related:
ok github.com/docker/swarmkit/manager 4.489s
time="2017-12-22T19:46:32+01:00" level=error msg="failed allocating network another_unIque_id during init" error="plugin store is uninitialized"
time="2017-12-22T19:46:32+01:00" level=error msg="failed allocating network one_unIque_id during init" error="plugin store is uninitialized"
time="2017-12-22T19:46:32+01:00" level=error msg="failed allocating task predTaskID1 during init" error="network one_unIque_id attached to task predTaskID1 not allocated yet"
time="2017-12-22T19:46:32+01:00" level=error msg="failed allocating task predTaskID2 during init" error="network another_unIque_id attached to task predTaskID2 not allocated yet"
time="2017-12-22T19:46:32+01:00" level=error msg="task allocation failure" error="failed to retrieve network testID3 while allocating task testTaskID3"
time="2017-12-22T19:46:32+01:00" level=error msg="Failed allocation for network testID5" error="failed while allocating driver state for network testID5: could not obtain vxlan id for pool 10.0.4.0/24: requested bit is already allocated"
time="2017-12-22T19:46:33+01:00" level=error msg="task allocation failure" error="network testID5 attached to task testTaskID6 not allocated yet"
time="2017-12-22T19:46:33+01:00" level=error msg="Failed allocation for service testServiceID4" error="requested bit is already allocated"
time="2017-12-22T19:46:33+01:00" level=error msg="task allocation failure" error="service testServiceID4 to which this task testTaskID7 belongs has pending allocations"
time="2017-12-22T19:46:38+01:00" level=error msg="Failed allocation for network bridge" error="plugin store is uninitialized"
--- FAIL: TestNodeAllocator (5.02s)
allocator_test.go:1080: timed out before watchNetwork found expected network state
FAIL
FAIL github.com/docker/swarmkit/manager/allocator 10.985s
The second one is on raft's TestReadRepairWAL test:
ok github.com/docker/swarmkit/manager/state/raft/membership 12.744s
2017-12-22 19:47:16.064983 N | wal: repairing /var/folders/qn/p32tj9m55316gyl1771qzpv80000gn/T/waltests666632251/0000000000000000-0000000000000000.wal
time="2017-12-22T19:47:16+01:00" level=info msg="repaired WAL error" error="unexpected EOF"
2017-12-22 19:47:16.075312 W | wal: ignored file 0000000000000000-0000000000000000.wal.broken in wal
--- FAIL: TestReadRepairWAL (1.49s)
Error Trace: walwrap_test.go:249
Error: Received unexpected error invalid argument
irreparable WAL error
github.com/docker/swarmkit/manager/state/raft/storage.ReadRepairWAL
/Users/abronan/go/src/github.com/docker/swarmkit/manager/state/raft/storage/walwrap.go:179
github.com/docker/swarmkit/manager/state/raft/storage.TestReadRepairWAL
/Users/abronan/go/src/github.com/docker/swarmkit/manager/state/raft/storage/walwrap_test.go:248
testing.tRunner
/usr/local/Cellar/go/1.9.2/libexec/src/testing/testing.go:746
runtime.goexit
/usr/local/Cellar/go/1.9.2/libexec/src/runtime/asm_amd64.s:2337
FAIL
FAIL github.com/docker/swarmkit/manager/state/raft/storage 18.314s
Both these failures happen every time I run the tests.
Now I'm not sure if there is anything special about my setup that would make the tests fail (ran these tests on a Mac), but I get something similar running this in the official Golang image with some variations.
For the first part related to networking:
ok github.com/docker/swarmkit/manager 2.909s
time="2017-12-22T18:58:51Z" level=error msg="task allocation failure" error="failed to retrieve network testID3 while allocating task testTaskID3"
time="2017-12-22T18:58:52Z" level=error msg="Failed allocation for network testID5" error="failed while allocating driver state for network testID5: could not obtain vxlan id for pool 10.0.4.0/24: requested bit is already allocated"
time="2017-12-22T18:58:52Z" level=error msg="task allocation failure" error="network testID5 attached to task testTaskID6 not allocated yet"
time="2017-12-22T18:58:52Z" level=error msg="Failed allocation for service testServiceID4" error="requested bit is already allocated"
time="2017-12-22T18:58:52Z" level=error msg="task allocation failure" error="service testServiceID4 to which this task testTaskID7 belongs has pending allocations"
--- FAIL: TestAllocatorRestoreForDuplicateIPs (5.01s)
allocator_test.go:1146: timed out before watchTask found expected task state goroutine 332 [running]:
runtime/debug.Stack(0xc4215d9b50, 0x1, 0x0)
/usr/local/go/src/runtime/debug/stack.go:24 +0xb5
github.com/docker/swarmkit/manager/allocator.watchTask(0xc4201d81e0, 0xc421765f80, 0xc4201d2480, 0x0, 0xc4215d9d08)
/go/src/github.com/docker/swarmkit/manager/allocator/allocator_test.go:1146 +0x3c6
github.com/docker/swarmkit/manager/allocator.TestAllocatorRestoreForDuplicateIPs(0xc4201d81e0)
/go/src/github.com/docker/swarmkit/manager/allocator/allocator_test.go:794 +0xd0e
testing.tRunner(0xc4201d81e0, 0xfb70f8)
/usr/local/go/src/testing/testing.go:746 +0x16d
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:789 +0x569
--- FAIL: TestNodeAllocator (0.02s)
allocator_test.go:1047: timed out before watchNode found expected node state
FAIL
FAIL github.com/docker/swarmkit/manager/allocator 9.783s
For raft's part:
ok github.com/docker/swarmkit/manager/state/raft/membership 10.738s
--- FAIL: TestCreateOpenInvalidDirFails (0.01s)
Error Trace: walwrap_test.go:198
Error: An error is expected but got nil.
2017-12-22 18:59:54.474686 N | wal: repairing /tmp/waltests771829171/0000000000000000-0000000000000000.wal
time="2017-12-22T18:59:54Z" level=info msg="repaired WAL error" error="unexpected EOF"
2017-12-22 18:59:54.477136 W | wal: ignored file 0000000000000000-0000000000000000.wal.broken in wal
FAIL
FAIL github.com/docker/swarmkit/manager/state/raft/storage 0.762s
I'll have to try running these tests on my Linux machine and check, just thought this would be helpful to share as is.
Seems like it's just mac OS/Docker for Mac related as I have the build succeeding on my ArchLinux machine but the build failing on two different mac OS machines (on High-Sierra).
Seems like it's just mac OS/Docker for Mac related
I don't think either of these test paths are machine specific, so this is a bit of a surprise.
Would you be willing to do a more accurate bisection? :P
Yeah seems surprising to me as well. Sorry I didn't have much time recently, I was pretty busy. I started working on extracting the raft/store packages for another use, I should be able to find out what's going on here depending on how much time I have allocated for this.
Still encountering the error (on the raft part at least) after extracting the core parts of it.
I can already say that upgrading the etcd/raft dependency to v3.3.0 fixes the issue somehow. Still a mystery why though, have to visit etcd's tree to figure out what broke in that specific version vendored in swarmkit.
Thanks for filing this! I've encountered the same issues, and also on 2 different High Sierra machines, so was wondering why. I still get the WAL read/repair error on https://github.com/docker/swarmkit/commit/2c47a1eacc5d30deb8c5fab4f43fdb224392f8eb, but not the network allocator error.
I wonder if it's https://github.com/coreos/etcd/pull/8921, possibly something changed in high sierra?
@cyli That would make sense yes, as updating the vendoring to v3.3.0 included this change on my side. I haven't tried vendoring earlier versions with High-Sierra yet to see if I was encountering the same breakage though.
Also, welcome back 🎉 🙂
For TestReadRepairWAL, I'm think it is https://github.com/coreos/etcd/pull/8921, because if I pull only those changes in that test passes for me in master.
We can upgrade the raft vendoring to v3.3.0, but that seems somewhat risky without a full review to look for incompatibilities. Other possibilities are:
- we pull in those changes specifically to fix the one
TestReadRepairWALtest, and vendor a fork - we skip the test on mac os. :|
Bit of a fun ride; opened a PR to fix TestCreateOpenInvalidDirFails; https://github.com/moby/swarmkit/pull/3087