swarmkit icon indicating copy to clipboard operation
swarmkit copied to clipboard

Tests failures on master

Open abronan opened this issue 7 years ago • 10 comments

Hi there,

Just an FYI, I simply tried running the tests on master but I get recurrent failures. Interestingly, I don't get these failures if I go way back in the commit SHA (say, commit 2c47a1eacc5d30deb8c5fab4f43fdb224392f8eb, where I get no errors).

The first one is network related:

ok  	github.com/docker/swarmkit/manager	4.489s
time="2017-12-22T19:46:32+01:00" level=error msg="failed allocating network another_unIque_id during init" error="plugin store is uninitialized"
time="2017-12-22T19:46:32+01:00" level=error msg="failed allocating network one_unIque_id during init" error="plugin store is uninitialized"
time="2017-12-22T19:46:32+01:00" level=error msg="failed allocating task predTaskID1 during init" error="network one_unIque_id attached to task predTaskID1 not allocated yet"
time="2017-12-22T19:46:32+01:00" level=error msg="failed allocating task predTaskID2 during init" error="network another_unIque_id attached to task predTaskID2 not allocated yet"
time="2017-12-22T19:46:32+01:00" level=error msg="task allocation failure" error="failed to retrieve network testID3 while allocating task testTaskID3"
time="2017-12-22T19:46:32+01:00" level=error msg="Failed allocation for network testID5" error="failed while allocating driver state for network testID5: could not obtain vxlan id for pool 10.0.4.0/24: requested bit is already allocated"
time="2017-12-22T19:46:33+01:00" level=error msg="task allocation failure" error="network testID5 attached to task testTaskID6 not allocated yet"
time="2017-12-22T19:46:33+01:00" level=error msg="Failed allocation for service testServiceID4" error="requested bit is already allocated"
time="2017-12-22T19:46:33+01:00" level=error msg="task allocation failure" error="service testServiceID4 to which this task testTaskID7 belongs has pending allocations"
time="2017-12-22T19:46:38+01:00" level=error msg="Failed allocation for network bridge" error="plugin store is uninitialized"
--- FAIL: TestNodeAllocator (5.02s)
	allocator_test.go:1080: timed out before watchNetwork found expected network state
FAIL
FAIL	github.com/docker/swarmkit/manager/allocator	10.985s

The second one is on raft's TestReadRepairWAL test:

ok  	github.com/docker/swarmkit/manager/state/raft/membership	12.744s
2017-12-22 19:47:16.064983 N | wal: repairing /var/folders/qn/p32tj9m55316gyl1771qzpv80000gn/T/waltests666632251/0000000000000000-0000000000000000.wal
time="2017-12-22T19:47:16+01:00" level=info msg="repaired WAL error" error="unexpected EOF"
2017-12-22 19:47:16.075312 W | wal: ignored file 0000000000000000-0000000000000000.wal.broken in wal
--- FAIL: TestReadRepairWAL (1.49s)
        Error Trace:    walwrap_test.go:249
	Error:		Received unexpected error invalid argument
			irreparable WAL error
			github.com/docker/swarmkit/manager/state/raft/storage.ReadRepairWAL
				/Users/abronan/go/src/github.com/docker/swarmkit/manager/state/raft/storage/walwrap.go:179
			github.com/docker/swarmkit/manager/state/raft/storage.TestReadRepairWAL
				/Users/abronan/go/src/github.com/docker/swarmkit/manager/state/raft/storage/walwrap_test.go:248
			testing.tRunner
				/usr/local/Cellar/go/1.9.2/libexec/src/testing/testing.go:746
			runtime.goexit
				/usr/local/Cellar/go/1.9.2/libexec/src/runtime/asm_amd64.s:2337

FAIL
FAIL	github.com/docker/swarmkit/manager/state/raft/storage	18.314s

Both these failures happen every time I run the tests.

Now I'm not sure if there is anything special about my setup that would make the tests fail (ran these tests on a Mac), but I get something similar running this in the official Golang image with some variations.

For the first part related to networking:

ok  	github.com/docker/swarmkit/manager	2.909s
time="2017-12-22T18:58:51Z" level=error msg="task allocation failure" error="failed to retrieve network testID3 while allocating task testTaskID3"
time="2017-12-22T18:58:52Z" level=error msg="Failed allocation for network testID5" error="failed while allocating driver state for network testID5: could not obtain vxlan id for pool 10.0.4.0/24: requested bit is already allocated"
time="2017-12-22T18:58:52Z" level=error msg="task allocation failure" error="network testID5 attached to task testTaskID6 not allocated yet"
time="2017-12-22T18:58:52Z" level=error msg="Failed allocation for service testServiceID4" error="requested bit is already allocated"
time="2017-12-22T18:58:52Z" level=error msg="task allocation failure" error="service testServiceID4 to which this task testTaskID7 belongs has pending allocations"
--- FAIL: TestAllocatorRestoreForDuplicateIPs (5.01s)
	allocator_test.go:1146: timed out before watchTask found expected task state goroutine 332 [running]:
		runtime/debug.Stack(0xc4215d9b50, 0x1, 0x0)
			/usr/local/go/src/runtime/debug/stack.go:24 +0xb5
		github.com/docker/swarmkit/manager/allocator.watchTask(0xc4201d81e0, 0xc421765f80, 0xc4201d2480, 0x0, 0xc4215d9d08)
			/go/src/github.com/docker/swarmkit/manager/allocator/allocator_test.go:1146 +0x3c6
		github.com/docker/swarmkit/manager/allocator.TestAllocatorRestoreForDuplicateIPs(0xc4201d81e0)
			/go/src/github.com/docker/swarmkit/manager/allocator/allocator_test.go:794 +0xd0e
		testing.tRunner(0xc4201d81e0, 0xfb70f8)
			/usr/local/go/src/testing/testing.go:746 +0x16d
		created by testing.(*T).Run
			/usr/local/go/src/testing/testing.go:789 +0x569
--- FAIL: TestNodeAllocator (0.02s)
	allocator_test.go:1047: timed out before watchNode found expected node state
FAIL
FAIL	github.com/docker/swarmkit/manager/allocator	9.783s

For raft's part:

ok  	github.com/docker/swarmkit/manager/state/raft/membership	10.738s
--- FAIL: TestCreateOpenInvalidDirFails (0.01s)
        Error Trace:    walwrap_test.go:198
	Error:		An error is expected but got nil.

2017-12-22 18:59:54.474686 N | wal: repairing /tmp/waltests771829171/0000000000000000-0000000000000000.wal
time="2017-12-22T18:59:54Z" level=info msg="repaired WAL error" error="unexpected EOF"
2017-12-22 18:59:54.477136 W | wal: ignored file 0000000000000000-0000000000000000.wal.broken in wal
FAIL
FAIL	github.com/docker/swarmkit/manager/state/raft/storage	0.762s

I'll have to try running these tests on my Linux machine and check, just thought this would be helpful to share as is.

abronan avatar Dec 22 '17 19:12 abronan

Seems like it's just mac OS/Docker for Mac related as I have the build succeeding on my ArchLinux machine but the build failing on two different mac OS machines (on High-Sierra).

abronan avatar Jan 01 '18 13:01 abronan

Seems like it's just mac OS/Docker for Mac related

I don't think either of these test paths are machine specific, so this is a bit of a surprise.

Would you be willing to do a more accurate bisection? :P

stevvooe avatar Jan 02 '18 22:01 stevvooe

Yeah seems surprising to me as well. Sorry I didn't have much time recently, I was pretty busy. I started working on extracting the raft/store packages for another use, I should be able to find out what's going on here depending on how much time I have allocated for this.

Still encountering the error (on the raft part at least) after extracting the core parts of it.

abronan avatar Feb 05 '18 11:02 abronan

I can already say that upgrading the etcd/raft dependency to v3.3.0 fixes the issue somehow. Still a mystery why though, have to visit etcd's tree to figure out what broke in that specific version vendored in swarmkit.

abronan avatar Feb 05 '18 14:02 abronan

Thanks for filing this! I've encountered the same issues, and also on 2 different High Sierra machines, so was wondering why. I still get the WAL read/repair error on https://github.com/docker/swarmkit/commit/2c47a1eacc5d30deb8c5fab4f43fdb224392f8eb, but not the network allocator error.

cyli avatar Feb 23 '18 01:02 cyli

I wonder if it's https://github.com/coreos/etcd/pull/8921, possibly something changed in high sierra?

cyli avatar Feb 23 '18 18:02 cyli

@cyli That would make sense yes, as updating the vendoring to v3.3.0 included this change on my side. I haven't tried vendoring earlier versions with High-Sierra yet to see if I was encountering the same breakage though.

Also, welcome back 🎉 🙂

abronan avatar Feb 23 '18 19:02 abronan

For TestReadRepairWAL, I'm think it is https://github.com/coreos/etcd/pull/8921, because if I pull only those changes in that test passes for me in master.

cyli avatar Feb 26 '18 23:02 cyli

We can upgrade the raft vendoring to v3.3.0, but that seems somewhat risky without a full review to look for incompatibilities. Other possibilities are:

  • we pull in those changes specifically to fix the one TestReadRepairWAL test, and vendor a fork
  • we skip the test on mac os. :|

cyli avatar Mar 14 '18 18:03 cyli

Bit of a fun ride; opened a PR to fix TestCreateOpenInvalidDirFails; https://github.com/moby/swarmkit/pull/3087

thaJeztah avatar Nov 19 '22 17:11 thaJeztah