
TestConnectivity/testOVSFlowReplay flakiness

Open antoninbas opened this issue 3 years ago • 1 comments

Describe the bug: I observed this e2e test failing during an execution of the "E2e tests on a Kind cluster on Linux with Antrea-native policies disabled" Kind job.

2022-04-14T23:04:20.2313389Z === RUN   TestConnectivity/testOVSFlowReplay
2022-04-14T23:04:20.2316656Z     connectivity_test.go:379: Creating 2 busybox test Pods on 'kind-worker'
2022-04-14T23:04:20.2468806Z     connectivity_test.go:75: Waiting for Pods to be ready and retrieving IPs
2022-04-14T23:04:35.2520694Z     connectivity_test.go:89: Retrieved all Pod IPs: map[test-pod-0-tdqpoygz:IPv4(10.10.1.9),IPstrings(10.10.1.9) test-pod-1-l3k7mx0w:IPv4(10.10.1.8),IPstrings(10.10.1.8)]
2022-04-14T23:04:35.2521416Z     connectivity_test.go:98: Ping mesh test between all Pods
2022-04-14T23:04:39.3061520Z     connectivity_test.go:115: Ping 'antrea-test/test-pod-0-tdqpoygz' -> 'antrea-test/test-pod-1-l3k7mx0w': OK
2022-04-14T23:04:43.3647605Z     connectivity_test.go:115: Ping 'antrea-test/test-pod-1-l3k7mx0w' -> 'antrea-test/test-pod-0-tdqpoygz': OK
2022-04-14T23:04:43.3680521Z     connectivity_test.go:395: The Antrea Pod for Node 'kind-worker' is 'antrea-agent-8gkrf'
2022-04-14T23:04:43.4281662Z     connectivity_test.go:404: Counted 108 flow in OVS bridge 'br-int' for Node 'kind-worker'
2022-04-14T23:04:43.4965946Z     connectivity_test.go:414: Counted 6 group in OVS bridge 'br-int' for Node 'kind-worker'
2022-04-14T23:04:43.4966684Z     connectivity_test.go:422: Deleting flows / groups and restarting OVS daemons on Node 'kind-worker'
2022-04-14T23:04:43.9287198Z     connectivity_test.go:446: Restarted OVS with ovs-ctl: stdout:  * Saving flows
2022-04-14T23:04:43.9287975Z          * Exiting ovsdb-server (45)
2022-04-14T23:04:43.9288362Z          * Starting ovsdb-server
2022-04-14T23:04:43.9288746Z          * Configuring Open vSwitch system IDs
2022-04-14T23:04:43.9289137Z          * Exiting ovs-vswitchd (97)
2022-04-14T23:04:43.9289508Z          * Starting ovs-vswitchd
2022-04-14T23:04:43.9289820Z          * Restoring saved flows
2022-04-14T23:04:43.9290140Z          * Enabling remote OVSDB managers
2022-04-14T23:04:43.9291145Z          - stderr: 2022-04-14T23:04:43Z|00001|vconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: version negotiation failed (we support version 0x05, peer supports versions 0x01, 0x04)
2022-04-14T23:04:43.9291829Z         ovs-ofctl: br-int: failed to connect to socket (Broken pipe)
2022-04-14T23:04:43.9292765Z         2022-04-14T23:04:43Z|00001|vconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: version negotiation failed (we support version 0x05, peer supports versions 0x01, 0x04)
2022-04-14T23:04:43.9293406Z         ovs-ofctl: br-int: failed to connect to socket (Broken pipe)
2022-04-14T23:04:43.9294349Z         2022-04-14T23:04:43Z|00001|vconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: version negotiation failed (we support version 0x05, peer supports versions 0x01, 0x04)
2022-04-14T23:04:43.9295016Z         ovs-ofctl: br-int: failed to connect to socket (Protocol error)
2022-04-14T23:04:43.9296066Z         2022-04-14T23:04:43Z|00001|vconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: version negotiation failed (we support version 0x05, peer supports versions 0x01, 0x04)
2022-04-14T23:04:43.9296701Z         ovs-ofctl: br-int: failed to connect to socket (Broken pipe)
2022-04-14T23:04:43.9297227Z     connectivity_test.go:451: Running second ping mesh to check that flows have been restored
2022-04-14T23:04:43.9299420Z     connectivity_test.go:75: Waiting for Pods to be ready and retrieving IPs
2022-04-14T23:04:45.9359264Z     connectivity_test.go:89: Retrieved all Pod IPs: map[test-pod-0-tdqpoygz:IPv4(10.10.1.9),IPstrings(10.10.1.9) test-pod-1-l3k7mx0w:IPv4(10.10.1.8),IPstrings(10.10.1.8)]
2022-04-14T23:04:45.9359892Z     connectivity_test.go:98: Ping mesh test between all Pods
2022-04-14T23:04:49.9959168Z     connectivity_test.go:115: Ping 'antrea-test/test-pod-0-tdqpoygz' -> 'antrea-test/test-pod-1-l3k7mx0w': OK
2022-04-14T23:04:54.0443096Z     connectivity_test.go:115: Ping 'antrea-test/test-pod-1-l3k7mx0w' -> 'antrea-test/test-pod-0-tdqpoygz': OK
2022-04-14T23:04:54.1076331Z     connectivity_test.go:404: Counted 103 flow in OVS bridge 'br-int' for Node 'kind-worker'
2022-04-14T23:04:54.1844760Z     connectivity_test.go:414: Counted 6 group in OVS bridge 'br-int' for Node 'kind-worker'
2022-04-14T23:04:54.1845241Z     connectivity_test.go:455: 
2022-04-14T23:04:54.1845646Z         	Error Trace:	connectivity_test.go:455
2022-04-14T23:04:54.1846236Z         	            				connectivity_test.go:66
2022-04-14T23:04:54.1846648Z         	Error:      	Not equal: 
2022-04-14T23:04:54.1847009Z         	            	expected: 108
2022-04-14T23:04:54.1847435Z         	            	actual  : 103
2022-04-14T23:04:54.1847853Z         	Test:       	TestConnectivity/testOVSFlowReplay
2022-04-14T23:04:54.1848330Z         	Messages:   	Mismatch in OVS flow count after flow replay
2022-04-14T23:04:54.1848863Z     fixtures.go:407: Deleting Pod 'test-pod-1-l3k7mx0w'
2022-04-14T23:04:54.1875728Z     fixtures.go:407: Deleting Pod 'test-pod-0-tdqpoygz'

Note that the OVS error logs are harmless and can be ignored. However, there is a flow count mismatch after the replay (108 flows before, 103 after). The test is hard to troubleshoot at the current log levels. I propose dumping the flows in case of test failure: if the test fails again, it will be easy to troubleshoot and fix the test to avoid future flakiness.

antoninbas avatar Apr 15 '22 19:04 antoninbas

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

github-actions[bot] avatar Jul 15 '22 00:07 github-actions[bot]