
test(e2e): Implement retriers for tests that could fail due to flakiness

Open nddq opened this issue 1 year ago • 2 comments

nddq avatar Jun 06 '24 19:06 nddq

I dug through some recent failures to find the common causes and came up with the following, in order of recency:


1

At least 2 occurrences, and the most recent failure

2024/10/08 19:37:49 executing command "nslookup kubernetes.default" on pod "agnhost-basic-dns-port-forward-2360488725034690767-0" in namespace "kube-system"...
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:77
        	Error:      	Received unexpected error:
        	            	did not expect error from step ExecInPod but got error: error executing command [nslookup kubernetes.default]: error executing command: error dialing backend: EOF
        	Test:       	TestE2ERetina

2

2024/09/20 15:47:12 checking for metrics on http://localhost:10093/metrics
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:69
        	Error:      	Received unexpected error:
        	            	did not expect error from step ValidateAdvancedDNSRequestMetrics but got error: failed to verify advance dns request metrics networkobservability_adv_dns_request_count: failed to get prometheus metrics: could not start port forward within 300000000000s: HTTP request failed: Get "http://localhost:10093/metrics": dial tcp [::1]:10093: connect: connection refused	
        	Test:       	TestE2ERetina

3

2024/08/29 16:04:58 attempting to find pod with label "k8s-app=retina", on a node with a pod with label "app=agnhost-a"
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:65
        	Error:      	Received unexpected error:
        	            	did not expect error from step PortForward but got error: could not find pod with affinity: could not find a pod with label "k8s-app=retina", on a node that also has a pod with label "app=agnhost-a": no pod with label found with matching pod affinity
        	Test:       	TestE2ERetina

4

At least 2 occurrences

2024/08/27 17:47:16 failed to create cluster: context deadline exceeded
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:53
        	Error:      	Received unexpected error:
        	            	did not expect error from step CreateNPMCluster but got error: failed to create cluster: context deadline exceeded
        	Test:       	TestE2ERetina

5

2024/08/26 23:50:24 failed to find metric matching networkobservability_adv_dns_request_count: map[ip:10.224.4.108 namespace:kube-system podname:agnhost-adv-dns-port-forward-4243681485157638409-0 query:kubernetes.default.svc.cluster.local. query_type:A workload_kind:StatefulSet workload_name:agnhost-adv-dns-port-forward-4243681485157638409]
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:69
        	Error:      	Received unexpected error:
        	            	did not expect error from step ValidateAdvancedDNSRequestMetrics but got error: failed to verify advance dns request metrics networkobservability_adv_dns_request_count: failed to get prometheus metrics: no metric found
        	Test:       	TestE2ERetina

6

Most of the older errors, at least 5 occurrences

2024/08/23 15:53:02 Error received when checking status of resource retina-svc. Error: 'client rate limiter Wait returned an error: context deadline exceeded', Resource details: 'Resource: "/v1, Resource=services", GroupVersionKind: "/v1, Kind=Service"
Name: "retina-svc", Namespace: "kube-system"'
2024/08/23 15:53:02 Retryable error? true
2024/08/23 15:53:02 Retrying as current number of retries 0 less than max number of retries 30
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:65
        	Error:      	Received unexpected error:
        	            	did not expect error from step InstallHelmChart but got error: failed to install chart: context deadline exceeded
        	Test:       	TestE2ERetina

ritwikranjan avatar Oct 09 '24 17:10 ritwikranjan

Let's triage these one by one.

For 1

The error is most likely coming from this step definition:

		{
			Step: &kubernetes.ExecInPod{
				PodName:      podName,
				PodNamespace: "kube-system",
				Command:      req.Command,
			},
			Opts: &types.StepOptions{
				ExpectError:               req.ExpectError,
				SkipSavingParametersToJob: true,
			},
		},

The ExecInPod step does not have any built-in retry mechanism, so any failure in this step causes the whole test to fail. We should add a retry strategy to make command execution resilient. Potential tool: https://pkg.go.dev/k8s.io/client-go/util/retry
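Something along these lines could work. This is only a minimal sketch using retry.OnError from that package; execOnce is a hypothetical stand-in for the existing single-attempt exec logic, and the backoff values are illustrative:

    package main

    import (
        "errors"
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/util/retry"
    )

    // execOnce is a hypothetical stand-in for the existing single-attempt
    // ExecInPod logic; in the real step this would run the command once.
    func execOnce() error {
        return errors.New("error dialing backend: EOF") // simulate a transient failure
    }

    func main() {
        backoff := wait.Backoff{
            Steps:    5,               // at most 5 attempts
            Duration: 2 * time.Second, // initial delay between attempts
            Factor:   2.0,             // exponential back-off
            Jitter:   0.1,             // spread attempts slightly
        }
        err := retry.OnError(backoff, func(err error) bool {
            // Retry on everything for the sketch; real code could match only
            // transient failures such as "error dialing backend".
            return true
        }, execOnce)
        fmt.Println("final result:", err)
    }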

I also think we would benefit from adding this around almost every k8s operation that has a chance of failing due to network issues. The same solution can be used for the 3rd one as well.


For 2

We already have retries on port forwarding. I can see if they can be tuned, but beyond that I don't see an obvious fix for this kind of intermittent issue. My idea would be to spread the retries out in order to rule out network-related issues.
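As a rough illustration of spreading the retries out (the interval and timeout values are assumptions, not the framework's current settings), the step could poll the forwarded endpoint until it answers before scraping metrics:

    package main

    import (
        "context"
        "fmt"
        "net/http"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    func main() {
        ctx := context.Background()
        // Poll the forwarded metrics endpoint every 10s for up to 5 minutes
        // instead of retrying in quick succession.
        err := wait.PollUntilContextTimeout(ctx, 10*time.Second, 5*time.Minute, true,
            func(ctx context.Context) (bool, error) {
                resp, err := http.Get("http://localhost:10093/metrics")
                if err != nil {
                    return false, nil // connection refused etc. => keep waiting
                }
                resp.Body.Close()
                return resp.StatusCode == http.StatusOK, nil
            })
        fmt.Println("port-forward ready:", err == nil)
    }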


For 4,

Maybe increase the time limit and see if the issue persists. The downside is that the test would be stuck for longer in case of an actual failure.
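For illustration, widening the deadline is a small change at the step's call site; createNPMCluster and the durations below are placeholders, not the real framework code:

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // createNPMCluster is a placeholder for the real cluster-creation step.
    func createNPMCluster(ctx context.Context) error {
        select {
        case <-time.After(2 * time.Second): // stand-in for the actual work
            return nil
        case <-ctx.Done():
            return ctx.Err() // surfaces "context deadline exceeded" on timeout
        }
    }

    func main() {
        // Widen the deadline for cluster creation, e.g. from 10 to 30 minutes.
        // Tradeoff: a genuinely broken run now blocks the pipeline for longer.
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
        defer cancel()

        if err := createNPMCluster(ctx); err != nil {
            fmt.Println("failed to create cluster:", err)
        }
    }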


For 6

We are being throttled by the k8s API server ("client rate limiter Wait returned an error"), so I would reduce the polling frequency so we stop exhausting the client-side rate limiter and the remaining polls within the deadline actually go through. We would have the same tradeoff as the previous one. I would also like to log the state of the retina agents on each attempt and maybe decide at runtime whether we need to end the test.
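A rough sketch of what a slower poll with per-attempt agent logging could look like; the kubeconfig loading, label selector, and timings are assumptions for illustration, not the framework's existing helpers:

    package main

    import (
        "context"
        "fmt"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        clientset, err := kubernetes.NewForConfig(config)
        if err != nil {
            panic(err)
        }

        ctx := context.Background()
        // Poll every 30s (instead of a tight loop) so the client-side rate
        // limiter is not exhausted, and log the retina agent pods each try.
        err = wait.PollUntilContextTimeout(ctx, 30*time.Second, 20*time.Minute, true,
            func(ctx context.Context) (bool, error) {
                pods, err := clientset.CoreV1().Pods("kube-system").List(ctx,
                    metav1.ListOptions{LabelSelector: "k8s-app=retina"})
                if err != nil {
                    return false, nil // transient API errors => retry
                }
                ready := len(pods.Items) > 0
                for _, p := range pods.Items {
                    fmt.Printf("retina agent %s phase=%s\n", p.Name, p.Status.Phase)
                    if p.Status.Phase != "Running" {
                        ready = false
                    }
                }
                return ready, nil
            })
        fmt.Println("retina agents ready:", err == nil)
    }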

ritwikranjan avatar Oct 10 '24 10:10 ritwikranjan

Number 6 was already addressed by increasing the timeout from 8 min to 20 min, so not much more is needed there. I have addressed 1, 2 & 4 in PR #867.

ritwikranjan avatar Oct 16 '24 13:10 ritwikranjan

We have fixed all identified flakiness, closing the issue!

ritwikranjan avatar Oct 24 '24 08:10 ritwikranjan