test(e2e): Implement retriers for tests that could fail due to flakiness
I dug through some recent failures to find the common causes and came up with the following, in order of recency:
1
At least 2 times, and the most recent
2024/10/08 19:37:49 executing command "nslookup kubernetes.default" on pod "agnhost-basic-dns-port-forward-2360488725034690767-0" in namespace "kube-system"...
runner.go:27:
Error Trace: /home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:77
Error: Received unexpected error:
did not expect error from step ExecInPod but got error: error executing command [nslookup kubernetes.default]: error executing command: error dialing backend: EOF
Test: TestE2ERetina
2
2024/09/20 15:47:12 checking for metrics on http://localhost:10093/metrics
runner.go:27:
Error Trace: /home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:69
Error: Received unexpected error:
did not expect error from step ValidateAdvancedDNSRequestMetrics but got error: failed to verify advance dns request metrics networkobservability_adv_dns_request_count: failed to get prometheus metrics: could not start port forward within 300000000000s: HTTP request failed: Get "http://localhost:10093/metrics": dial tcp [::1]:10093: connect: connection refused
Test: TestE2ERetina
3
2024/08/29 16:04:58 attempting to find pod with label "k8s-app=retina", on a node with a pod with label "app=agnhost-a"
runner.go:27:
Error Trace: /home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:65
Error: Received unexpected error:
did not expect error from step PortForward but got error: could not find pod with affinity: could not find a pod with label "k8s-app=retina", on a node that also has a pod with label "app=agnhost-a": no pod with label found with matching pod affinity
Test: TestE2ERetina
4
At least 2 times
2024/08/27 17:47:16 failed to create cluster: context deadline exceeded
runner.go:27:
Error Trace: /home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:53
Error: Received unexpected error:
did not expect error from step CreateNPMCluster but got error: failed to create cluster: context deadline exceeded
Test: TestE2ERetina
5
2024/08/26 23:50:24 failed to find metric matching networkobservability_adv_dns_request_count: map[ip:10.224.4.108 namespace:kube-system podname:agnhost-adv-dns-port-forward-4243681485157638409-0 query:kubernetes.default.svc.cluster.local. query_type:A workload_kind:StatefulSet workload_name:agnhost-adv-dns-port-forward-4243681485157638409]
runner.go:27:
Error Trace: /home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:69
Error: Received unexpected error:
did not expect error from step ValidateAdvancedDNSRequestMetrics but got error: failed to verify advance dns request metrics networkobservability_adv_dns_request_count: failed to get prometheus metrics: no metric found
Test: TestE2ERetina
6
Most of the older errors, at least 5 times
2024/08/23 15:53:02 Error received when checking status of resource retina-svc. Error: 'client rate limiter Wait returned an error: context deadline exceeded', Resource details: 'Resource: "/v1, Resource=services", GroupVersionKind: "/v1, Kind=Service"
Name: "retina-svc", Namespace: "kube-system"'
2024/08/23 15:53:02 Retryable error? true
2024/08/23 15:53:02 Retrying as current number of retries 0 less than max number of retries 30
runner.go:27:
Error Trace: /home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:65
Error: Received unexpected error:
did not expect error from step InstallHelmChart but got error: failed to install chart: context deadline exceeded
Test: TestE2ERetina
Let's triage these one by one.
For 1
The error is most likely coming from this step:
{
    Step: &kubernetes.ExecInPod{
        PodName:      podName,
        PodNamespace: "kube-system",
        Command:      req.Command,
    },
    Opts: &types.StepOptions{
        ExpectError:               req.ExpectError,
        SkipSavingParametersToJob: true,
    },
},
The ExecInPod step does not have any built-in retry mechanism, so any failure in the step calling this function will result in the test failing. We should add a retry strategy to make command execution resilient.
Potential tool: https://pkg.go.dev/k8s.io/client-go/util/retry
I also think we would benefit from adding this to almost all Kubernetes operations that have a chance of failing due to network issues. The same solution can be used for the 3rd one as well. A minimal sketch of what that could look like is below.
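Here is a rough sketch, assuming a hypothetical `execInPod` helper and a hand-rolled `isTransient` predicate (neither exists in the framework under these names); the only real API used is `retry.OnError` from client-go, and the backoff values are just placeholders:

```go
package main

import (
	"fmt"
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// execInPod is a hypothetical stand-in for the framework's ExecInPod step;
// the real step would call the Kubernetes exec API.
func execInPod(podName, namespace, command string) error {
	return fmt.Errorf("error dialing backend: EOF")
}

// isTransient is an assumed predicate for errors worth retrying
// (the "error dialing backend: EOF" from failure 1, connection refused, etc.).
func isTransient(err error) bool {
	msg := err.Error()
	return strings.Contains(msg, "EOF") || strings.Contains(msg, "connection refused")
}

func main() {
	// Retry the exec with exponential backoff instead of failing the test on the first hiccup.
	backoff := wait.Backoff{Steps: 4, Duration: 500 * time.Millisecond, Factor: 2.0, Jitter: 0.1}
	err := retry.OnError(backoff, isTransient, func() error {
		return execInPod("agnhost-pod-0", "kube-system", "nslookup kubernetes.default")
	})
	if err != nil {
		fmt.Println("exec still failing after retries:", err)
	}
}
```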
For 2
We already have retries on port forwarding. I can see if I can tune them, but other than that I don't see an obvious fix for this kind of intermittent issue. My idea would be to spread the retries out over a longer window in order to rule out network-related issues; see the sketch below.
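One way to spread the attempts, using only `wait.PollUntilContextTimeout` from apimachinery; the interval, timeout, and the simple HTTP probe are assumptions for illustration, not the framework's current values:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Poll the forwarded metrics endpoint every 10s for up to 5 minutes (assumed values),
	// so a brief network blip or slow port-forward setup does not fail the whole step.
	err := wait.PollUntilContextTimeout(context.Background(), 10*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			resp, err := http.Get("http://localhost:10093/metrics")
			if err != nil {
				return false, nil // treat as transient and keep polling
			}
			resp.Body.Close()
			return resp.StatusCode == http.StatusOK, nil
		})
	if err != nil {
		fmt.Println("metrics endpoint never became reachable:", err)
	}
}
```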
For 4
Maybe increase the time limit and see if the issue persists. The downside is that the test would be stuck for longer in case of an actual issue. A sketch of that change follows.
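The change is essentially just a longer context deadline around the step; `createNPMCluster` below is a hypothetical stand-in for the real CreateNPMCluster step, and 30 minutes is an assumed value, not a measured one:

```go
package main

import (
	"context"
	"log"
	"time"
)

// createNPMCluster is a hypothetical stand-in for the real CreateNPMCluster step.
func createNPMCluster(ctx context.Context) error { return nil }

func main() {
	// Assumed value: extend the deadline to 30 minutes.
	// Tradeoff: a genuinely broken run now blocks the pipeline for longer before failing.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	if err := createNPMCluster(ctx); err != nil {
		log.Fatalf("failed to create cluster: %v", err)
	}
}
```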
For 6
We are being throttled by the client-side rate limiter against the Kubernetes API server. I would reduce the polling frequency so that more retries fit within the step's deadline; we would have the same tradeoff as the previous one. I would also like to log the state of the Retina agents and maybe evaluate at runtime whether we need to end the test early. A sketch of the knobs involved is below.
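Two knobs that could help when "client rate limiter Wait returned an error" is the bottleneck, shown as a hedged sketch; the QPS/Burst numbers and the 10s poll interval are assumptions, and the only real APIs used are `clientcmd.BuildConfigFromFlags`, `rest.Config`'s `QPS`/`Burst` fields, and `kubernetes.NewForConfig`:

```go
package main

import (
	"log"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}

	// 1. Raise the client-side QPS/Burst (assumed values) so the resource status checks
	//    are not queued behind client-go's defaults (QPS 5, Burst 10).
	cfg.QPS = 50
	cfg.Burst = 100

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	_ = clientset // the real step would pass this clientset to the resource checks

	// 2. Poll the resource status less often (assumed 10s) so more retries fit
	//    inside the step's deadline without tripping the rate limiter.
	pollInterval := 10 * time.Second
	_ = pollInterval
}
```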
Number 6 was already addressed by increasing the timeout from 8 min to 20 min, so we don't need much there. I have addressed 1, 2 & 4 in PR #867.
We have fixed all the identified flakiness; closing the issue!