
TestKubernetesIntegrationRecipe is flaky in IPv6 Kind cluster

Open pebrc opened this issue 4 years ago • 6 comments

        	Error Trace:	utils.go:84
        	Error:      	Received unexpected error:
        	            	404 Not Found: no such index [metrics-kubernetes.event-k8s]
        	Test:       	TestKubernetesIntegrationRecipe/ES_data_should_pass_validations
{"log.level":"error","@timestamp":"2021-01-07T01:56:27.485Z","message":"stopping early","service.version":"0.0.0-SNAPSHOT+00000000","service.type":"eck","ecs.version":"1.4.0","error":"test failure","error.stack_trace":"github.com/elastic/cloud-on-k8s/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:43\ngithub.com/elastic/cloud-on-k8s/test/e2e/test/helper.RunFile\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/helper/yaml.go:156\ngithub.com/elastic/cloud-on-k8s/test/e2e/agent.runBeatRecipe\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/recipes_test.go:96\ngithub.com/elastic/cloud-on-k8s/test/e2e/agent.TestKubernetesIntegrationRecipe\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/recipes_test.go:58\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1123"}
    --- FAIL: TestKubernetesIntegrationRecipe/ES_data_should_pass_validations (1800.00s)

or

  utils.go:84: 
        	Error Trace:	utils.go:84
        	Error:      	Received unexpected error:
        	            	404 Not Found: no such index [metrics-kubernetes.apiserver-k8s]
        	Test:       	TestKubernetesIntegrationRecipe/ES_data_should_pass_validations
{"log.level":"error","@timestamp":"2021-01-07T02:05:29.982Z","message":"stopping early","service.version":"0.0.0-SNAPSHOT+00000000","service.type":"eck","ecs.version":"1.4.0","error":"test failure","error.stack_trace":"github.com/elastic/cloud-on-k8s/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:43\ngithub.com/elastic/cloud-on-k8s/test/e2e/test/helper.RunFile\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/helper/yaml.go:156\ngithub.com/elastic/cloud-on-k8s/test/e2e/agent.runBeatRecipe\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/recipes_test.go:96\ngithub.com/elastic/cloud-on-k8s/test/e2e/agent.TestKubernetesIntegrationRecipe\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/recipes_test.go:58\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1123"}
    --- FAIL: TestKubernetesIntegrationRecipe/ES_data_should_pass_validations (1800.00s)

pebrc avatar Jan 07 '21 09:01 pebrc

I believe the issue is that data streams are created lazily: if no k8s event occurs after the agent is deployed with the integration, the corresponding data stream is never created. I was able to reproduce this manually with the same recipe, and as soon as I generated an event the data stream appeared.

I am not sure, however, why the same thing would happen for the kubernetes.apiserver dataset, as I would expect enough activity, if only from the operator itself, to generate events.

pebrc avatar Jan 07 '21 13:01 pebrc

After disabling the kubernetes.events and kubernetes.apiserver data sets, we still see failures like the following:

 utils.go:84: 
        	Error Trace:	utils.go:84
        	Error:      	Received unexpected error:
        	            	hit count should be more than 0 for /metrics-kubernetes.container-k8s/_search?q=!error.message:*
        	Test:       	TestKubernetesIntegrationRecipe/ES_data_should_pass_validations
{"log.level":"error","@timestamp":"2021-01-12T02:11:39.915Z","message":"stopping early","service.version":"0.0.0-SNAPSHOT+00000000","service.type":"eck","ecs.version":"1.4.0","error":"test failure","error.stack_trace":"github.com/elastic/cloud-on-k8s/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:43\ngithub.com/elastic/cloud-on-k8s/test/e2e/test/helper.RunFile\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/helper/yaml.go:156\ngithub.com/elastic/cloud-on-k8s/test/e2e/agent.runBeatRecipe\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/recipes_test.go:99\ngithub.com/elastic/cloud-on-k8s/test/e2e/agent.TestKubernetesIntegrationRecipe\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/recipes_test.go:61\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1123"}
    --- FAIL: TestKubernetesIntegrationRecipe/ES_data_should_pass_validations (1800.00s)

pebrc avatar Jan 12 '21 09:01 pebrc

Looks like the failure is limited to the IPv6 Kind cluster. I'll take a look.

charith-elastic avatar Jan 13 '21 09:01 charith-elastic

Problem appears to be a DNS resolution failure:

error doing HTTP request to fetch 'container' Metricset data: error making http request: Get \"https://eck-e2e-worker3:10250/stats/summary\": lookup eck-e2e-worker3 on [fd00:10:96::a]:53: server misbehaving

Could be https://github.com/kubernetes/kubernetes/issues/39980 or some other configuration problem in Kind itself. It took me a few tries to reproduce the problem so it feels like a timing issue that only occurs under certain circumstances. I'll take a look with fresh eyes tomorrow to see if I can figure out a way to drill down to the root cause.

The test is looking for records in the data stream index that are not errors. Presumably, any record written to the index is proof that Agent is working in some capacity and that the deployment has worked. So a short-term solution to this test failure would be to remove that filter.

charith-elastic avatar Jan 13 '21 15:01 charith-elastic

> Could be kubernetes/kubernetes#39980 or some other configuration problem in Kind itself.

I have raised https://github.com/elastic/cloud-on-k8s/issues/4117. I wonder whether moving to a more recent version of Kind will help here, or whether that's grasping at straws.

pebrc avatar Jan 14 '21 08:01 pebrc

I tried with Kind 0.9.0 as well, without any luck. After looking through the Kind issues, https://github.com/kubernetes/kubernetes/issues/94794#issuecomment-696617958 makes a lot of sense to me. I think we need to run Kind on an IPv6-enabled host in order for node name resolution inside the cluster to work.

charith-elastic avatar Jan 14 '21 13:01 charith-elastic

Closing this; let's reopen if needed.

pebrc avatar May 23 '24 07:05 pebrc