agent.TestSystemIntegrationRecipe is flaky
TestSystemIntegrationRecipe failed last night. This may be related to the upgrade of the stack version to 7.17.0.
- https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-aks/930/testReport/
- https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-kind-k8s-versions/666/testReport/
=== RUN TestSystemIntegrationRecipe/ES_data_should_pass_validations
Retries (30m0s timeout): ....................................................
step.go:43:
Error Trace: utils.go:87
Error: Received unexpected error:
hit count should be more than 0 for /metrics-system.fsstat-default/_search?q=!error.message:*
Test: TestSystemIntegrationRecipe/ES_data_should_pass_validations
{
"log.level":"error", "@timestamp":"2022-02-08T02:19:41.483Z", "message":"stopping early",
"service.version":"0.0.0-SNAPSHOT+00000000","service.type":"eck","ecs.version":"1.4.0",
"error":"test failure","error.stack_trace":"github.com/elastic/cloud-on-k8s/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:44\ngithub.com/elastic/cloud-on-k8s/test/e2e/test/helper.RunFile\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/helper/yaml.go:162\ngithub.com/elastic/cloud-on-k8s/test/e2e/agent.runAgentRecipe\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/recipes_test.go:226\ngithub.com/elastic/cloud-on-k8s/test/e2e/agent.TestSystemIntegrationRecipe\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/recipes_test.go:43\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1259"}
--- FAIL: TestSystemIntegrationRecipe/ES_data_should_pass_validations (1800.00s)
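For reference, the failing step boils down to a hit-count assertion against the fsstat data stream. Below is a minimal, self-contained sketch of that kind of check; the URL, index and function names are placeholders, not the actual E2E test helpers. The important detail is that q=!error.message:* only matches documents without an error.message field, so documents that exist but carry an error do not count.

// Illustrative only: a minimal version of the kind of
// "hit count should be more than 0" check the E2E validation performs.
// The Elasticsearch URL and index pattern are placeholders.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

type searchResponse struct {
	Hits struct {
		Total struct {
			Value int `json:"value"`
		} `json:"total"`
	} `json:"hits"`
}

// hitCount runs a Lucene query-string search and returns the total hit count.
func hitCount(esURL, index, query string) (int, error) {
	u := fmt.Sprintf("%s/%s/_search?size=0&q=%s", esURL, index, url.QueryEscape(query))
	resp, err := http.Get(u)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var sr searchResponse
	if err := json.NewDecoder(resp.Body).Decode(&sr); err != nil {
		return 0, err
	}
	return sr.Hits.Total.Value, nil
}

func main() {
	// Same query as in the failure: only documents *without* error.message count.
	n, err := hitCount("http://localhost:9200", "metrics-system.fsstat-default", "!error.message:*")
	if err != nil {
		panic(err)
	}
	if n == 0 {
		fmt.Println("validation would fail: no error-free fsstat documents")
	} else {
		fmt.Printf("validation would pass: %d documents\n", n)
	}
}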
According to the diagnostic, the data seems to be there:
# e2e-k0rz2-mercury/elasticsearch/elasticsearch-wz4z/cat/cat_indices.txt
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size sth
green open .ds-metrics-system.fsstat-default-2022.02.08-000001 36fLscccTXqKppgkXX0evQ 1 1 93 0 605.8kb 282kb false
I looked a bit more into the AKS failure.
The Elastic Agent log for metricbeat has the following error:
{"log.level":"error","@timestamp":"2022-02-23T15:34:24.797Z","log.origin":{"file.name":"module/wrapper.go","file.line":254},"message":"Error fetching data for metricset system.filesystem: error getting filesystem list: open /etc/mtab: no such file or directory","service.name":"metricbeat","ecs.version":"1.6.0"}
As far as I understand, AKS uses containerd, while GKE, where the test succeeds, still uses Docker as the container runtime. Going by this issue, it appears that containerd does not create the /etc/mtab symlink to /proc/mounts, which Docker does create. Beats uses /etc/mtab to figure out which filesystems are mounted, via a fork of Cloud Foundry's gosigar library: https://github.com/elastic/gosigar
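To see the failure mode outside of Beats, here is a small sketch (not the actual gosigar code path) that reads the mount table the way an /etc/mtab-based approach would, and falls back to /proc/self/mounts, which the kernel always provides:

// Illustration of the missing /etc/mtab on containerd-based nodes,
// as described above; not the gosigar implementation.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// mountedFilesystems returns the mount points listed in an mtab/mounts-format file.
func mountedFilesystems(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		// On the failing AKS nodes this is "open /etc/mtab: no such file or directory".
		return nil, err
	}
	defer f.Close()

	var mounts []string
	s := bufio.NewScanner(f)
	for s.Scan() {
		// Format: device mountpoint fstype options dump pass
		fields := strings.Fields(s.Text())
		if len(fields) >= 2 {
			mounts = append(mounts, fields[1])
		}
	}
	return mounts, s.Err()
}

func main() {
	mounts, err := mountedFilesystems("/etc/mtab")
	if err != nil {
		fmt.Println("reading /etc/mtab failed:", err)
		// /proc/self/mounts is provided by the kernel regardless of the container runtime.
		mounts, err = mountedFilesystems("/proc/self/mounts")
		if err != nil {
			panic(err)
		}
	}
	fmt.Println("mount points:", mounts)
}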
I assume that OCP with CRI-O might also be affected.
Still trying to understand why this only affects certain tests. For example, the standalone Metricbeat version of these tests passes. The fsstat metricset still does not work, but the error itself is ingested into Elasticsearch and produces an event in the relevant data stream:
k8surl GET _ "*beat*/_search?q=event.dataset:system.fsstat"
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 5.5984216,
    "hits": [
      {
        "_index": ".ds-metricbeat-8.0.0-2022.02.28-000001",
        "_id": "lsZqQH8BejqcX6edR4ae",
        "_score": 5.5984216,
        "_source": {
          "@timestamp": "2022-02-28T12:59:32.097Z",
          "metricset": {
            "name": "fsstat",
            "period": 60000
          },
          "event": {
            "module": "system",
            "duration": 57001,
            "dataset": "system.fsstat"
          },
          "service": {
            "type": "system"
          },
          "error": {
            "message": "filesystem list: open /etc/mtab: no such file or directory"
          },
          "ecs": {
            "version": "8.0.0"
          },
          "host": {
            "name": "...",
            "architecture": "x86_64",
            "os": {
              "platform": "ubuntu",
              "version": "20.04.3 LTS (Focal Fossa)",
              "family": "debian",
              "name": "Ubuntu",
              "kernel": "5.4.0-1068-azure",
              "codename": "focal",
              "type": "linux"
            },
            "containerized": true,
            "ip": [
              ...
            ],
            "mac": [
              ...
            ],
            "hostname": "..."
          },
          "agent": {
            "name": "...",
            "type": "metricbeat",
            "version": "8.0.0",
            "ephemeral_id": "51a0f243-8e6b-4f09-88b1-e47284e2c43a",
            "id": "83c88b0b-da17-4094-ae77-6ce711d3d0a5"
          },
          "cloud": {
            "machine": {
              "type": "Standard_D8s_v3"
            },
            "service": {
              "name": "Virtual Machines"
            },
            "region": "...",
            "provider": "azure",
            "account": {},
            "instance": {
              "id": "...",
              "name": "..."
            }
          }
        }
      }
    ]
  }
}
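One way to dig further is to compare the total number of system.fsstat documents with the number that have no error.message, which is effectively what the failing q=!error.message:* validation counts. A minimal sketch of that comparison, with the endpoint and index as placeholders:

// Illustrative only: compare all fsstat documents against those without
// error.message, using the DSL equivalent of the Lucene query !error.message:*.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// count runs a _count request with the given query body.
func count(esURL, index string, query map[string]interface{}) (int, error) {
	body, err := json.Marshal(map[string]interface{}{"query": query})
	if err != nil {
		return 0, err
	}
	resp, err := http.Post(esURL+"/"+index+"/_count", "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var out struct {
		Count int `json:"count"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return out.Count, nil
}

func main() {
	const es, index = "http://localhost:9200", "metrics-system.fsstat-default"

	all, err := count(es, index, map[string]interface{}{
		"term": map[string]interface{}{"event.dataset": "system.fsstat"},
	})
	if err != nil {
		panic(err)
	}
	// Bool query excluding documents that carry an error.message field.
	noError, err := count(es, index, map[string]interface{}{
		"bool": map[string]interface{}{
			"filter":   map[string]interface{}{"term": map[string]interface{}{"event.dataset": "system.fsstat"}},
			"must_not": map[string]interface{}{"exists": map[string]interface{}{"field": "error.message"}},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("fsstat documents: %d, without error.message: %d\n", all, noError)
}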
Closing because it seems to have been a transient problem that has not happened again in over a year.