agent.TestSystemIntegrationRecipe is flaky
TestSystemIntegrationRecipe failed last night. This may be related to the upgrade of the stack version to 7.17.0.
- https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-aks/930/testReport/
- https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-kind-k8s-versions/666/testReport/
=== RUN TestSystemIntegrationRecipe/ES_data_should_pass_validations
Retries (30m0s timeout): ....................................................
step.go:43:
Error Trace: utils.go:87
Error: Received unexpected error:
hit count should be more than 0 for /metrics-system.fsstat-default/_search?q=!error.message:*
Test: TestSystemIntegrationRecipe/ES_data_should_pass_validations
{
"log.level":"error", "@timestamp":"2022-02-08T02:19:41.483Z", "message":"stopping early",
"service.version":"0.0.0-SNAPSHOT+00000000","service.type":"eck","ecs.version":"1.4.0",
"error":"test failure","error.stack_trace":"github.com/elastic/cloud-on-k8s/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:44\ngithub.com/elastic/cloud-on-k8s/test/e2e/test/helper.RunFile\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/helper/yaml.go:162\ngithub.com/elastic/cloud-on-k8s/test/e2e/agent.runAgentRecipe\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/recipes_test.go:226\ngithub.com/elastic/cloud-on-k8s/test/e2e/agent.TestSystemIntegrationRecipe\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/recipes_test.go:43\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1259"}
--- FAIL: TestSystemIntegrationRecipe/ES_data_should_pass_validations (1800.00s)
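For reference, the failing step boils down to a hit-count assertion against the fsstat data stream. Below is a minimal, self-contained sketch of that kind of check; the URL, index and function names are placeholders, not the actual E2E test helpers. The important detail is that q=!error.message:* only matches documents without an error.message field, so documents that exist but carry an error do not count.

// Illustrative only: a minimal version of the kind of
// "hit count should be more than 0" check the E2E validation performs.
// The Elasticsearch URL and index pattern are placeholders.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

type searchResponse struct {
	Hits struct {
		Total struct {
			Value int `json:"value"`
		} `json:"total"`
	} `json:"hits"`
}

// hitCount runs a Lucene query-string search and returns the total hit count.
func hitCount(esURL, index, query string) (int, error) {
	u := fmt.Sprintf("%s/%s/_search?size=0&q=%s", esURL, index, url.QueryEscape(query))
	resp, err := http.Get(u)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var sr searchResponse
	if err := json.NewDecoder(resp.Body).Decode(&sr); err != nil {
		return 0, err
	}
	return sr.Hits.Total.Value, nil
}

func main() {
	// Same query as in the failure: only documents *without* error.message count.
	n, err := hitCount("http://localhost:9200", "metrics-system.fsstat-default", "!error.message:*")
	if err != nil {
		panic(err)
	}
	if n == 0 {
		fmt.Println("validation would fail: no error-free fsstat documents")
	} else {
		fmt.Printf("validation would pass: %d documents\n", n)
	}
}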
According to the diagnostic, the data seems to be there:
# e2e-k0rz2-mercury/elasticsearch/elasticsearch-wz4z/cat/cat_indices.txt
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size sth
green open .ds-metrics-system.fsstat-default-2022.02.08-000001 36fLscccTXqKppgkXX0evQ 1 1 93 0 605.8kb 282kb false
I looked a bit more into the AKS failure.
The Elastic Agent log for metricbeat has the following error:
{"log.level":"error","@timestamp":"2022-02-23T15:34:24.797Z","log.origin":{"file.name":"module/wrapper.go","file.line":254},"message":"Error fetching data for metricset system.filesystem: error getting filesystem list: open /etc/mtab: no such file or directory","service.name":"metricbeat","ecs.version":"1.6.0"}
As far as I understand, AKS uses containerd, while GKE, where the test succeeds, still uses Docker as the container runtime. Going by this issue, it appears that containerd does not create the /etc/mtab symlink to /proc/mounts, which Docker does create. Beats uses /etc/mtab to figure out which filesystems are mounted, via a fork of Cloud Foundry's gosigar library: https://github.com/elastic/gosigar
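To see the failure mode outside of Beats, here is a small sketch (not the actual gosigar code path) that reads the mount table the way an /etc/mtab-based approach would, and falls back to /proc/self/mounts, which the kernel always provides:

// Illustration of the missing /etc/mtab on containerd-based nodes,
// as described above; not the gosigar implementation.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// mountedFilesystems returns the mount points listed in an mtab/mounts-format file.
func mountedFilesystems(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		// On the failing AKS nodes this is "open /etc/mtab: no such file or directory".
		return nil, err
	}
	defer f.Close()

	var mounts []string
	s := bufio.NewScanner(f)
	for s.Scan() {
		// Format: device mountpoint fstype options dump pass
		fields := strings.Fields(s.Text())
		if len(fields) >= 2 {
			mounts = append(mounts, fields[1])
		}
	}
	return mounts, s.Err()
}

func main() {
	mounts, err := mountedFilesystems("/etc/mtab")
	if err != nil {
		fmt.Println("reading /etc/mtab failed:", err)
		// /proc/self/mounts is provided by the kernel regardless of the container runtime.
		mounts, err = mountedFilesystems("/proc/self/mounts")
		if err != nil {
			panic(err)
		}
	}
	fmt.Println("mount points:", mounts)
}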
I assume that OCP with CRI-O might also be affected.
Still trying to understand why this only affects certain tests. For example, the standalone Metricbeat version of these tests passes. The fsstat metricset still does not work, but the error itself is ingested into Elasticsearch and produces an event in the relevant data stream:
k8surl GET _ "*beat*/_search?q=event.dataset:system.fsstat"
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 5.5984216,
    "hits": [
      {
        "_index": ".ds-metricbeat-8.0.0-2022.02.28-000001",
        "_id": "lsZqQH8BejqcX6edR4ae",
        "_score": 5.5984216,
        "_source": {
          "@timestamp": "2022-02-28T12:59:32.097Z",
          "metricset": {
            "name": "fsstat",
            "period": 60000
          },
          "event": {
            "module": "system",
            "duration": 57001,
            "dataset": "system.fsstat"
          },
          "service": {
            "type": "system"
          },
          "error": {
            "message": "filesystem list: open /etc/mtab: no such file or directory"
          },
          "ecs": {
            "version": "8.0.0"
          },
          "host": {
            "name": "...",
            "architecture": "x86_64",
            "os": {
              "platform": "ubuntu",
              "version": "20.04.3 LTS (Focal Fossa)",
              "family": "debian",
              "name": "Ubuntu",
              "kernel": "5.4.0-1068-azure",
              "codename": "focal",
              "type": "linux"
            },
            "containerized": true,
            "ip": [
              ...
            ],
            "mac": [
              ...
            ],
            "hostname": "..."
          },
          "agent": {
            "name": "...",
            "type": "metricbeat",
            "version": "8.0.0",
            "ephemeral_id": "51a0f243-8e6b-4f09-88b1-e47284e2c43a",
            "id": "83c88b0b-da17-4094-ae77-6ce711d3d0a5"
          },
          "cloud": {
            "machine": {
              "type": "Standard_D8s_v3"
            },
            "service": {
              "name": "Virtual Machines"
            },
            "region": "...",
            "provider": "azure",
            "account": {},
            "instance": {
              "id": "...",
              "name": "..."
            }
          }
        }
      }
    ]
  }
}
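One way to dig further is to compare the total number of system.fsstat documents with the number that have no error.message, which is effectively what the failing q=!error.message:* validation counts. A minimal sketch of that comparison, with the endpoint and index as placeholders:

// Illustrative only: compare all fsstat documents against those without
// error.message, using the DSL equivalent of the Lucene query !error.message:*.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// count runs a _count request with the given query body.
func count(esURL, index string, query map[string]interface{}) (int, error) {
	body, err := json.Marshal(map[string]interface{}{"query": query})
	if err != nil {
		return 0, err
	}
	resp, err := http.Post(esURL+"/"+index+"/_count", "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var out struct {
		Count int `json:"count"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return out.Count, nil
}

func main() {
	const es, index = "http://localhost:9200", "metrics-system.fsstat-default"

	all, err := count(es, index, map[string]interface{}{
		"term": map[string]interface{}{"event.dataset": "system.fsstat"},
	})
	if err != nil {
		panic(err)
	}
	// Bool query excluding documents that carry an error.message field.
	noError, err := count(es, index, map[string]interface{}{
		"bool": map[string]interface{}{
			"filter":   map[string]interface{}{"term": map[string]interface{}{"event.dataset": "system.fsstat"}},
			"must_not": map[string]interface{}{"exists": map[string]interface{}{"field": "error.message"}},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("fsstat documents: %d, without error.message: %d\n", all, noError)
}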
Closing because it seems to have been a transient problem that has not happened again in over a year.