[Metricbeat][Kubernetes] Share watchers between metricsets

Open constanca-m opened this issue 1 year ago • 33 comments

Proposed commit message

  • WHAT: metricsets share watchers between each other, instead of each having their own.
  • WHY: Please check the issue https://github.com/elastic/beats/issues/37243.

Details

The watchers needed for each resource are defined in the getExtraWatchers function. The table in the Expected watchers section of this issue lists which ones are required.

Note: only state_resourcequota does not have the expected watchers with this change. This is because we need to change the implementation of that metricset first.

We have a global map that saves all the watchers:

type watchers struct {
	watchersMap map[string]*watcherData
	lock        sync.RWMutex
}

The key to this map is the resource name, and the values are defined as:

type watcherData struct {
	metricsetsUsing []string // list of metricsets using this watcher

	watcher kubernetes.Watcher
	started bool // true if watcher has started, false otherwise

	enrichers map[string]*enricher // map of enrichers using this watcher. The key is the metricset name

	metadataObjects map[string]bool // map of ids of each object received by the handler functions
}
  • metricsetsUsing contains the list of metricsets that are using this watcher. We need it because the watchers start and stop when an enricher calls Start() or Stop(): a watcher cannot be started more than once, and it is only stopped once the list of metricsets using it is empty. We track metricset names to avoid conflicts between metricsets that use the same resource, like state_pod and pod.
  • watcher is the kubernetes watcher for the resource.
  • started tells us whether the watcher has started. This is mainly needed for enricher.Start() and for testing purposes.
  • enrichers is the map of enrichers using this watcher, keyed by metricset name.
  • metadataObjects is the set of ids of the objects received by the resource event handler functions. See point 6.2 in the next list for why this is necessary.
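The start/stop rule described for metricsetsUsing can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: fakeWatcher, startFor, stopFor and contains are hypothetical names, the watcher is a stub that only counts calls, and registering the metricset is folded into startFor for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeWatcher is a hypothetical stand-in for kubernetes.Watcher.
// It only counts how often Start and Stop are called.
type fakeWatcher struct {
	starts, stops int
}

func (w *fakeWatcher) Start() error { w.starts++; return nil }
func (w *fakeWatcher) Stop()        { w.stops++ }

// watcherData is reduced to the fields needed for start/stop bookkeeping.
type watcherData struct {
	metricsetsUsing []string
	watcher         *fakeWatcher
	started         bool
}

type watchers struct {
	watchersMap map[string]*watcherData
	lock        sync.RWMutex
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// startFor (hypothetical) records metricset as a user of the resource
// watcher and starts the watcher only if it is not already running.
func (w *watchers) startFor(resource, metricset string) error {
	w.lock.Lock()
	defer w.lock.Unlock()
	data, ok := w.watchersMap[resource]
	if !ok {
		return fmt.Errorf("no watcher for resource %q", resource)
	}
	if !contains(data.metricsetsUsing, metricset) {
		data.metricsetsUsing = append(data.metricsetsUsing, metricset)
	}
	if data.started {
		return nil
	}
	if err := data.watcher.Start(); err != nil {
		return err
	}
	data.started = true
	return nil
}

// stopFor (hypothetical) removes metricset from the users list and stops
// the watcher only when no metricset is using it any more.
func (w *watchers) stopFor(resource, metricset string) {
	w.lock.Lock()
	defer w.lock.Unlock()
	data, ok := w.watchersMap[resource]
	if !ok {
		return
	}
	kept := data.metricsetsUsing[:0]
	for _, m := range data.metricsetsUsing {
		if m != metricset {
			kept = append(kept, m)
		}
	}
	data.metricsetsUsing = kept
	if len(kept) == 0 && data.started {
		data.watcher.Stop()
		data.started = false
	}
}

func main() {
	w := &watchers{watchersMap: map[string]*watcherData{
		"pod": {watcher: &fakeWatcher{}},
	}}
	_ = w.startFor("pod", "pod")
	_ = w.startFor("pod", "state_pod")
	w.stopFor("pod", "pod")
	fmt.Println(w.watchersMap["pod"].started) // state_pod still uses the watcher
	w.stopFor("pod", "state_pod")
	fmt.Println(w.watchersMap["pod"].started) // no users left, watcher stopped
}
```

In this sketch the pod watcher is started once even though two metricsets ask for it, and it is stopped only after both pod and state_pod have released it.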

The algorithm goes like this when NewResourceMetadataEnricher is called:

  1. The configuration is validated. It will return a nil enricher if it fails.
  2. We create the configuration needed for the metadata generator. It will return a nil enricher if it fails.
  3. We create the K8s client. It will return a nil enricher if it fails.
  4. We start all the watchers:
    1. We first check if the resource exists. If it fails, we stop.
    2. We build the kubernetes.WatchOptions{} needed for the watcher. If it fails, we stop.
    3. We start the watcher for this specific resource:
      1. We first check if the watcher is already created.
      2. If it is, then we don't do anything.
      3. Otherwise, we create a new watcher and put it in the map with key = resource name.
      4. We add this metricset to the list of metricsets we have that are using this watcher.
    4. We get all needed extra resources for this resource, and repeat step 3.
  5. We create the metadata generators.
  6. Lastly, create the enricher. Considerations:
    1. Because each watcher only has one function for UpdateFunc / AddFunc and DeleteFunc, we need to save which metricsets and respective enrichers need that handler function. For this, we keep track of the enrichers using a map, and iterate over that map when one of the functions is triggered.
    2. It is possible that AddFunc is called for one metricset first, and when the other metricset starts, AddFunc is no longer triggered. To avoid losing metadata, we keep the map metadataObjects, which saves the id of each object that triggered the handler function. When an enricher is created, we iterate over this map and, using the ids saved there, get each object from the watcher store. We then call the update function with that object, which ensures all enrichers have up-to-date metadata.
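Points 6.1 and 6.2 can be sketched as below. This is only an illustration under simplifying assumptions: object, sharedWatcher, onAdd and addEnricher are hypothetical names, the watcher store is a plain map, and the enricher just records labels per object id.

```go
package main

import (
	"fmt"
	"sync"
)

// object is a hypothetical stand-in for a Kubernetes resource in the watcher store.
type object struct {
	id     string
	labels map[string]string
}

// enricher keeps the metadata seen by one metricset, keyed by object id.
type enricher struct {
	metadata map[string]map[string]string
}

func (e *enricher) update(o object) { e.metadata[o.id] = o.labels }

// sharedWatcher has a single set of handlers shared by all metricsets.
type sharedWatcher struct {
	lock            sync.Mutex
	store           map[string]object    // stand-in for the watcher store
	enrichers       map[string]*enricher // key: metricset name
	metadataObjects map[string]bool      // ids received by the handlers
}

func newSharedWatcher() *sharedWatcher {
	return &sharedWatcher{
		store:           map[string]object{},
		enrichers:       map[string]*enricher{},
		metadataObjects: map[string]bool{},
	}
}

// onAdd plays the role of AddFunc/UpdateFunc: it remembers the object id
// and fans the event out to every registered enricher (point 6.1).
func (w *sharedWatcher) onAdd(o object) {
	w.lock.Lock()
	defer w.lock.Unlock()
	w.store[o.id] = o
	w.metadataObjects[o.id] = true
	for _, e := range w.enrichers {
		e.update(o)
	}
}

// addEnricher registers an enricher for metricset and replays every object
// already seen, so a late enricher does not miss earlier events (point 6.2).
func (w *sharedWatcher) addEnricher(metricset string) *enricher {
	w.lock.Lock()
	defer w.lock.Unlock()
	e := &enricher{metadata: map[string]map[string]string{}}
	for id := range w.metadataObjects {
		if o, ok := w.store[id]; ok {
			e.update(o)
		}
	}
	w.enrichers[metricset] = e
	return e
}

func main() {
	w := newSharedWatcher()
	pod := w.addEnricher("pod")
	w.onAdd(object{id: "default/nginx", labels: map[string]string{"app": "nginx"}})
	statePod := w.addEnricher("state_pod") // created after the add event
	fmt.Println(pod.metadata["default/nginx"]["app"], statePod.metadata["default/nginx"]["app"])
}
```

Here state_pod is registered after the add event has fired, yet it still ends up with the same metadata as pod because of the replay in addEnricher.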

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Clone this branch.
  2. Follow the steps of this README file to launch metricbeat with the changes.
  3. Check it is working as expected.

Related issues

  • Relates to https://github.com/elastic/beats/issues/37243.

Results

Metricbeat

These results come from running metricbeat with this configuration for the kubernetes module (all metricsets that launch watchers are enabled, the others are not):
metricbeat.autodiscover:
  providers:
    - type: kubernetes
      scope: cluster
      node: ${NODE_NAME}
      unique: true
      templates:
        - config:
            - module: kubernetes
              hosts: ["kube-state-metrics:8080"]
              period: 10s
            #  #add_metadata: true
              metricsets:
                - state_node
                - state_deployment
                - state_daemonset
                - state_replicaset
                - state_pod
                - state_container
                - state_cronjob
                - state_job
                #- state_resourcequota
                - state_statefulset
                - state_service
                - state_persistentvolume
                - state_persistentvolumeclaim
                - state_storageclass
                - state_namespace
            - module: kubernetes
              metricsets:
                - node
                - pod
                - container
              period: 10s
              host: ${NODE_NAME}
              hosts: ["https://${NODE_NAME}:10250"]
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              ssl.verification_mode: "none"

The logs for the watchers initialization will look like this (only message field displayed for simplicity):

{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.094Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher node successfully, created by node.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.116Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher pod successfully, created by pod.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.116Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":368},"message":"Created watcher state_namespace successfully, created by pod.","service.name":"metricbeat","ecs.version":"1.6.0"}                <-------------------------------------
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.191Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher deployment successfully, created by state_deployment.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.206Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher daemonset successfully, created by state_daemonset.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.516Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher replicaset successfully, created by state_replicaset.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.118Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher cronjob successfully, created by state_cronjob.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.126Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher statefulset successfully, created by state_statefulset.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.132Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher service successfully, created by state_service.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.139Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher persistentvolume successfully, created by state_persistentvolume.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.146Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher persistentvolumeclaim successfully, created by state_persistentvolumeclaim.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.152Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher storageclass successfully, created by state_storageclass.","service.name":"metricbeat","ecs.version":"1.6.0"}

Notice the line with <----------: the pod metricset created the watcher for namespace, since namespace is one of its required resources and that watcher did not exist yet. This is also why we don't see the line "message":"Created watcher state_namespace successfully, created by state_namespace.": by the time state_namespace iterates over its needed watchers, they have already been created.
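The ordering effect can be reproduced with a small sketch. It is an illustration only: the extraWatchers table is a hypothetical stand-in for getExtraWatchers (only the pod to state_namespace dependency is taken from the log above), and this createAllWatchers is a simplified function that returns the messages instead of logging them.

```go
package main

import "fmt"

// extraWatchers is a hypothetical stand-in for getExtraWatchers. Only the
// pod -> state_namespace dependency is taken from the log shown above.
var extraWatchers = map[string][]string{
	"pod": {"state_namespace"},
}

// createAllWatchers creates the watcher for resource and for its extra
// resources, skipping any that already exist in the shared map. It returns
// one message per watcher it actually created.
func createAllWatchers(created map[string]bool, resource, metricset string) []string {
	var msgs []string
	needed := append([]string{resource}, extraWatchers[resource]...)
	for _, r := range needed {
		if created[r] {
			continue // another metricset already created this watcher
		}
		created[r] = true
		msgs = append(msgs, fmt.Sprintf("Created watcher %s successfully, created by %s.", r, metricset))
	}
	return msgs
}

func main() {
	created := map[string]bool{}
	for _, m := range createAllWatchers(created, "pod", "pod") {
		fmt.Println(m)
	}
	// state_namespace finds its watcher already in the map, so it creates nothing.
	fmt.Println(len(createAllWatchers(created, "state_namespace", "state_namespace")))
}
```

Whichever metricset reaches a resource first is the one named in the "created by" part of the message; later metricsets reuse the existing watcher silently.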

In Discover: image

Elastic Agent

These results come from running EA with this standalone manifest, but with the custom image.

Logs:

These are the logs for starting the watchers (working as expected).
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.051Z","message":"Started watcher statefulset successfully, created by statefulset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.051Z","message":"Started watcher state_namespace successfully, created by statefulset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":321,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.257Z","message":"Started watcher node successfully, created by node.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.361Z","message":"Started watcher pod successfully, created by pod.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.470Z","message":"Started watcher deployment successfully, created by deployment.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.577Z","message":"Started watcher persistentvolume successfully, created by persistentvolume.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.679Z","message":"Started watcher persistentvolumeclaim successfully, created by persistentvolumeclaim.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.782Z","message":"Started watcher replicaset successfully, created by replicaset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.887Z","message":"Started watcher service successfully, created by service.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.017Z","message":"Started watcher storageclass successfully, created by storageclass.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.120Z","message":"Started watcher cronjob successfully, created by cronjob.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.225Z","message":"Started watcher daemonset successfully, created by daemonset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.329Z","message":"Started watcher job successfully, created by job.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}

The results for the dashboards are below (a checked box means the dashboard still works):

  • [x] [Metrics Kubernetes] Cronjobs
  • [x] [Metrics Kubernetes] StatefulSets
  • [x] [Metrics Kubernetes] Pods
  • [x] [Metrics Kubernetes] Deployments
  • [x] [Metrics Kubernetes] DaemonSets
  • [x] [Metrics Kubernetes] Jobs
  • [x] [Metrics Kubernetes] Nodes
  • [x] [Metrics Kubernetes] PV/PVC
  • [x] [Metrics Kubernetes] Cluster Overview
  • [ ] [Metrics Kubernetes] Services - It is broken, but this is unrelated to the changes in this PR.

Note: only dashboards for resources that launch watchers are considered. There were no changes in the others.

Notes for testing

Important things to consider when testing this PR code changes:

  • The changes in this PR only affect the kubernetes module metricsets that use metadata enrichment. These are state_namespace, state_node, state_deployment, state_daemonset, state_replicaset, state_pod, state_container, state_job, state_cronjob, state_statefulset, state_service, state_persistentvolume, state_persistentvolumeclaim, state_storageclass, pod, container and node.
  • Everything that was working before these changes should still be working. The changes only reduce the number of watchers created by the different metricsets, thus reducing the number of k8s API calls.
  • Thorough regression testing is needed. In more detail:
    a. All events coming from the affected metricsets in Kibana should be enriched with the resource's own metadata (labels, annotations) and, when applicable, with kubernetes node metadata and kubernetes namespace metadata.
    b. When new metadata (like a new label) is added to a resource (e.g. a pod), the new events from the related metricsets (pod, container, state_pod, state_container) should contain the new metadata.
    c. When a new label or annotation is added to a node or namespace, the events from the relevant metricsets (state_node, node or state_namespace) should include the new metadata.
    d. The events of the rest of the metricsets (e.g. state_pod or state_deployment) coming from resources in the updated node/namespace won't get the updated node or namespace metadata out of the box.
    e. For those events to be updated, the metadata of these resources must be updated first. For example, if a node is labeled, the pods of that node won't get the new node label immediately. To get it, we should also add a label to these pods to trigger a watcher event. The new events will then include the new pod label and the node label.
    f. Test with addition/removal of metadata on pods that run on the leader node and also on non-leader nodes.

constanca-m avatar Dec 07 '23 09:12 constanca-m

This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @constanca-m? 🙏. For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

mergify[bot] avatar Dec 07 '23 09:12 mergify[bot]

:grey_exclamation: Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

Build stats

  • Start Time: 2023-12-07T09:16:30.968+0000

  • Duration: 8 min 49 sec

:robot: GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine avatar Dec 07 '23 09:12 elasticmachine

:grey_exclamation: Build Aborted

Either there was a build timeout or someone aborted the build.

Build stats

  • Duration: 22 min 52 sec

elasticmachine avatar Dec 07 '23 09:12 elasticmachine

:grey_exclamation: Build Aborted

Either there was a build timeout or someone aborted the build.

Build stats

  • Duration: 18 min 18 sec

elasticmachine avatar Dec 07 '23 09:12 elasticmachine

:green_heart: Build Succeeded

Build stats

  • Duration: 50 min 17 sec

:grey_exclamation: Flaky test report

No test was executed to be analysed.

elasticmachine avatar Dec 07 '23 10:12 elasticmachine

  • This PR will also need testing with Agent for sure. We need to build the agent, repeat the same tests and see if we don't break anything.

  • Also I would need to run some E2E tests to see that metadata enrichment is ok.

  • We will need to decide what a configuration like eg add_resource_metadata.namespace.enabled: false will do in our case

gizas avatar Dec 07 '23 10:12 gizas

We will need to decide what a configuration like eg add_resource_metadata.namespace.enabled: false will do in our case

I think maybe we should move those new decisions to a new PR @gizas

constanca-m avatar Dec 07 '23 11:12 constanca-m

Without having checked the code line by line, I believe that this approach is not aligned with kubeStateMetricsCache and kubeletStatsCache approach where we try to solve a similar issue.

I think this approach still works this way and does basically the same thing.

It is harder to use util/kubernetes with the kubernetes in the parent folder, because we would have the import cycle error in go at all times. The only workaround I could find for that would be to pass the functions as parameters, but it is very hard to read the code that way.

I added unit tests for every function, and they work just fine. @MichaelKatsoulis

Edit: It is the same approach we are already using for state_metricset shared map.

constanca-m avatar Dec 07 '23 11:12 constanca-m

It is harder to use util/kubernetes with the kubernetes in the parent folder, because we would have the import cycle error in go at all times

You need to define a watchersCache in utils like we do with MetricsRepo https://github.com/elastic/beats/blob/a8d1567d928680947f5868a1fe94851698f80b11/metricbeat/module/kubernetes/kubernetes.go#L88 and https://github.com/elastic/beats/blob/a8d1567d928680947f5868a1fe94851698f80b11/metricbeat/module/kubernetes/util/metrics_repo.go#L71

Then its pointer can be passed to NewResourceMetadataEnricher like metricsRepo.

Did you test how many watchers are created under the hood with e2e tests? Did you test elastic-agent and metricbeat?

MichaelKatsoulis avatar Dec 07 '23 14:12 MichaelKatsoulis

Did you test how many watchers are created under the hood with e2e tests? Did you test elastic-agent and metricbeat?

Yes, the number of watchers is correct. I posted the results from running metricbeat in the Results section of the description. I also added the unit tests. EA was also tested, and the results are now in the description. @MichaelKatsoulis

Is there any specific test I should run? Any specific situation?

constanca-m avatar Dec 07 '23 15:12 constanca-m

You need to define a watchersCache in utils like we do with MetricsRepo

I updated the code so now it works like this @MichaelKatsoulis

constanca-m avatar Dec 27 '23 09:12 constanca-m

:green_heart: Build Succeeded

Build stats

  • Start Time: 2024-01-08T08:11:22.455+0000

  • Duration: 56 min 0 sec

Test stats :test_tube:

Test Results
Failed 0
Passed 4573
Skipped 902
Total 5475

:green_heart: Flaky test report

Tests succeeded.

elasticmachine avatar Jan 08 '24 09:01 elasticmachine

:green_heart: Build Succeeded

Build stats

  • Start Time: 2024-01-26T13:11:56.339+0000

  • Duration: 51 min 13 sec

Test stats :test_tube:

Test Results
Failed 0
Passed 4581
Skipped 902
Total 5483

:green_heart: Flaky test report

Tests succeeded.

elasticmachine avatar Jan 26 '24 14:01 elasticmachine

This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-multiple-watchers upstream/fix-multiple-watchers
git merge upstream/main
git push upstream fix-multiple-watchers

mergify[bot] avatar Jan 26 '24 14:01 mergify[bot]

:grey_exclamation: Build Aborted

Either there was a build timeout or someone aborted the build.

Build stats

  • Duration: 8 min 54 sec

elasticmachine avatar Jan 26 '24 14:01 elasticmachine

:grey_exclamation: Build Aborted

Either there was a build timeout or someone aborted the build.

Build stats

  • Duration: 19 min 43 sec

elasticmachine avatar Jan 26 '24 14:01 elasticmachine

:green_heart: Build Succeeded

Build stats

  • Duration: 51 min 9 sec

:grey_exclamation: Flaky test report

No test was executed to be analysed.

elasticmachine avatar Jan 26 '24 15:01 elasticmachine

:green_heart: Build Succeeded

Build stats

  • Duration: 50 min 56 sec

:grey_exclamation: Flaky test report

No test was executed to be analysed.

elasticmachine avatar Jan 29 '24 10:01 elasticmachine

:green_heart: Build Succeeded

Build stats

  • Start Time: 2024-01-29T12:13:25.100+0000

  • Duration: 49 min 42 sec

Test stats :test_tube:

Test Results
Failed 0
Passed 4581
Skipped 902
Total 5483

:green_heart: Flaky test report

Tests succeeded.

elasticmachine avatar Jan 29 '24 13:01 elasticmachine

:grey_exclamation: Build Aborted

Either there was a build timeout or someone aborted the build.

Build stats

  • Duration: 7 min 52 sec

elasticmachine avatar Jan 29 '24 14:01 elasticmachine

:green_heart: Build Succeeded

Build stats

  • Start Time: 2024-01-29T14:44:19.190+0000

  • Duration: 51 min 41 sec

Test stats :test_tube:

Test Results
Failed 0
Passed 4581
Skipped 902
Total 5483

:green_heart: Flaky test report

Tests succeeded.

elasticmachine avatar Jan 29 '24 15:01 elasticmachine

:green_heart: Build Succeeded

Build stats

  • Duration: 50 min 54 sec

:grey_exclamation: Flaky test report

No test was executed to be analysed.

elasticmachine avatar Jan 30 '24 17:01 elasticmachine

:grey_exclamation: Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

Build stats

  • Start Time: 2024-02-02T11:19:36.796+0000

  • Duration: 9 min 8 sec

Steps errors 1

Error signal
  • Took 0 min 0 sec
  • Description: Error 'org.jenkinsci.plugins.workflow.steps.FlowInterruptedException'

elasticmachine avatar Feb 02 '24 11:02 elasticmachine

:green_heart: Build Succeeded

Build stats

  • Duration: 181 min 10 sec

:grey_exclamation: Flaky test report

No test was executed to be analysed.

elasticmachine avatar Feb 02 '24 12:02 elasticmachine

This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @constanca-m? 🙏 To do so, label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, add the backport labels for the branches that need it, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

mergify[bot] avatar Feb 05 '24 14:02 mergify[bot]

This pull request now has conflicts. Could you fix it? 🙏 To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-multiple-watchers upstream/fix-multiple-watchers
git merge upstream/main
git push upstream fix-multiple-watchers

mergify[bot] avatar Feb 19 '24 09:02 mergify[bot]

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

elasticmachine avatar Feb 19 '24 14:02 elasticmachine

CPU and memory usage

To test this, we deploy 50 pods in addition to the default pods in the cluster.

The cluster has a single node.

To run the 50 pods, use ./stress_test_k8s --kubeconfig=/home/c/.kube/config --deployments=5 --namespaces=10 from this directory.

Only the metricsets affected by this change are enabled: state_node, state_deployment, state_daemonset, state_replicaset, state_pod, state_container, state_cronjob, state_statefulset, state_service, state_persistentvolume, state_persistentvolumeclaim, state_storageclass, state_namespace, node, pod, container. The apiserver metricset can also be enabled if needed to check the API calls (it is not possible to filter by pod name there, so it might complicate the test).
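For reference, the metricsets listed above could be enabled with a module configuration along these lines. This is only a sketch: the `hosts` values, `period`, and the token and TLS settings are assumptions borrowed from the usual reference manifests, not the exact configuration used in this test.

```yaml
# state_* metricsets read from kube-state-metrics (host name is an assumption)
- module: kubernetes
  period: 10s
  hosts: ["kube-state-metrics:8080"]
  metricsets:
    - state_node
    - state_deployment
    - state_daemonset
    - state_replicaset
    - state_pod
    - state_container
    - state_cronjob
    - state_statefulset
    - state_service
    - state_persistentvolume
    - state_persistentvolumeclaim
    - state_storageclass
    - state_namespace

# node, pod and container read from the kubelet (endpoint and auth are assumptions)
- module: kubernetes
  period: 10s
  hosts: ["https://${NODE_NAME}:10250"]
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  ssl.verification_mode: "none"
  metricsets:
    - node
    - pod
    - container
```

The split into two module blocks reflects that the state_* metricsets and the kubelet-based metricsets scrape different endpoints.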

Results: image

The first part of the graph is for metricbeat running the 8.12.2 image, while the second part is for metricbeat running an image built from this branch.

CPU usage is not that different between the two, but the metricbeat built from this PR uses clearly less memory.

constanca-m avatar Mar 11 '24 16:03 constanca-m

I repeated the same study, this time on a 5-node cluster with 74 pods.

Results:

image

The lines on the left of each graph are for metricbeat 8.12.2, and the ones on the right are for the custom metricbeat.

There is not much difference in CPU, but strangely, the custom metricbeat uses more memory than metricbeat 8.12.2.

constanca-m avatar Mar 12 '24 09:03 constanca-m

The results in my previous comment were not expected. The image built from this PR should lead to lower memory usage, as we observed in the 1-node cluster.

I decided to run the test again. This time the results were quite different and matched what we expected:

image

On the right are the instances from metricbeat 8.12.2, and on the left the instances from the custom image.

I don't know why we observed higher memory usage before:

  • The number of pods did not increase between the two tests.
  • The image used for each deployment was correct as well, since I use a different manifest for each: one named metricbeat-8-12 and the other metricbeat-custom, as the visualizations show.

I will run a 3rd test to confirm the results.

constanca-m avatar Mar 12 '24 13:03 constanca-m