[Metricbeat][Kubernetes] Share watchers between metricsets
Proposed commit message
- WHAT: Metricsets share watchers with each other, instead of each metricset creating its own.
- WHY: Please check the issue https://github.com/elastic/beats/issues/37243.
Details
The watchers needed for each resource are defined in the `getExtraWatchers` function. You can check which watchers are required in the table of the section *Expected watchers* in this issue.
Note: only `state_resourcequota` does not get the expected watchers with this change. This is because we need to change the implementation of that metricset first.
We have a global map that saves all the watchers:

```go
type watchers struct {
	watchersMap map[string]*watcherData
	lock        sync.RWMutex
}
```
The key to this map is the resource name, and the values are defined as:

```go
type watcherData struct {
	metricsetsUsing []string // list of metricsets using this watcher
	watcher         kubernetes.Watcher
	started         bool // true if the watcher has started, false otherwise
	enrichers       map[string]*enricher // map of enrichers using this watcher, keyed by metricset name
	metadataObjects map[string]bool // set of ids of each object received by the handler functions
}
```
- `metricsetsUsing` contains the list of metricsets that are using this watcher. We need this because the watchers start/stop when an enricher calls `Start()` or `Stop()`, and we cannot start a watcher more than once. We only stop a watcher if the list of metricsets using it is empty. We use the metricset name to avoid conflicts between metricsets that use the same resource, like `state_pod` and `pod`.
- `watcher` is the Kubernetes watcher for the resource.
- `started` just tells us if the watcher is started. This is mainly needed for `enricher.Start()` and for testing purposes.
- `enrichers` is the map of enrichers for this watcher, one per metricset.
- `metadataObjects` holds the ids of the metadata objects produced by the resource event handler. Please see the next list, point 6.2, for why this is necessary.
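The start/stop bookkeeping around `metricsetsUsing` can be sketched as below. This is a minimal illustration, not the real code: the helper names `addToMetricsetsUsing` and `removeFromMetricsetsUsing` are hypothetical, and the struct is reduced to the fields that matter here.

```go
package main

import (
	"fmt"
	"sync"
)

// Reduced version of the shared-watcher bookkeeping described above.
type watcherData struct {
	metricsetsUsing []string // metricsets currently depending on this watcher
	started         bool     // whether the underlying watcher has been started
}

type watchers struct {
	watchersMap map[string]*watcherData
	lock        sync.RWMutex
}

// addToMetricsetsUsing records that a metricset depends on the watcher,
// creating the map entry if needed and avoiding duplicates.
func (w *watchers) addToMetricsetsUsing(resource, metricset string) {
	w.lock.Lock()
	defer w.lock.Unlock()
	data, ok := w.watchersMap[resource]
	if !ok {
		data = &watcherData{}
		w.watchersMap[resource] = data
	}
	for _, ms := range data.metricsetsUsing {
		if ms == metricset {
			return
		}
	}
	data.metricsetsUsing = append(data.metricsetsUsing, metricset)
}

// removeFromMetricsetsUsing drops a metricset from the list and reports
// whether the watcher may now be stopped (no metricset left using it).
func (w *watchers) removeFromMetricsetsUsing(resource, metricset string) bool {
	w.lock.Lock()
	defer w.lock.Unlock()
	data, ok := w.watchersMap[resource]
	if !ok {
		return false
	}
	kept := data.metricsetsUsing[:0]
	for _, ms := range data.metricsetsUsing {
		if ms != metricset {
			kept = append(kept, ms)
		}
	}
	data.metricsetsUsing = kept
	return len(kept) == 0
}

func main() {
	w := &watchers{watchersMap: map[string]*watcherData{}}
	w.addToMetricsetsUsing("pod", "pod")
	w.addToMetricsetsUsing("pod", "state_pod")
	fmt.Println(w.removeFromMetricsetsUsing("pod", "pod"))       // false: state_pod still uses it
	fmt.Println(w.removeFromMetricsetsUsing("pod", "state_pod")) // true: last user gone, safe to stop
}
```

This mirrors the rule above: `pod` and `state_pod` share the watcher for the pod resource, and the watcher can only be stopped once both have released it.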
The algorithm goes like this when `NewResourceMetadataEnricher` is called:
1. The configuration is validated. It will return a nil enricher if it fails.
2. We create the configuration needed for the metadata generator. It will return a nil enricher if it fails.
3. We create the K8s client. It will return a nil enricher if it fails.
4. We start all the watchers:
   1. We first check if the resource exists. If it fails, we stop.
   2. We build the `kubernetes.WatchOptions{}` needed for the watcher. If it fails, we stop.
   3. We start the watcher for this specific resource:
      - We first check if the watcher is already created.
      - If it is, then we don't do anything.
      - Otherwise, we create a new watcher and put it in the map with key = resource name.
   4. We add this metricset to the list of metricsets that are using this watcher.
   5. We get all the needed extra resources for this resource, and repeat the watcher-creation step (4.3) for each of them.
5. We create the metadata generators.
6. Lastly, we create the enricher.
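The create-if-absent step (4.3) can be sketched as follows. The `fakeWatcher` type stands in for `kubernetes.Watcher`, and `createWatcher` is an illustrative name, not the actual function:

```go
package main

import "fmt"

// fakeWatcher stands in for kubernetes.Watcher in this sketch.
type fakeWatcher struct{ resource string }

type watcherData struct {
	metricsetsUsing []string
	watcher         *fakeWatcher
}

// Global map keyed by resource name, as described above.
var watchersMap = map[string]*watcherData{}

// createWatcher creates a watcher only if one does not already exist for the
// resource, then registers the metricset as a user. It reports whether a new
// watcher was created.
func createWatcher(resource, metricset string) bool {
	data, exists := watchersMap[resource]
	if !exists {
		data = &watcherData{watcher: &fakeWatcher{resource: resource}}
		watchersMap[resource] = data
	}
	data.metricsetsUsing = append(data.metricsetsUsing, metricset)
	return !exists
}

func main() {
	fmt.Println(createWatcher("namespace", "pod"))             // true: pod creates it first
	fmt.Println(createWatcher("namespace", "state_namespace")) // false: already created by pod
}
```

This is exactly the behavior visible in the logs below, where the namespace watcher is "created by pod" rather than by `state_namespace`.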
Considerations:
- Because each watcher only has one function for `UpdateFunc`/`AddFunc` and `DeleteFunc`, we need to know which metricsets, and their respective enrichers, need that handler function. For this, we keep track of the enrichers in a map, and iterate over that map when one of the functions is triggered.
- It is possible that `AddFunc` is called for one metricset first, and by the time the other metricset starts, `AddFunc` is no longer triggered. To avoid the loss of metadata, we have the map `metadataObjects`, which saves the id of each object that triggered a handler function. This way, upon creation of each enricher, we iterate over this map and, using the `id` saved there, get the object from the watcher store. Using this object, we call the `update` function and ensure all enrichers have up-to-date metadata.
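The replay of `metadataObjects` for a late-starting enricher can be sketched like this. The `store` map stands in for the watcher's cache, and `addEnricher` is an illustrative name for the registration step, not the exact implementation:

```go
package main

import "fmt"

// Minimal enricher: it only records which object ids it was updated with.
type enricher struct {
	name string
	seen []string
}

func (e *enricher) update(id string) { e.seen = append(e.seen, id) }

type watcherData struct {
	store           map[string]string   // id -> object (stands in for the watcher store)
	metadataObjects map[string]bool     // ids already delivered to handler functions
	enrichers       map[string]*enricher
}

// addEnricher registers a new enricher and replays every object the watcher
// has already reported, so a metricset that starts late loses no metadata.
func (w *watcherData) addEnricher(e *enricher) {
	w.enrichers[e.name] = e
	for id := range w.metadataObjects {
		if _, ok := w.store[id]; ok {
			e.update(id)
		}
	}
}

func main() {
	w := &watcherData{
		store:           map[string]string{"ns/pod-1": "pod object"},
		metadataObjects: map[string]bool{"ns/pod-1": true},
		enrichers:       map[string]*enricher{},
	}
	// AddFunc already fired for ns/pod-1 before state_pod started; the
	// replay in addEnricher still delivers that metadata.
	late := &enricher{name: "state_pod"}
	w.addEnricher(late)
	fmt.Println(len(late.seen)) // 1
}
```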
Checklist
- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding changes to the default configuration files
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
How to test this PR locally
- Clone this branch.
- Follow the steps of this README file to launch metricbeat with the changes.
- Check it is working as expected.
Related issues
- Relates to https://github.com/elastic/beats/issues/37243.
Results
Metricbeat
These results come from running Metricbeat with this configuration for the kubernetes module (all metricsets that launch watchers are enabled; the others are not):
```yaml
metricbeat.autodiscover:
  providers:
    - type: kubernetes
      scope: cluster
      node: ${NODE_NAME}
      unique: true
      templates:
        - config:
            - module: kubernetes
              hosts: ["kube-state-metrics:8080"]
              period: 10s
              #add_metadata: true
              metricsets:
                - state_node
                - state_deployment
                - state_daemonset
                - state_replicaset
                - state_pod
                - state_container
                - state_cronjob
                - state_job
                #- state_resourcequota
                - state_statefulset
                - state_service
                - state_persistentvolume
                - state_persistentvolumeclaim
                - state_storageclass
                - state_namespace
            - module: kubernetes
              metricsets:
                - node
                - pod
                - container
              period: 10s
              host: ${NODE_NAME}
              hosts: ["https://${NODE_NAME}:10250"]
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              ssl.verification_mode: "none"
```
The logs for the watchers' initialization will look like this (pay attention to the `message` field):
```
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.094Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher node successfully, created by node.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.116Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher pod successfully, created by pod.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.116Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":368},"message":"Created watcher state_namespace successfully, created by pod.","service.name":"metricbeat","ecs.version":"1.6.0"} <-------------------------------------
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.191Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher deployment successfully, created by state_deployment.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.206Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher daemonset successfully, created by state_daemonset.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.516Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher replicaset successfully, created by state_replicaset.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.118Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher cronjob successfully, created by state_cronjob.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.126Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher statefulset successfully, created by state_statefulset.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.132Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher service successfully, created by state_service.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.139Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher persistentvolume successfully, created by state_persistentvolume.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.146Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher persistentvolumeclaim successfully, created by state_persistentvolumeclaim.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.152Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher storageclass successfully, created by state_storageclass.","service.name":"metricbeat","ecs.version":"1.6.0"}
```
Notice the line marked with `<-------`: the pod metricset created the watcher for namespace, since namespace is one of its required resources and the watcher did not exist yet. This is also the reason why we don't see the line `"message":"Created watcher state_namespace successfully, created by state_namespace."`: by the time `state_namespace` iterates over the needed watchers, they are already created.
In Discover:
Elastic Agent
These results come from running Elastic Agent with this standalone manifest, but with the custom image.
Logs:
These are the logs for starting the watchers (working as expected).
```
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.051Z","message":"Started watcher statefulset successfully, created by statefulset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.051Z","message":"Started watcher state_namespace successfully, created by statefulset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":321,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.257Z","message":"Started watcher node successfully, created by node.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.361Z","message":"Started watcher pod successfully, created by pod.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.470Z","message":"Started watcher deployment successfully, created by deployment.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.577Z","message":"Started watcher persistentvolume successfully, created by persistentvolume.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.679Z","message":"Started watcher persistentvolumeclaim successfully, created by persistentvolumeclaim.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.782Z","message":"Started watcher replicaset successfully, created by replicaset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.887Z","message":"Started watcher service successfully, created by service.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.017Z","message":"Started watcher storageclass successfully, created by storageclass.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.120Z","message":"Started watcher cronjob successfully, created by cronjob.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.225Z","message":"Started watcher daemonset successfully, created by daemonset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.329Z","message":"Started watcher job successfully, created by job.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}
```
The results for the dashboards (checking they still work):
- [x] [Metrics Kubernetes] Cronjobs
- [x] [Metrics Kubernetes] StatefulSets
- [x] [Metrics Kubernetes] Pods
- [x] [Metrics Kubernetes] Deployments
- [x] [Metrics Kubernetes] DaemonSets
- [x] [Metrics Kubernetes] Jobs
- [x] [Metrics Kubernetes] Nodes
- [x] [Metrics Kubernetes] PV/PVC
- [x] [Metrics Kubernetes] Cluster Overview
- [ ] [Metrics Kubernetes] Services - it is broken, but it is not related to the changes in this PR.
Note: only dashboards for resources that launch watchers are considered. There were no changes in the others.
Notes for testing
Important things to consider when testing this PR code changes:
- These PR changes only affect the kubernetes module metricsets that use metadata enrichment. These are: `state_namespace`, `state_node`, `state_deployment`, `state_daemonset`, `state_replicaset`, `state_pod`, `state_container`, `state_job`, `state_cronjob`, `state_statefulset`, `state_service`, `state_persistentvolume`, `state_persistentvolumeclaim`, `state_storageclass`, `pod`, `container`, `node`.
- Everything that was working before these PR changes should still be working. The changes only reduce the number of watchers created by the different metricsets, thus reducing the number of k8s API calls.
- Thorough regression testing is needed. In more detail:
  a. All events coming from the affected metricsets in Kibana should be enriched with their own resource metadata (labels, annotations), as well as kubernetes node metadata and kubernetes namespace metadata when applicable.
  b. When new metadata (like a new label) is added to a resource (i.e. a pod), the new events from the related metricsets (pod, container, state_pod, state_container) should contain the new metadata.
  c. When a new node or namespace label or annotation is added to a node/namespace, the events from the relevant metricsets (state_node, node or state_namespace) should include the new metadata.
  d. The events of the rest of the metricsets (i.e. state_pod or state_deployment) coming from resources in the updated node/namespace won't get the updated node or namespace metadata out of the box.
  e. In order for those events to be updated, there should first be an update in the metadata of those resources. For example, if a node is labeled, the pods of that node won't get the new node label immediately. To get it, we should also add a label to those pods to trigger a watcher event. Then the new events will include the new pod label and the node label.
  f. Test with addition/removal of metadata on pods that run on the leader node and also on non-leader nodes.
This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @constanca-m? 🙏. For such, you'll need to label your PR with:
- The upcoming major version of the Elastic Stack
- The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)
To fixup this pull request, you need to add the backport labels for the needed branches, such as:
- `backport-v8./d.0` is the label to automatically backport to the `8./d` branch. `/d` is the digit.
:grey_exclamation: Build Aborted
There is a new build on-going so the previous on-going builds have been aborted.
Build stats
- Start Time: 2023-12-07T09:16:30.968+0000
- Duration: 8 min 49 sec
:robot: GitHub comments
To re-run your PR in the CI, just comment with:
- `/test`: Re-trigger the build.
- `/package`: Generate the packages and run the E2E tests.
- `/beats-tester`: Run the installation tests with beats-tester.
- `run elasticsearch-ci/docs`: Re-trigger the docs validation. (use unformatted text in the comment!)
:grey_exclamation: Build Aborted
Either there was a build timeout or someone aborted the build.
Build stats
- Duration: 22 min 52 sec
:grey_exclamation: Build Aborted
Either there was a build timeout or someone aborted the build.
Build stats
- Duration: 18 min 18 sec
:green_heart: Build Succeeded
Build stats
- Duration: 50 min 17 sec
:grey_exclamation: Flaky test report
No test was executed to be analysed.
- This PR will also need testing with Agent for sure. We need to build the agent, repeat the same tests, and see that we don't break anything.
- Also, I would need to run some E2E tests to see that the metadata enrichment is OK.
- We will need to decide what a configuration like e.g. `add_resource_metadata.namespace.enabled: false` will do in our case.

> We will need to decide what a configuration like e.g. `add_resource_metadata.namespace.enabled: false` will do in our case.

I think maybe we should move those new decisions to a new PR @gizas
Without having checked the code line by line, I believe that this approach is not aligned with the kubeStateMetricsCache and kubeletStatsCache approach, where we try to solve a similar issue.
I think this approach still works that way and does basically the same thing.
It is harder to use `util/kubernetes` with the `kubernetes` package in the parent folder, because we would hit Go's import cycle error at all times. The only workaround I could find for that would be to pass the functions as parameters, but the code is very hard to read that way.
I added unit tests for every function, and they work just fine. @MichaelKatsoulis
Edit: It is the same approach we are already using for the `state_metricset` shared map.
> It is harder to use `util/kubernetes` with the `kubernetes` package in the parent folder, because we would have the import cycle error in go at all times
You need to define a watchersCache in utils like we do with MetricsRepo: https://github.com/elastic/beats/blob/a8d1567d928680947f5868a1fe94851698f80b11/metricbeat/module/kubernetes/kubernetes.go#L88 and https://github.com/elastic/beats/blob/a8d1567d928680947f5868a1fe94851698f80b11/metricbeat/module/kubernetes/util/metrics_repo.go#L71
Then its pointer can be passed to `NewResourceMetadataEnricher`, like `metricsRepo`.
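A minimal sketch of that suggestion follows. The `WatchersCache` type and the reduced constructor below are hypothetical stand-ins (the real `NewResourceMetadataEnricher` takes many more arguments); the point is only that one cache instance, created once at the module level like `MetricsRepo`, is shared by pointer across all metricsets:

```go
package main

import "fmt"

// WatchersCache is a hypothetical process-wide cache created once by the
// module and handed to every enricher by pointer, so all metricsets share
// the same watcher map. Values are simplified to the creating metricset.
type WatchersCache struct {
	watchers map[string]string // resource -> metricset that created the watcher
}

func NewWatchersCache() *WatchersCache {
	return &WatchersCache{watchers: map[string]string{}}
}

// NewResourceMetadataEnricher stands in for the real constructor: the first
// caller for a resource creates the watcher, later callers reuse it.
func NewResourceMetadataEnricher(metricset string, cache *WatchersCache) {
	if _, ok := cache.watchers["pod"]; !ok {
		cache.watchers["pod"] = metricset
	}
}

func main() {
	shared := NewWatchersCache() // created once at module level, like MetricsRepo
	NewResourceMetadataEnricher("pod", shared)
	NewResourceMetadataEnricher("state_pod", shared)
	fmt.Println(shared.watchers["pod"]) // pod: the first metricset created it, the second reused it
}
```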
Did you test how many watchers are created under the hood with e2e tests? Did you test elastic-agent and metricbeat?

> Did you test how many watchers are created under the hood with e2e tests? Did you test elastic-agent and metricbeat?

Yes, the number of watchers is correct. I posted the results from running metricbeat under Results in the description. I also added the unit tests. EA was also tested; the results are now in the description. @MichaelKatsoulis
Any test in specific I should do? Any specific situation?
> You need to define a watchersCache in utils like we do with MetricsRepo

I updated the code, so now it works like this @MichaelKatsoulis
:green_heart: Build Succeeded
Build stats
- Start Time: 2024-01-08T08:11:22.455+0000
- Duration: 56 min 0 sec
Test stats :test_tube:

| Test | Results |
|---|---|
| Failed | 0 |
| Passed | 4573 |
| Skipped | 902 |
| Total | 5475 |
:green_heart: Flaky test report
Tests succeeded.
:green_heart: Build Succeeded
Build stats
- Start Time: 2024-01-26T13:11:56.339+0000
- Duration: 51 min 13 sec
Test stats :test_tube:

| Test | Results |
|---|---|
| Failed | 0 |
| Passed | 4581 |
| Skipped | 902 |
| Total | 5483 |
:green_heart: Flaky test report
Tests succeeded.
This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
```shell
git fetch upstream
git checkout -b fix-multiple-watchers upstream/fix-multiple-watchers
git merge upstream/main
git push upstream fix-multiple-watchers
```
:grey_exclamation: Build Aborted
Either there was a build timeout or someone aborted the build.
Build stats
- Duration: 8 min 54 sec
:grey_exclamation: Build Aborted
Either there was a build timeout or someone aborted the build.
Build stats
- Duration: 19 min 43 sec
:green_heart: Build Succeeded
Build stats
- Duration: 51 min 9 sec
:grey_exclamation: Flaky test report
No test was executed to be analysed.
:green_heart: Build Succeeded
Build stats
- Duration: 50 min 56 sec
:grey_exclamation: Flaky test report
No test was executed to be analysed.
:green_heart: Build Succeeded
Build stats
- Start Time: 2024-01-29T12:13:25.100+0000
- Duration: 49 min 42 sec
Test stats :test_tube:

| Test | Results |
|---|---|
| Failed | 0 |
| Passed | 4581 |
| Skipped | 902 |
| Total | 5483 |
:green_heart: Flaky test report
Tests succeeded.
:grey_exclamation: Build Aborted
Either there was a build timeout or someone aborted the build.
Build stats
- Duration: 7 min 52 sec
:green_heart: Build Succeeded
Build stats
- Start Time: 2024-01-29T14:44:19.190+0000
- Duration: 51 min 41 sec
Test stats :test_tube:
Test | Results |
---|---|
Failed | 0 |
Passed | 4581 |
Skipped | 902 |
Total | 5483 |
:green_heart: Flaky test report
Tests succeeded.
:green_heart: Build Succeeded
Build stats
- Duration: 50 min 54 sec
:grey_exclamation: Flaky test report
No test was executed to be analysed.
:grey_exclamation: Build Aborted
There is a new build on-going so the previous on-going builds have been aborted.
Build stats
- Start Time: 2024-02-02T11:19:36.796+0000
- Duration: 9 min 8 sec
Steps errors
Error signal
- Took 0 min 0 sec. View more details here
- Description: Error 'org.jenkinsci.plugins.workflow.steps.FlowInterruptedException'
:green_heart: Build Succeeded
Build stats
- Duration: 181 min 10 sec
:grey_exclamation: Flaky test report
No test was executed to be analysed.
This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @constanca-m? 🙏 For such, you'll need to label your PR with:
- The upcoming major version of the Elastic Stack
- The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)
To fixup this pull request, you need to add the backport labels for the needed branches, such as:
- backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit
This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
```
git fetch upstream
git checkout -b fix-multiple-watchers upstream/fix-multiple-watchers
git merge upstream/main
git push upstream fix-multiple-watchers
```
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
CPU and memory usage
To test this, we deploy 50 pods in addition to the default pods in the cluster. We use a 1-node cluster. To run the 50 pods, use ./stress_test_k8s --kubeconfig=/home/c/.kube/config --deployments=5 --namespaces=10 from this directory.
The metricsets enabled are the ones affected by this change, that is: state_node, state_deployment, state_daemonset, state_replicaset, state_pod, state_container, state_cronjob, state_statefulset, state_service, state_persistentvolume, state_persistentvolumeclaim, state_storageclass, state_namespace, node, pod, and container.
Additionally, apiserver is also used if needed to check the API calls (it is not possible to filter by pod name here, so it might complicate the test).
Results:
The first part of the graph is for metricbeat running the 8.12.2 image, while the second part is for metricbeat running an image generated from this branch.
CPU usage is not that different between the two, but the metricbeat built from this PR clearly uses less memory.
Doing the same study, but now for a 5-node cluster with 74 pods.
Results:
The left lines of each graph are from metricbeat 8.12.2 and the right part is from the custom metricbeat.
There is not much difference in CPU but, strangely, the custom metricbeat uses more memory than metricbeat 8.12.2.
The results in the previous comment were not expected: the image created from this PR should lead to lower memory usage, as we observed in the 1-node cluster.
I decided to run it again, and this time the results were quite different and matched the expected ones:
On the right are instances from metricbeat 8.12.2, and on the left instances from the custom image.
I don't know why we observed a higher memory usage before:
- There was no increase in the number of pods between the two tests.
- The image used for the deployments was correct as well, as I am using a different manifest for each: one named metricbeat-8-12 and another metricbeat-custom, as we can see in the visualizations.
I will run a 3rd test to confirm the results.