Add queue percentage to libbeat metrics
Proposed commit message
Part of https://github.com/elastic/beats/issues/38708
This adds a queue.full metric that reports the percentage of queue usage in libbeat.
I'm not sure if this is the correct way to measure queue usage in percentage, but this was such a small change I figured it would be faster to just put in a PR and ask, rather than ask and wait.
Testing
You can test this by starting metricbeat with `--httpprof localhost:9898` and then checking the metrics with `curl localhost:9898/debug/vars | grep libbeat`.
The metric will also appear in the periodic "Non-zero metrics in the last 30s" log output.
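As a rough Go equivalent of the `curl ... | grep libbeat` check above, the sketch below filters an expvar-style JSON payload for keys with a given prefix. The helper name `filterVars`, the sample payload, and the flat dotted key names are assumptions for illustration only; the real `/debug/vars` output may be shaped differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// filterVars decodes a JSON object and returns its top-level keys that
// start with prefix, sorted alphabetically. (Hypothetical helper.)
func filterVars(body []byte, prefix string) ([]string, error) {
	var vars map[string]json.RawMessage
	if err := json.Unmarshal(body, &vars); err != nil {
		return nil, err
	}
	keys := make([]string, 0, len(vars))
	for k := range vars {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys, nil
}

func main() {
	// Assumed sample payload; key names are illustrative, not confirmed.
	sample := []byte(`{
		"libbeat.pipeline.queue.max_events": 3200,
		"libbeat.pipeline.queue.filled.pct.events": 0.5,
		"memstats": {}
	}`)
	keys, err := filterVars(sample, "libbeat.")
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, k := range keys {
		fmt.Println(k)
	}
}
```

In practice the payload would come from an HTTP GET against the address passed to `--httpprof` rather than from an inline sample.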
Checklist
- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏. For such, you'll need to label your PR with:
- The upcoming major version of the Elastic Stack
- The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)
To fixup this pull request, you need to add the backport labels for the needed branches, such as:
- `backport-v8./d.0` is the label to automatically backport to the `8./d` branch. `/d` is the digit.
:green_heart: Build Succeeded
- Duration: 113 min 21 sec
@cmacknz if I understand https://github.com/elastic/beats/issues/38708, all the rest of the implementation of this is in integrations/dashboards. Is someone on the integrations side doing that, or are we doing the dashboards too?
I would update the integration as part of this work. If you can't for some reason, create a separate issue to do it so it isn't lost.
Now that we can know the queue size in bytes, I'm just wondering if we should make it more clear that the percentage is calculated from the number of events.
The queue size limit is still configured in units of events and the metric here should match the units of the maximum size limit.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Huh, it's possible for activeEvents > max_events:
{"clients":2,"events":{"active":3201,"published":99200,"total":99200},"queue":{"acked":99200,"filled":{"pct":{"events":1.0003125}},"max_events":3200}}
This happens occasionally, in the metrics, and it's always max_queue+1. I assume it's just a result of the queue itself not being completely synced up with the metrics reporter. That, or an off-by-one. @faec you seen this before?
> This happens occasionally, in the metrics, and it's always max_queue+1. I assume it's just a result of the queue itself not being completely synced up with the metrics reporter. That, or an off-by-one. @faec you seen this before?
Yeah, I've seen this. It's just an artifact of the metrics receiver getting notified of acks slightly after the queue unblocks, so it sees the new event in flight before decrementing the old ones. I've never seen the pipeline report something that's actually wrong, just slightly behind on acks, though I did see recently that the output subtree sometimes has a number that's just flat-out wrong: https://github.com/elastic/beats/issues/39146
Alright, added rounding. I decided to keep the percent calculations "raw" (0.0-1.0, not 0.0-100.0), since that's what we do for the system metrics percentages, and I figure we might as well keep it consistent?