Add queue percentage to libbeat metrics

Open fearful-symmetry opened this issue 1 year ago • 7 comments

Proposed commit message

Part of https://github.com/elastic/beats/issues/38708

This adds a queue.full metric that reports the percentage of queue usage in libbeat.

I'm not sure if this is the correct way to measure queue usage in percentage, but this was such a small change I figured it would be faster to just put in a PR and ask, rather than ask and wait.
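For reference, the calculation in question can be sketched as a plain ratio of events currently in the queue to the configured event limit. This is a minimal illustration, not the code in this PR: the function name `fillFraction` and its signature are hypothetical, and it assumes both inputs are event counts.

```go
package main

import "fmt"

// fillFraction returns the share of the queue in use as a raw fraction
// (0.0 means empty, 1.0 means full). Both arguments are event counts.
// The name and signature are hypothetical, not libbeat's API.
func fillFraction(activeEvents, maxEvents uint64) float64 {
	if maxEvents == 0 {
		// No limit to compare against; avoid dividing by zero.
		return 0
	}
	return float64(activeEvents) / float64(maxEvents)
}

func main() {
	fmt.Println(fillFraction(1600, 3200)) // prints 0.5
}
```

Nothing here caps the result at 1.0, so a caller that briefly sees more active events than the limit would get a value slightly above 1.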

Testing

You can test this by starting metricbeat with --httpprof localhost:9898 and then checking the metrics with curl localhost:9898/debug/vars | grep libbeat

The metric will also appear in the last 30s metrics.

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

fearful-symmetry avatar Apr 24 '24 22:04 fearful-symmetry

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

elasticmachine avatar Apr 24 '24 22:04 elasticmachine

This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏. For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

mergify[bot] avatar Apr 24 '24 22:04 mergify[bot]

:green_heart: Build Succeeded

Build stats

  • Duration: 113 min 21 sec

:grey_exclamation: Flaky test report

No test was executed to be analysed.

:robot: GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine avatar Apr 24 '24 22:04 elasticmachine

@cmacknz if I understand https://github.com/elastic/beats/issues/38708, all the rest of the implementation of this is in integrations/dashboards. Is someone on the integrations side doing that, or are we doing the dashboards too?

fearful-symmetry avatar Apr 25 '24 19:04 fearful-symmetry

I would update the integration as part of this work. If you can't for some reason, create a separate issue to do it so it isn't lost.

cmacknz avatar Apr 26 '24 17:04 cmacknz

Now that we can know the queue size in bytes, I'm just wondering if we should make it more clear that the percentage is calculated from the number of events.

The queue size limit is still configured in units of events and the metric here should match the units of the maximum size limit.

cmacknz avatar Apr 26 '24 17:04 cmacknz

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

elasticmachine avatar Apr 28 '24 12:04 elasticmachine

Huh, it's possible for activeEvents > max_events:

{"clients":2,"events":{"active":3201,"published":99200,"total":99200},"queue":{"acked":99200,"filled":{"pct":{"events":1.0003125}},"max_events":3200}}

This happens occasionally in the metrics, and it's always max_events+1. I assume it's just a result of the queue itself not being completely synced up with the metrics reporter. That, or an off-by-one. @faec you seen this before?
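To double-check the numbers in that line, here is a small sketch that decodes the snapshot and recomputes the ratio from events.active and queue.max_events. The struct only mirrors the fields visible in the output above; the type and function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// snapshot mirrors only the fields needed from the metrics line above.
type snapshot struct {
	Events struct {
		Active int64 `json:"active"`
	} `json:"events"`
	Queue struct {
		Filled struct {
			Pct struct {
				Events float64 `json:"events"`
			} `json:"pct"`
		} `json:"filled"`
		MaxEvents int64 `json:"max_events"`
	} `json:"queue"`
}

// sample is the metrics line quoted in this comment, unchanged.
const sample = `{"clients":2,"events":{"active":3201,"published":99200,"total":99200},"queue":{"acked":99200,"filled":{"pct":{"events":1.0003125}},"max_events":3200}}`

// parseSnapshot decodes one metrics line into a snapshot.
func parseSnapshot(raw string) (snapshot, error) {
	var s snapshot
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

func main() {
	s, err := parseSnapshot(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	recomputed := float64(s.Events.Active) / float64(s.Queue.MaxEvents)
	fmt.Printf("reported=%v recomputed=%v events over limit=%d\n",
		s.Queue.Filled.Pct.Events, recomputed, s.Events.Active-s.Queue.MaxEvents)
}
```

The reported value matches 3201/3200, so the percentage itself is computed consistently; the surprise is only that the active count exceeds the limit by one.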

fearful-symmetry avatar Apr 29 '24 17:04 fearful-symmetry

> This happens occasionally, in the metrics, and it's always max_queue+1. I assume it's just a result of the queue itself not being completely synced up with the metrics reporter. That, or an off-by-one. @faec you seen this before?

Yeah, I've seen this. It's just an artifact of the metrics receiver getting notified of acks slightly after the queue unblocks, so it sees the new event in flight before decrementing the old ones. I've never seen the pipeline report something that's actually wrong, just slightly behind on acks, though I did see recently that the output subtree sometimes has a number that's just flat-out wrong: https://github.com/elastic/beats/issues/39146
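A toy model of that ordering, purely for illustration: the `observer` type below is a hypothetical stand-in for the metrics receiver, not libbeat code, and the sequence is replayed deterministically rather than with real goroutines.

```go
package main

import "fmt"

// observer is a hypothetical stand-in for a metrics receiver.
type observer struct {
	active int // events currently counted as in flight
	peak   int // highest value active has reached
}

// published records one new event entering the queue.
func (o *observer) published() {
	o.active++
	if o.active > o.peak {
		o.peak = o.active
	}
}

// acked records n events leaving the queue.
func (o *observer) acked(n int) { o.active -= n }

// simulate fills a queue of maxEvents, then replays one ack whose
// notification reaches the observer only after the freed slot was refilled.
func simulate(maxEvents int) *observer {
	o := &observer{}
	for i := 0; i < maxEvents; i++ {
		o.published()
	}
	// The queue frees one slot and unblocks a producer...
	o.published() // ...so the observer sees the new event first,
	o.acked(1)    // and only then learns about the ack.
	return o
}

func main() {
	o := simulate(3200)
	fmt.Println(o.peak, o.active) // prints 3201 3200
}
```

The count settles back to the limit once the ack arrives, which fits the overshoot being transient and never more than slightly behind.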

faec avatar Apr 29 '24 17:04 faec

Alright, added rounding. I decided to keep the percent calculations "raw" (0.0-1.0, not 0.0-100.0), since that's what we do for the system metrics percentages, and I figure we might as well keep it consistent?
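For anyone following along, rounding a raw fraction could look like the sketch below. The helper `roundTo` and the choice of four decimal places are assumptions for illustration; the comment above does not say which precision the PR uses.

```go
package main

import (
	"fmt"
	"math"
)

// roundTo rounds v to the given number of decimal places.
// Hypothetical helper; the precision used in the PR is not stated here.
func roundTo(v float64, places int) float64 {
	scale := math.Pow(10, float64(places))
	return math.Round(v*scale) / scale
}

func main() {
	// The value stays a raw fraction (0.0-1.0 scale), not a 0-100 percentage.
	fmt.Println(roundTo(1.0003125, 4)) // prints 1.0003
}
```

Rounding only trims digits. It does not clamp the value, so a transient overshoot can still show up as slightly more than 1.0.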

fearful-symmetry avatar Apr 29 '24 20:04 fearful-symmetry