pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[improve][broker]Add the expired message count metric from this subscription within the last interval.

Open zjxxzjwang opened this issue 8 months ago • 16 comments

Motivation

The existing totalMsgExpired metric tracks cumulative message expiration counts since subscription creation, but lacks granularity for time-bound monitoring. This PR introduces a new metric to track message expiration counts per monitoring interval, enabling better observation of expiration patterns through time-series monitoring systems.

Modifications

  1. Added msgExpired counter in subscription statistics tracking message expirations per monitoring interval
  2. Modified PersistentMessageExpiryMonitor to expose new count metric
  3. Updated stats collection in PersistentSubscription to include new metric
  4. Enhanced subscription stats objects to carry the new metric

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

This improvement enables:

  • Monitoring sudden expiration spikes through time-series dashboards
  • Calculating expiration rates between sampling periods
  • Troubleshooting expiration patterns without needing long-term historical data
  • Alignment with other time-bound metrics in Pulsar's monitoring framework

zjxxzjwang avatar Mar 21 '25 11:03 zjxxzjwang

@zjxxzjwang Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

github-actions[bot] avatar Mar 21 '25 11:03 github-actions[bot]

@zjxxzjwang Please revisit the PR title and description. "within a cycle" and "over a long period of time" are vague and confusing and explaining it in other words would be useful. Once you have a draft version with other wording, you can also use the help of GenAI such as DeepSeek by following the instructions at https://gist.github.com/lhotari/81f533af4b9ad515e02d96e543c4408b for improving the title and description and then doing final edits.

lhotari avatar Mar 26 '25 09:03 lhotari

@zjxxzjwang请重新审视 PR 标题和描述。“在一个周期内”和“在很长一段时间内”含糊不清且令人困惑,用其他词语解释会很有用。一旦您有了其他措辞的草稿版本,您还可以按照https://gist.github.com/lhotari/81f533af4b9ad515e02d96e543c4408b上的说明,使用 GenAI(如 DeepSeek)的帮助来改进标题和描述,然后进行最终编辑。

Good suggestion, I have made the necessary modifications

zjxxzjwang avatar Mar 26 '25 11:03 zjxxzjwang

Good suggestion, I have made the necessary modifications

@zjxxzjwang "cycle" is still in the PR title and it's not an existing concept so it could be improved.

lhotari avatar Mar 28 '25 10:03 lhotari

很好的建议,我已做出必要的修改

@zjxxzjwang“循环”仍然在 PR 标题中,它不是一个现有概念,因此可以改进。

@lhotari I have made the modifications, do you think this is appropriate

zjxxzjwang avatar Mar 31 '25 03:03 zjxxzjwang

@lhotari Hello, please help me review it

zjxxzjwang avatar Apr 03 '25 06:04 zjxxzjwang

hello, Can you help me review it @lhotari

zjxxzjwang avatar Apr 07 '25 02:04 zjxxzjwang

hello, Can you help me review it @lhotari

@zjxxzjwang Did you already pass this through LLM like DeepSeek using https://gist.github.com/lhotari/81f533af4b9ad515e02d96e543c4408b ?

The concepts such as "indicator" and "during a sampling period", "timed task execution cycle", or "scheduled task execution period" aren't well known in Pulsar. It would be useful to improve the title and description of this PR further. I'm pretty sure that using an LLM and asking it to address review feedback about the PR title and description will be helpful. It will get to know about it by printing this PR page to PDF and providing it as context.

lhotari avatar Apr 08 '25 12:04 lhotari

你好,你能帮我审核一下吗@lhotari

@zjxxzjwang您是否已经像 DeepSeek 一样使用https://gist.github.com/lhotari/81f533af4b9ad515e02d96e543c4408b通过了 LLM ?

诸如“指标”、“采样周期”、“定时任务执行周期”或“计划任务执行周期”等概念在 Pulsar 中并不常见。进一步改进此 PR 的标题和描述会很有帮助。我确信,使用 LLM 并请求其处理关于 PR 标题和描述的审核反馈会很有帮助。它会通过将此 PR 页面打印为 PDF 并将其作为上下文提供来了解相关信息。

I used Deepseek for modification. Do you think this is feasible now. @lhotari

zjxxzjwang avatar Apr 09 '25 12:04 zjxxzjwang

I used Deepseek for modification. Do you think this is feasible now. @lhotari

@zjxxzjwang Yes, it's better now. I'm sorry for the slow feedback. I've been busy on releases and addressing CI flakiness.

[improve][broker]Add expired message count metric for monitoring periods in subscription stats Added msgExpired counter in subscription statistics tracking message expirations per monitoring interval

Also define what the "monitoring period" and "monitoring interval" is. Reference existing documentation or if there are gaps, explain what this is exactly. The existing documentation doesn't really seem to explain things properly: https://pulsar.apache.org/docs/4.0.x/administration-stats/ . However, there needs to be a way to explain what this is about.

The existing totalMsgExpired metric tracks cumulative message expiration counts since subscription creation, but lacks granularity for time-bound monitoring. This PR introduces a new metric to track message expiration counts per monitoring interval, enabling better observation of expiration patterns through time-series monitoring systems.

It would be useful to expand this description and explain why it "lacks granularity for time-bound monitoring". Referencing previous changes in Pulsar could be useful in this. You can use the git history to research why it's like this currently.

lhotari avatar Apr 11 '25 09:04 lhotari

I used Deepseek for modification. Do you think this is feasible now. @lhotari

@zjxxzjwang Yes, it's better now. I'm sorry for the slow feedback. I've been busy on releases and addressing CI flakiness.

[improve][broker]Add expired message count metric for monitoring periods in subscription stats Added msgExpired counter in subscription statistics tracking message expirations per monitoring interval

Also define what the "monitoring period" and "monitoring interval" is. Reference existing documentation or if there are gaps, explain what this is exactly. The existing documentation doesn't really seem to explain things properly: https://pulsar.apache.org/docs/4.0.x/administration-stats/ . However, there needs to be a way to explain what this is about.

The existing totalMsgExpired metric tracks cumulative message expiration counts since subscription creation, but lacks granularity for time-bound monitoring. This PR introduces a new metric to track message expiration counts per monitoring interval, enabling better observation of expiration patterns through time-series monitoring systems.

It would be useful to expand this description and explain why it "lacks granularity for time-bound monitoring". Referencing previous changes in Pulsar could be useful in this. You can use the git history to research why it's like this currently.

Hello, I have made modifications to the title content based on the description of the "averageMsSize" metric. Please help me review it. @lhotari Clipboard_Screenshot_1744968814

zjxxzjwang avatar Apr 18 '25 09:04 zjxxzjwang

@lhotari Can you help me with the review

zjxxzjwang avatar Apr 24 '25 03:04 zjxxzjwang

Can you help me with the review? @lhotari

zjxxzjwang avatar Apr 27 '25 08:04 zjxxzjwang

Can you help me with the review? @lhotari

zjxxzjwang avatar Apr 30 '25 08:04 zjxxzjwang

@nodece Hello, can you help me merge this PR? Thank you

zjxxzjwang avatar May 26 '25 10:05 zjxxzjwang

/pulsarbot rerun-failure-checks

HQebupt avatar May 27 '25 13:05 HQebupt

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 74.21%. Comparing base (bbc6224) to head (d0e26fb). :warning: Report is 1279 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #24106      +/-   ##
============================================
+ Coverage     73.57%   74.21%   +0.64%     
- Complexity    32624    32675      +51     
============================================
  Files          1877     1868       -9     
  Lines        139502   145368    +5866     
  Branches      15299    16629    +1330     
============================================
+ Hits         102638   107885    +5247     
- Misses        28908    28932      +24     
- Partials       7956     8551     +595     
Flag Coverage Δ
inttests 26.77% <50.00%> (+2.18%) :arrow_up:
systests 23.33% <50.00%> (-0.99%) :arrow_down:
unittests 73.70% <100.00%> (+0.85%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ice/persistent/PersistentMessageExpiryMonitor.java 72.88% <100.00%> (+8.71%) :arrow_up:
...ker/service/persistent/PersistentSubscription.java 76.67% <100.00%> (-0.02%) :arrow_down:
...mon/policies/data/stats/SubscriptionStatsImpl.java 94.17% <100.00%> (+0.05%) :arrow_up:

... and 1082 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov-commenter avatar Jun 23 '25 11:06 codecov-commenter