[feat] Add topic stats and metrics for observing message replay behavior and Key_Shared filtering/blocking behavior
Search before asking
- [X] I searched in the issues and found nothing similar.
Motivation
Currently, it's very challenging to investigate issues related to message replay ("message redelivery controller"). Some examples of this include:
- The "repeated Read-and-discard when using Key_Shared mode" issue mitigated by:
  - https://github.com/apache/pulsar/pull/22245
  - https://github.com/apache/pulsar/pull/21739
  - An older mitigation: #7105
Solution
Add topic stats and metrics for observing message replay and related Key_Shared filtering (hash blocking) behavior.
Specific Metrics to Consider
- Number of messages in redelivery (replay)
- For Key_Shared subscriptions: Ways to observe internal state related to blocked hashes
- Counter for delayed delivery messages being added to redelivery (replay)
Implementation Requirements
- It should be possible to detect replays in topic stats (or internal stats) and also in aggregated metrics
- The aggregated metrics should be usable in monitoring tools (e.g., Grafana dashboards)
- The specific types of metrics (counters, gauges) to be used will be determined in the detailed design phase
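To make the counter/gauge distinction concrete, here is a minimal sketch of how the two kinds of metric would behave for replay tracking. The `ReplayMetrics` class and its method names are hypothetical and do not correspond to any Pulsar class; the point is only that a gauge follows the current size of the replay set while a counter only ever increases.

```python
from dataclasses import dataclass


@dataclass
class ReplayMetrics:
    """Hypothetical holder for replay metrics (illustration only, not Pulsar code)."""

    # Gauge: number of messages currently waiting in replay; goes up and down.
    msgs_in_replay: int = 0
    # Counter: delayed-delivery messages ever added to replay; only increases.
    delayed_added_to_replay_total: int = 0

    def on_added_to_replay(self, count: int, from_delayed_delivery: bool = False) -> None:
        """Record messages entering the replay set."""
        self.msgs_in_replay += count
        if from_delayed_delivery:
            self.delayed_added_to_replay_total += count

    def on_removed_from_replay(self, count: int) -> None:
        """Record messages leaving the replay set; the gauge never drops below zero."""
        self.msgs_in_replay = max(0, self.msgs_in_replay - count)


metrics = ReplayMetrics()
metrics.on_added_to_replay(5)
metrics.on_added_to_replay(3, from_delayed_delivery=True)
metrics.on_removed_from_replay(6)
print(metrics.msgs_in_replay, metrics.delayed_added_to_replay_total)  # prints: 2 3
```

A gauge answers "how many messages are stuck in replay right now", while a counter supports rate queries in dashboards, so the detailed design may well need both.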
Expected Benefits
- Improved observability for message replay and Key_Shared behavior
- Easier troubleshooting of related issues
- Enhanced monitoring capabilities for Pulsar clusters
Are you willing to submit a PR?
- [ ] I'm willing to submit a PR!
It seems that PIP-282 added some subscription stats in https://github.com/apache/pulsar/pull/21953 that improve observability of Key_Shared.
There's already a counter for message redelivery: https://github.com/apache/pulsar/blob/77b6378ae8b9ac83962f71063ad44d6ac57f8e32/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/Consumer.java#L959-L961
However, it isn't currently exposed in the subscription stats. The counter was added as part of the OpenTelemetry changes in https://github.com/apache/pulsar/pull/22693. An ack counter was added at the same place: https://github.com/apache/pulsar/blob/77b6378ae8b9ac83962f71063ad44d6ac57f8e32/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/Consumer.java#L955-L957
I think exposing these in stats would be a non-breaking change, so it wouldn't necessarily require a PIP.
PIP-379: Key_Shared Draining Hashes for Improved Message Ordering covers observability.
#23224 implemented msgInReplay / pulsar_subscription_in_replay.
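As a rough sketch of how the new field could be used, the snippet below scans a topic stats document for subscriptions that currently have messages in replay. It assumes that msgInReplay appears on each entry of the `subscriptions` map in the topic stats JSON; the sample payload is made up for illustration, and the function name `subscriptions_in_replay` is hypothetical.

```python
import json


def subscriptions_in_replay(topic_stats: dict) -> dict:
    """Return {subscription name: msgInReplay} for subscriptions with a non-zero value.

    Assumes msgInReplay sits on each entry of the "subscriptions" map.
    """
    result = {}
    for name, sub in topic_stats.get("subscriptions", {}).items():
        in_replay = sub.get("msgInReplay", 0)
        if in_replay > 0:
            result[name] = in_replay
    return result


# Made-up sample payload, not real broker output.
sample = json.loads("""
{
  "subscriptions": {
    "sub-a": {"type": "Key_Shared", "msgInReplay": 12},
    "sub-b": {"type": "Shared", "msgInReplay": 0}
  }
}
""")

print(subscriptions_in_replay(sample))  # prints: {'sub-a': 12}
```

A value that stays above zero across several polls would be the signal worth alerting on; a single non-zero reading can be normal.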
#23429 adds observability for the PIP-379 Key_Shared implementation through these stats: drainingHashesCount, drainingHashesClearedTotal, drainingHashesUnackedMessages and drainingHashes.
Closing this as resolved by #23224 and #23429 in the PIP-379 implementation.
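For illustration, here is a minimal sketch that summarizes which consumers of a Key_Shared subscription are still draining hashes. It assumes that the fields are reported per consumer, inside the `consumers` list of a subscription's stats, next to a `consumerName` field; that placement is an assumption and should be checked against the actual stats output. The function name `draining_summary` and the sample data are hypothetical.

```python
def draining_summary(subscription_stats: dict) -> list:
    """List (consumer name, draining hash count, unacked messages) for draining consumers.

    Assumes the draining-hash fields are reported on each entry of "consumers".
    Rows are sorted with the largest number of draining hashes first.
    """
    rows = []
    for consumer in subscription_stats.get("consumers", []):
        count = consumer.get("drainingHashesCount", 0)
        if count > 0:
            rows.append((
                consumer.get("consumerName", "<unknown>"),
                count,
                consumer.get("drainingHashesUnackedMessages", 0),
            ))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample data, not real broker output.
sample_subscription = {
    "consumers": [
        {"consumerName": "c1", "drainingHashesCount": 2, "drainingHashesUnackedMessages": 40},
        {"consumerName": "c2", "drainingHashesCount": 0, "drainingHashesUnackedMessages": 0},
        {"consumerName": "c3", "drainingHashesCount": 7, "drainingHashesUnackedMessages": 15},
    ]
}

for name, hashes, unacked in draining_summary(sample_subscription):
    print(name, hashes, unacked)
```

A consumer whose draining hash count stays high while its unacked messages do not go down is a likely place to look when other consumers appear blocked.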