
[feat] Add topic stats and metrics for observing message replay behavior and Key_Shared filtering/blocking behavior

Open lhotari opened this issue 1 year ago • 2 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Motivation

Currently, it's very challenging to investigate issues related to message replay ("message redelivery controller"). Some examples of this include:

  • The "repeated Read-and-discard when using Key_Shared mode" issue mitigated by:
    • https://github.com/apache/pulsar/pull/22245
    • https://github.com/apache/pulsar/pull/21739
  • An older mitigation: #7105

Solution

Add topic stats and metrics for observing message replay and related Key_Shared filtering (hash blocking) behavior.

Specific Metrics to Consider

  1. Number of messages in redelivery (replay)
  2. For Key_Shared subscriptions: Ways to observe internal state related to blocked hashes
  3. Counter for delayed delivery messages being added to delivery (replay)

Implementation Requirements

  • It should be possible to detect replays in topic stats (or internal stats) and also in aggregated metrics
  • The aggregated metrics should be usable in monitoring tools (e.g., Grafana dashboards)
  • The specific types of metrics (counters, gauges) to be used will be determined in the detailed design phase

Expected Benefits

  • Improved observability for message replay and Key_Shared behavior
  • Easier troubleshooting of related issues
  • Enhanced monitoring capabilities for Pulsar clusters

Alternatives

No response

Anything else?

No response

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

lhotari, Aug 20 '24 12:08

It seems that PIP-282 added some subscription stats in https://github.com/apache/pulsar/pull/21953 that improve observability of Key_Shared.

lhotari, Aug 22 '24 06:08

There's already a counter for message redelivery: https://github.com/apache/pulsar/blob/77b6378ae8b9ac83962f71063ad44d6ac57f8e32/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/Consumer.java#L959-L961 However, it isn't currently exposed in the subscription stats. This counter was added as part of the OpenTelemetry changes in https://github.com/apache/pulsar/pull/22693. An ack counter was added at the same time: https://github.com/apache/pulsar/blob/77b6378ae8b9ac83962f71063ad44d6ac57f8e32/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/Consumer.java#L955-L957

I think exposing these in stats would be a non-breaking change, so it wouldn't necessarily require a PIP.

lhotari, Aug 22 '24 12:08

#23224 implemented msgInReplay / pulsar_subscription_in_replay.
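
For anyone who wants to check this from a monitoring script, here is a rough sketch of pulling `pulsar_subscription_in_replay` out of a Prometheus text-format scrape and grouping it by topic and subscription. The sample scrape text, the `topic` / `subscription` label names, and the helper name `in_replay_by_subscription` are assumptions made for illustration; check them against what your broker actually emits.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; label names and values are illustrative only.
SAMPLE = '''\
# sample scrape, not real broker output
pulsar_subscription_in_replay{cluster="c1",topic="persistent://public/default/orders",subscription="workers"} 12
pulsar_subscription_in_replay{cluster="c1",topic="persistent://public/default/orders",subscription="audit"} 0
pulsar_subscription_back_log{cluster="c1",topic="persistent://public/default/orders",subscription="workers"} 40
'''

# metric_name{label="value",...} value
_SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def in_replay_by_subscription(text, metric="pulsar_subscription_in_replay"):
    """Sum the given metric per (topic, subscription) pair."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/HELP/TYPE lines
        match = _SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue  # not a sample of the metric we care about
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        key = (labels.get("topic", ""), labels.get("subscription", ""))
        totals[key] += float(match.group("value"))
    return dict(totals)


if __name__ == "__main__":
    for (topic, sub), value in in_replay_by_subscription(SAMPLE).items():
        print(f"{topic} / {sub}: {value:g} messages in replay")
```

A value that stays above zero for a subscription over several scrapes is the kind of signal this issue was asking for; the same number should also be visible as `msgInReplay` in the stats output.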

lhotari, Oct 14 '24 11:10

#23429 adds observability for the PIP-379 Key_Shared implementation: drainingHashesCount, drainingHashesClearedTotal, drainingHashesUnackedMessages and drainingHashes.
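
As a rough sketch of how these fields could be used, the snippet below scans a topic stats document for consumers that still have draining hashes. It assumes the fields appear on each consumer entry under `subscriptions.<name>.consumers[]`; that nesting, the `consumerName` key, the sample JSON, and the helper name are assumptions for illustration and should be verified against real stats output.

```python
import json

# Hypothetical, trimmed-down topic stats document (illustrative values only).
STATS_JSON = '''
{"subscriptions": {"workers": {"msgInReplay": 12, "consumers": [
  {"consumerName": "c-1", "drainingHashesCount": 3, "drainingHashesUnackedMessages": 7},
  {"consumerName": "c-2", "drainingHashesCount": 0, "drainingHashesUnackedMessages": 0}
]}}}
'''


def consumers_with_draining_hashes(stats):
    """Return (subscription, consumer, hash count, unacked messages) tuples."""
    found = []
    for sub_name, sub in stats.get("subscriptions", {}).items():
        for consumer in sub.get("consumers", []):
            count = consumer.get("drainingHashesCount", 0)
            if count > 0:
                found.append((
                    sub_name,
                    consumer.get("consumerName", "?"),
                    count,
                    consumer.get("drainingHashesUnackedMessages", 0),
                ))
    return found


if __name__ == "__main__":
    for row in consumers_with_draining_hashes(json.loads(STATS_JSON)):
        print("subscription=%s consumer=%s drainingHashes=%d unacked=%d" % row)
```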

lhotari, Oct 14 '24 11:10

Closing this as resolved by #23224 and #23429 in the PIP-379 implementation.

lhotari, Oct 14 '24 11:10