
Add observability metrics for CommandPartitionedTopicMetadata requests

Open · lhotari opened this issue 3 years ago • 7 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Motivation

Currently, there's no way to track CommandPartitionedTopicMetadata requests. There are no metrics or logs that indicate that a broker is handling CommandPartitionedTopicMetadata requests.

Misconfigured clients might flood brokers with CommandPartitionedTopicMetadata requests and cause high CPU consumption.

One example of this is a misconfiguration of splunk-otel-collector's Pulsar exporter: the example config sets pulsar-client-go's PartitionsAutoDiscoveryInterval to 1 nanosecond. I have sent a PR to fix the example config, https://github.com/signalfx/splunk-otel-collector/pull/2185 . This case shows how easy it is to mix up the units and misconfigure a Pulsar client.

Solution

Add observability metrics for CommandPartitionedTopicMetadata requests, similar to the lookup request metrics added in #8272.
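
For illustration only, here is a minimal sketch of what a dedicated counter and latency summary could look like using the Prometheus Java client directly; the metric names and labels are made up and do not reflect the actual names used by the lookup metrics in #8272 or any existing Pulsar code:

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Summary;

// Illustrative only: a dedicated metric for CommandPartitionedTopicMetadata
// requests, separate from the existing lookup and metadata store metrics.
public final class PartitionedMetadataMetrics {

    // Counts requests by entry point ("binary" or "http") and outcome.
    public static final Counter REQUESTS = Counter.build()
            .name("pulsar_broker_partitioned_metadata_requests_total")
            .help("CommandPartitionedTopicMetadata requests handled by the broker")
            .labelNames("source", "result")
            .register();

    // Tracks end-to-end handling latency of these requests.
    public static final Summary LATENCY = Summary.build()
            .name("pulsar_broker_partitioned_metadata_request_latency_seconds")
            .help("Latency of CommandPartitionedTopicMetadata request handling")
            .quantile(0.5, 0.01)
            .quantile(0.99, 0.001)
            .register();

    private PartitionedMetadataMetrics() {}
}
```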

Alternatives

No response

Anything else?

No response

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

lhotari · Oct 28 '22 09:10

Currently we have metadata store metrics. If they meet your needs, I'd like to handle this issue. @lhotari

tjiuming · Oct 28 '22 13:10

Currently we have metadata store metrics. If they meet your needs, I'd like to handle this issue. @lhotari

How are metadata store metrics used currently? I think it could be a breaking change if CommandPartitionedTopicMetadata requests are tracked as part of some other metric. I think it should be a new metric that is unique for CommandPartitionedTopicMetadata requests. @codelipenghui do you have a suggestion?

lhotari · Oct 28 '22 15:10

How are metadata store metrics used currently? I think it could be a breaking change if CommandPartitionedTopicMetadata requests are tracked as part of some other metric. I think it should be a new metric that is unique for CommandPartitionedTopicMetadata requests. @codelipenghui do you have a suggestion?

The metadata store metrics are at the metadata store level, so they provide the metadata store operation latency. The REST API request metrics should be a separate part. The CommandPartitionedTopicMetadata request metrics will not be exactly equal to the metadata store operation metrics; for example, the Jetty thread might be blocked somewhere.

I think Jetty might already provide the ability to expose metrics with a request path label?

codelipenghui · Oct 31 '22 02:10

@codelipenghui @lhotari There are two ways to get PartitionedTopicMetadata: one is ServerCnx#handlePartitionMetadataRequest(CommandPartitionedTopicMetadata partitionMetadata), and the other is PersistentTopics#getPartitionedMetadata(Args ...). If we need to add metrics for both, please assign the issue to me.
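
As a rough sketch only (the recordPartitionedMetadataRequest helper below is hypothetical, not existing Pulsar code), both entry points could report into the same metric, e.g. the PartitionedMetadataMetrics sketch above, with a label distinguishing the binary protocol from the REST path:

```java
// Hypothetical helper, called from both ServerCnx#handlePartitionMetadataRequest
// (source = "binary") and PersistentTopics#getPartitionedMetadata (source = "http").
static void recordPartitionedMetadataRequest(String source, boolean success, double seconds) {
    PartitionedMetadataMetrics.REQUESTS
            .labels(source, success ? "success" : "failure")
            .inc();
    PartitionedMetadataMetrics.LATENCY.observe(seconds);
}
```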

tjiuming · Oct 31 '22 10:10

How are metadata store metrics used currently? I think it could be a breaking change if CommandPartitionedTopicMetadata requests are tracked as part of some other metric. I think it should be a new metric that is unique for CommandPartitionedTopicMetadata requests. @codelipenghui do you have a suggestion?

The metadata store metrics are at the metadata store level, so they provide the metadata store operation latency. The REST API request metrics should be a separate part. The CommandPartitionedTopicMetadata request metrics will not be exactly equal to the metadata store operation metrics; for example, the Jetty thread might be blocked somewhere.

I think Jetty might already provide the ability to expose metrics with a request path label?

I've checked Jetty, and it seems there is no such built-in ability. If we want it, it's not easy, because we need to collapse the concrete request paths into templates, such as /api/v2/persistent/myTenant/myNamespace/partitioned -> /api/v2/persistent/{tenant}/{namespace}/partitioned. It may take some time.
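
To make the idea concrete, here is a rough, hypothetical sketch of that kind of path templating; the pattern and template below only cover the single example path above and are not an existing Pulsar or Jetty facility:

```java
import java.util.regex.Pattern;

// Illustrative only: collapse concrete request paths into templates so that a
// "path" metric label stays low-cardinality.
public final class RestPathTemplates {

    private static final Pattern PARTITIONED_METADATA = Pattern.compile(
            "^/api/v2/persistent/[^/]+/[^/]+/partitioned$");

    public static String normalize(String path) {
        if (PARTITIONED_METADATA.matcher(path).matches()) {
            return "/api/v2/persistent/{tenant}/{namespace}/partitioned";
        }
        // Anything unrecognized goes into a single bucket to avoid
        // unbounded label cardinality.
        return "other";
    }

    private RestPathTemplates() {}
}
```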

tjiuming · Oct 31 '22 11:10

@lhotari @codelipenghui PTAL https://github.com/apache/pulsar/pull/18281

tjiuming · Nov 01 '22 11:11

The PIP discuss thread: https://lists.apache.org/thread/sybl4nno4503w42hzt7b5lsyk6m2rbo6

tjiuming · Nov 03 '22 10:11

The issue had no activity for 30 days; marking it with the Stale label.

github-actions[bot] · Dec 05 '22 02:12