pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[improve][broker]PIP-333 Add monitor metrics for the number of connections to client IPs and roles

Open yyj8 opened this issue 1 year ago • 12 comments

For detailed improvement instructions, please refer to issues: https://github.com/apache/pulsar/issues/21934

Motivation

Currently, Pulsar does not monitor the number of connections to client IPs and roles. When there are many business teams accessing the cluster, and there are also many topics and production consumers, it is difficult for us to quickly locate which IPs or roles are accessing the cluster abnormally from a global perspective and optimize them. So, we hope to have a monitoring indicator that can statistically analyze the connection information between IP and role dimensions, making it easier to quickly locate problems in the later stage.

Modifications

Add new metircs to ServerCnx.java class:

    private static final Gauge clientIpAndRoleConnections = Gauge.build()
            .name("pulsar_broker_client_ip_role_connections")
            .labelNames("ip", "role")
            .help("The number of client IP and role connections")
            .register();

Documentation

  • [ ] doc
  • [x] doc-required
  • [ ] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository: https://github.com/yyj8/pulsar/pull/7

yyj8 avatar Jan 20 '24 03:01 yyj8

@yyj8 Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

github-actions[bot] avatar Jan 20 '24 03:01 github-actions[bot]

/pulsarbot run-failure-checks

yyj8 avatar Jan 20 '24 05:01 yyj8

This is a good way to kill Prometheus or any other TSDB. Maybe we can include it in broker stats of some sort. Also, you must limit this data structure. You can so easily be at a high number.

asafm avatar Jan 25 '24 11:01 asafm

Agree with @asafm the label name ip maybe will lead to toooo many label values.

dao-jun avatar Jan 26 '24 04:01 dao-jun

Dear @asafm @dao-jun ,Thank you for your help.

I mainly considered two scenarios:

  1. k8s container deployment: In this scenario, for pulsar, the perceived client IP is the IP of the k8s slave node, and the number of this IP is relatively small for most user teams, which may be a few or dozens of node scales. A very small number of users may reach a few hundred node scales, or even more.

  2. Virtual machine or physical machine deployment: In this scenario, for Pulsar, the perceived client IP is virtual machine or physical machine IP, and it is relatively rare for it to reach hundreds or even thousands of servers.

So, can we add a parameter exposeIpRoleLevelMetricsInPrometheus in theconf/broker.conf file to let users decide whether to enable this metrics. Just like the metrics of producers and consumers are exposed.

yyj8 avatar Jan 28 '24 13:01 yyj8

Can a role be enough? It should be low cardinality, I believe.

asafm avatar Jan 29 '24 14:01 asafm

Can a role be enough? It should be low cardinality, I believe.

Do you think the following implementation would be better?

private static final Gauge clientRoleConnections = Gauge.build()
            .name("pulsar_broker_client_role_connections")
            .labelNames("role")
            .help("The number of client role connections")
            .register();

The above metircs may not be able to quickly locate the specific server causing high connectivity in scenarios where the same role has connections on multiple servers. We have encountered this issue in our current production environment.

yyj8 avatar Jan 30 '24 04:01 yyj8

Just thinking out loud: Given you know the role and your app using the client can expose the role so you can pinpoint the hosts running the clients this way, will the number of connections from the client side suffice?

asafm avatar Jan 30 '24 10:01 asafm

Just thinking out loud: Given you know the role and your app using the client can expose the role so you can pinpoint the hosts running the clients this way, will the number of connections from the client side suffice?

Our current pain point is that multiple business teams are accessing the same pulsar cluster, and the current cluster traffic is not high (within 200MB/s), but there are a lot of topics (divided into multiple tenants according to actual customers, nearly 200000 topic partitions). The number of connections in the pulsar cluster has reached nearly 500000, resulting in frequent instances where the proxy or broker set connection count is full and all client requests are rejected, Because there is no place to obtain the connection status of the role dimension or IP dimension summary, it is impossible to find which business is experiencing abnormal access and promote its optimization.

Main scenarios:

  1. Multiple business teams use different roles to access the same pulsar cluster on different client machines;
  2. The same business team uses the same role to access the same pulsar cluster on multiple client machines or multiple apps;

Only with both of these pieces of information can we accurately locate which team had an abnormal access.

yyj8 avatar Jan 31 '24 04:01 yyj8

@yyj8 Just validating your requirements - can you elaborate on why monitoring the connections count from the client side and not from the broker side would be impossible?

asafm avatar Jan 31 '24 08:01 asafm

it is totally hard to make the decision, the metric might be helpful to troubleshooting.

but pulsar broker has tooo many metrics and tooo many configuration items, and if we add ip label for the metric, it could easily kill the time series DB.

Since we can't convince each other, I suggest send a discuss to the mail list, let the community decides.

dao-jun avatar Jan 31 '24 09:01 dao-jun

it is totally hard to make the decision, the metric might be helpful to troubleshooting.

but pulsar broker has tooo many metrics and tooo many configuration items, and if we add ip label for the metric, it could easily kill the time series DB.

Since we can't convince each other, I suggest send a discuss to the mail list, let the community decides.

mail list:https://lists.apache.org/thread/6oyhv6fmzr72oc1hcxw9llwvzlw09cqp

yyj8 avatar Feb 01 '24 02:02 yyj8