pulsar
pulsar copied to clipboard
[fix][broker][functions-worker] Ensure prometheus metrics are grouped by type (#8407, #13865)
Fixes #8407 Fixes #13865
Motivation
Current broker prometheus metrics are not grouped by metric type which causes issues in systems that read these metrics (e.g. DataDog).
Prometheus docs states "All lines for a given metric must be provided as one single group" - https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md#grouping-and-sorting
Modifications Updated the namespace and topic prometheus metric generators to group the metrics under the appropriate type header. Updated function worker stats to include TYPE headers
Verifying this change
- [x] Make sure that the change passes the CI checks.
This change added tests and can be verified as follows:
Added unit test to verify all metrics are grouped under correct type header
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (no)
- The public API: (no)
- The schema: (no)
- The default values of configurations: (no)
- The wire protocol: (no)
- The rest endpoints: (no)
- The admin cli options: (no)
- Anything that affects deployment: (no)
Documentation
Need to update docs?
- [X]
doc-not-neededChanges to match prometheus spec
@codelipenghui can you look at this?
@eolivelli could you please approve the workflows and review again
/pulsarbot run-failure-checks
/pulsarbot run-failure-checks
/pulsarbot run cpp-tests
@asafm I thought of another way of grouping these metrics, how about write metrics-data into a same ByteBuf which has the same metric name?
In the finally, merge these ByteBufs into stream, seems it won't takes too much costs.
Besides, over all looks good to me.
@tjiuming Maybe you meant @marksilcox ?
I have a note regarding the way we write NamespaceStatsAggregator.
The whole idea in the way the metric system is built today in Pulsar (I know it's fractured) is to try to avoid memory allocation as much as possible. It does so by:
- Using
longandAtomicLongFieldUpdaterinstead ofAtomicLong - Writing the values directly into ByteBuf and then to the stream.
- In some cases directly into the stream
In your refactor, we add a Sample object each containing 2 lists and a value (Double). On the way, we create a Map, and of course a MetricFamilySamples object. The sample is done per metrics we have.
I'm just starting to learn the metrics code in Pulsar, but I think it beats the point of the original code. I believe @merlimat can correct me.
What I suggest is to build the code to achieve the goal you desire, each sample belonging to a specific metric will be written one after another.
How?
Let's collect all TopicStats. We pay the cost of allocating them anyway. Once we have that, we can write a function writing each individual metric samples. Each will call a function that looks like:
writeMetric(stream, "msgsRate", cluster, namespace, topicToStatsMap, topicStats -> topicStats.msgsRate)
WDYT?
@asafm
I may be missing something, but to be able to print the metrics in the Prometheus format we would still need to collate all TopicStats and all AggregatedNamespaceStats as they share metric types (e.g. there is a pulsar_rate_in metric for both the namespace and all topics in that namespace). Then aggregate across all by metric name.
There are usages elsewhere in Pulsar of MetricFamilySamples, which is why I went down this route rather than creating a new model to hold these relationships.
NamespaceStats and TopicStat do share metrics as they use the same metrics but aggregate to different level. One is raw, at topic level, and the other is simple aggregation to namespace level. The includeTopics flag is actually an either: includeTopics actually means, print only topic level metrics, while false means print only namespace level metrics. So there's no collision then no?
I definitely agree that for the developer you'll be spending more time writing this and checking for collisions, and maybe printing from one class one variable and then another class same variable, but you will obtain the end result of not increasing memory allocations. The more I read into Pulsar metrics code, the more I see techniques used to decrease memory allocations. Just today I saw in TransactionsAggregator that they avoid allocating objects and keeping the object in ThreadLocal and reset it every time they need it, like object pool, but with a fixed size (number of threads).
I guess @merlimat and @codelipenghui can contribute their knowledge here on how important that is, as I'm fairly new to this project.
@marksilcox Please provide a correct documentation label for your PR. Instructions see Pulsar Documentation Label Guide.
The problem we have is that we need to collect by metric name, and not by topic/namespace as is currently happening. So for a metric such as pulsar_rate_in whether includeTopics is enabled or not we need all values of this metric grouped under a single type header.
This means whatever is used needs to store each metric in a way so other namespaces or topics can add there values. I guess what I need to find is the most memory efficient way of managing this.
The problem we have is that we need to collect by metric name, and not by topic/namespace as is currently happening. So for a metric such as
pulsar_rate_inwhetherincludeTopicsis enabled or not we need all values of this metric grouped under a single type header.This means whatever is used needs to store each metric in a way so other namespaces or topics can add there values. I guess what I need to find is the most memory efficient way of managing this.
I believe you can stick to the same data structures used now. Just when printing it, you iterate over the different data structures. For example, pulsa_rate_in, you iterate over the the TopicStats, grab the value from there and print it. Do that for every metric.
@asafm I have now updated following the pattern you suggest. I'm not totally familiar with FastThreadLocal but I think I have minimised the object creation where possible. Would be grateful for some further pointers if this could be improved
/pulsarbot run-failure-checks
@asafm That makes sense, I hadn't thought about the impact with that volume of topics (we currently have relatively few in comparison). I will look at using ByteBuf as suggested.
/pulsarbot run-failure-checks
@merlimat Sorry to nag again - could you please take a look at the comments I mentioned you at? I want to be certain here.
LGTM
/pulsarbot run-failure-checks
All looks good on my end :) I'll try to find someone with write permission to wrap this up
@eolivelli Do you want to re-review your requested changes?
/pulsarbot run-failure-checks
@marksilcox I'm compiling a document detailing exactly how the metric system works today in Pulsar, and while doing so I've come to understand how Pulsar Function metrics work. The bad news: They also violate the grouping of the metric name. You will only experience it when you scrape Pulsar Function Worker process metrics or if you run it embedded within a Pulsar Broker.
The code is in FunctionsStatsGenerator.
So, the function there iterates over the different function runtimes. For example, if this worker runs function A with a replication factor of 2, and mode Process, it will spin up two operating system processes running JavaInstanceStarter which executes the function. Inside it runs a Prometheus HTTPServer which exposes its Prometheus metrics on a defined metrics port. Each process exposes the same metric name, so when FunctionsStatsGenerator aggregates them, it simply dumps one output into the same stream. This will probably require a separate PR with a different fix.
@asafm Not sure I will have the bandwidth to look at FunctionsStatsGenerator anytime soon
@marksilcox Your work was great. We can settle for opening an issue and linking this PR to it - what do you say?
Just need to see how to get those tests green. @nlu90 How can we move forward?
/pulsar-bot rerun-failure-checks
there is a checkstyle violation:
[INFO] There is 1 error reported by Checkstyle 8.37 with ../buildtools/src/main/resources/pulsar/checkstyle.xml ruleset. Error: src/main/java/org/apache/pulsar/broker/stats/prometheus/NamespaceStatsAggregator.java:[372] (regexp) RegexpSingleline: Trailing whitespace [INFO] ------------------------------------------------------------------------
@marksilcox Your work was great. We can settle for opening an issue and linking this PR to it - what do you say?
@asafm there is already an open issue (#8407) for this, or is a new one required?