feat: New Metrics Framework supporting Labels
Description:
This PR proposes a new metrics framework that supports a labels API. Adding labels to the existing framework (even without breaking the API) would affect all places where metrics are used. Here are some other improvements:
- Modular architecture (api and external packages) and pluggable exporters
- No Prometheus library dependency and direct exposition of metrics in OpenMetrics format
- Reusable snapshot objects to reduce memory footprint during export
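To make the labels API concrete, below is a minimal, self-contained sketch of how a labeled counter could behave: one metric with one datapoint per label combination. The class and method names (LabeledCounter, labels) are illustrative only and are not the PR's actual API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch: a labeled counter keeps one LongAdder per label value,
// created lazily on first use. Not the PR's actual API.
public class LabeledCounter {
    private final String name;
    private final Map<String, LongAdder> series = new ConcurrentHashMap<>();

    public LabeledCounter(String name) {
        this.name = name;
    }

    // Select (or lazily create) the datapoint for one label combination.
    public LongAdder labels(String labelValue) {
        return series.computeIfAbsent(labelValue, v -> new LongAdder());
    }

    public long value(String labelValue) {
        return labels(labelValue).sum();
    }

    public static void main(String[] args) {
        LabeledCounter requests = new LabeledCounter("http_requests_total");
        requests.labels("GET").increment();
        requests.labels("GET").increment();
        requests.labels("POST").increment();
        System.out.println(requests.value("GET"));  // 2
        System.out.println(requests.value("POST")); // 1
    }
}
```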
Related issue(s): #20088
Notes for reviewer:
This is now a draft PR to get early feedback.
It is recommended to start the review with the documentation located in the /docs folder of the core module.
How metrics are created and exported in OpenMetrics text format can be seen in OpenMetricsExportComprehensiveTest.
Below are two checklists: one for this PR and another for any other items that could be done later with follow-up PRs.
There is also a list of open topics to discuss regarding the framework - find comments with the Open Topic header and feel free to join the conversation.
PR Checklist
- [x] Documented (Code comments, README, etc.)
- [ ] Unit tests coverage
- [x] JMH benchmarks (memory leaks and comparison to Prometheus framework)
- [ ] Gradle configuration is correct for new modules (gradle/aggregation/build.gradle.kts, etc.)
- [ ] Approved by Architecture Team
Follow-up Items
- Custom statistics in the org.hiero.metrics.api.stat package have to be revisited and improved
- Remove dependency on log4j and use JUL
- Consider using fixed-point instead of floating-point double
- CSV file exporter can be optimized to write into 3 files for efficiency: metrics metadata, datapoint metadata, datapoints time series
- OpenMetrics HTTP endpoint has to be protected with authentication
- Remove old proposal and clean up GitHub-related issues.
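For the authentication follow-up item, here is a minimal sketch using the JDK's built-in HttpServer and BasicAuthenticator. The path, realm, credentials, and placeholder payload are all assumptions for illustration; this is not the PR's endpoint code, and a real deployment would use a proper credential store.

```java
import com.sun.net.httpserver.BasicAuthenticator;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch: protect a /metrics endpoint with HTTP Basic auth using the JDK server.
public class SecuredMetricsEndpoint {
    public static HttpServer start(int port, String user, String pass) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        var context = server.createContext("/metrics", exchange -> {
            // Placeholder body; a real handler would serialize the metrics snapshot.
            byte[] body = "# EOF\n".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type",
                    "application/openmetrics-text; version=1.0.0; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        context.setAuthenticator(new BasicAuthenticator("metrics") {
            @Override
            public boolean checkCredentials(String u, String p) {
                // Replace with a real credential store in production.
                return user.equals(u) && pass.equals(p);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(0, "scraper", "secret"); // port 0 = ephemeral
        System.out.println("listening on port " + server.getAddress().getPort());
        server.stop(0);
    }
}
```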
:white_check_mark: Snyk checks have passed. No issues have been found so far.
| Status | Scanner | Critical | High | Medium | Low | Total (0) |
|---|---|---|---|---|---|---|
| :white_check_mark: | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
Codecov Report
:x: Patch coverage is 89.42037% with 188 lines in your changes missing coverage. Please review.
```diff
@@             Coverage Diff              @@
##               main   #21522      +/-   ##
============================================
+ Coverage     70.95%   71.25%   +0.29%
- Complexity    24606    25126     +520
============================================
  Files          2693     2793     +100
  Lines        104921   106698    +1777
  Branches      11032    11189     +157
============================================
+ Hits          74448    76026    +1578
- Misses        26453    26609     +156
- Partials       4020     4063      +43
```
| Files with missing lines | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| .../main/java/org/hiero/metrics/api/BooleanGauge.java | 100.00% <100.00%> (ø) | 3.00 <3.00> (?) | |
| ...main/java/org/hiero/metrics/api/DoubleCounter.java | 100.00% <100.00%> (ø) | 3.00 <3.00> (?) | |
| .../main/java/org/hiero/metrics/api/GaugeAdapter.java | 100.00% <100.00%> (ø) | 3.00 <3.00> (?) | |
| .../java/org/hiero/metrics/api/StatsGaugeAdapter.java | 100.00% <100.00%> (ø) | 2.00 <2.00> (?) | |
| ...java/org/hiero/metrics/api/core/ArrayAccessor.java | 100.00% <100.00%> (ø) | 1.00 <1.00> (?) | |
| ...iero/metrics/api/core/IdempotentMetricsBinder.java | 100.00% <100.00%> (ø) | 4.00 <4.00> (?) | |
| ...rc/main/java/org/hiero/metrics/api/core/Label.java | 100.00% <100.00%> (ø) | 3.00 <3.00> (?) | |
| ...ain/java/org/hiero/metrics/api/core/MetricKey.java | 100.00% <100.00%> (ø) | 13.00 <13.00> (?) | |
| ...ava/org/hiero/metrics/api/core/MetricMetadata.java | 100.00% <100.00%> (ø) | 3.00 <3.00> (?) | |
| ...in/java/org/hiero/metrics/api/core/MetricType.java | 100.00% <100.00%> (ø) | 1.00 <1.00> (?) | |
| ... and 90 more | | | |
... and 15 files with indirect coverage changes
- I spent a considerable amount of time reviewing the classes and the Javadocs, but I was unable to determine how to use them without asking. And I mean basic stuff like creating a metric and setting a value. In my opinion, the API is anything but intuitive.
- I do not really understand why we need to start from scratch with zero compatibility. The migration would require a considerable amount of time, because metrics are used everywhere. I am pretty sure that nobody has scheduled time for this.
- If the goal was to add labels, this could have been done in a fairly non-disruptive way, and the implementation could have been replaced piece by piece underneath without being noticed by other teams. Instead, we are going all-in with a big bang migration.
In my opinion, we should reconsider the whole strategy.
@netopyr

> I spent a considerable amount of time reviewing the classes and the Javadocs, but I was unable to determine how to use them without asking. And I mean basic stuff like creating a metric and setting a value. In my opinion, the API is anything but intuitive.

You can find examples of usage in OpenMetricsExportComprehensiveTest. Can you be more specific about the API and the problems you see?

> I do not really understand why we need to start from scratch with zero compatibility. The migration would require a considerable amount of time, because metrics are used everywhere.

> If the goal was to add labels, this could have been done in a fairly non-disruptive way, and the implementation could have been replaced piece by piece underneath without being noticed by other teams. Instead, we are going all-in with a big bang migration.
Yes, it is possible to add a labels API to the existing framework, even without breaking existing client code. But since metrics are used everywhere, that would still be a risky change: labels are a significant internal change, so that approach is also a big bang, because it affects all modules at once. Labels add another dimension to metrics, and moving call sites from label-free usage (where label values are effectively encoded in metric names) to real labels takes time regardless; that migration is exactly how we get the benefit of labels. Also, the migration is not supposed to be all or nothing: we can migrate module by module, keeping both frameworks in place until everything is replaced.
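To illustrate the contrast between encoding a dimension in the metric name and expressing it as a label, here is a small sketch that renders both styles in OpenMetrics-like text. The rendering helper and the metric names are illustrative, not the PR's exporter.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: render a datapoint either with the dimension baked into the name
// (old style) or as an OpenMetrics label (new style).
public class LabelRendering {
    static String render(String name, Map<String, String> labels, double value) {
        if (labels.isEmpty()) {
            return name + " " + value;
        }
        String labelStr = labels.entrySet().stream()
                .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
                .collect(Collectors.joining(","));
        return name + "{" + labelStr + "} " + value;
    }

    public static void main(String[] args) {
        // Old style: one metric per dimension value, dimension in the name.
        System.out.println(render("platform_trans_count_node3", Map.of(), 42.0));
        // Labeled style: one metric, the dimension as a label.
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("node", "3");
        System.out.println(render("platform_trans_count", labels, 42.0));
    }
}
```

With labels, queries can aggregate across the dimension instead of matching name patterns, which is the payoff that motivates touching the call sites at all.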
Additionally, the new framework reduces dependencies on other modules and shows better memory consumption in tests, even compared to Prometheus, while our existing framework duplicates every metric in Prometheus through adapters, which, in my opinion, is excessive memory consumption. Tests and results will be committed soon.
Below are the results of the HTTP endpoint benchmark test. Currently, we have about 5k metrics in CN mainnet, which would be the same 5k datapoints if we had labels. The test uses 10k datapoints with labels and compares the following frameworks:
- Old Prometheus - the simpleclient library that is currently used by our existing metrics framework
- New Prometheus - the new version of the library to report metrics to Prometheus
- New Framework - the new metrics framework from this PR
The test runs for 3 minutes, during which we repeatedly update a random datapoint and call the HTTP endpoint to export metrics in OpenMetrics text format using GZIP.
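The GZIP step the benchmark exercises can be sketched with the JDK's GZIPOutputStream: compress an OpenMetrics text payload as the endpoint would before sending it to the scraper. The sample payload below is an illustrative fragment, not output from the framework.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch: GZIP-compress (and decompress) an OpenMetrics text payload.
public class GzipExport {
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    static String gunzip(byte[] data) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Illustrative OpenMetrics fragment for one labeled counter.
        String payload = "# TYPE demo_counter counter\n"
                + "demo_counter_total{node=\"3\"} 42\n"
                + "# EOF\n";
        byte[] compressed = gzip(payload);
        System.out.println("raw=" + payload.length() + " bytes, gzip=" + compressed.length + " bytes");
        System.out.println(gunzip(compressed).equals(payload)); // round-trips
    }
}
```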
Time comparison:
```
Benchmark                                                   (dataPointsCount)  (frameworkName)           (labelsBound)  (useGzip)  Mode  Cnt   Score   Error  Units
OpenMetricsHttpComparisonBenchmark.callOpenMetricsEndpoint              10000  default                               4       true  avgt  180   8.634 ± 0.076  ms/op
OpenMetricsHttpComparisonBenchmark.callOpenMetricsEndpoint              10000  prometheus-simpleclient               4       true  avgt  180  23.473 ± 0.109  ms/op
OpenMetricsHttpComparisonBenchmark.callOpenMetricsEndpoint              10000  prometheus                            4       true  avgt  180  17.507 ± 0.097  ms/op
```
Memory comparison:
Below is the memory footprint of the test, which simulates a real application using the new metrics framework. It runs for 10 minutes using 8 threads working with 10k datapoints. Each thread randomly updates some datapoints (a random number in the range [0, 10)) and sleeps for a random amount of time in the range [100ms, 300ms). Every 3 seconds, an HTTP call is performed to export metrics in OpenMetrics text format with GZIP compression:
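The per-thread update/sleep loop of the simulation described above could be sketched as follows. The class name and the LongAdder array standing in for real datapoints are illustrative; parameters mirror the description ([0, 10) updates per round, [100ms, 300ms) sleep).

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.LongAdder;

// Sketch of one worker thread from the memory test: each round updates a
// random number of datapoints in [0, 10), then sleeps for [100ms, 300ms).
public class SimulationThread implements Runnable {
    private final LongAdder[] datapoints;
    private volatile boolean running = true;

    public SimulationThread(LongAdder[] datapoints) {
        this.datapoints = datapoints;
    }

    // One simulation round: update [0, 10) random datapoints; returns the count.
    public int step() {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int updates = rnd.nextInt(0, 10);
        for (int i = 0; i < updates; i++) {
            datapoints[rnd.nextInt(datapoints.length)].increment();
        }
        return updates;
    }

    @Override
    public void run() {
        while (running) {
            step();
            try {
                Thread.sleep(ThreadLocalRandom.current().nextLong(100, 300));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public void stop() {
        running = false;
    }
}
```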