hedera-services icon indicating copy to clipboard operation
hedera-services copied to clipboard

feat: New Metrics Framework supporting Labels

Open akugal opened this issue 3 months ago • 6 comments

Description:

This PR proposes a new metrics framework that supports the labels API. Adding labels to the existing framework (even if not breaking the API) will affect all places where metrics are used. Here are some other improvements:

  • Modular architecture (api and external packages) and pluggable exporters
  • No Prometheus library dependency and direct exposition of metrics in OpenMetrics format
  • Reusable snapshot objects to reduce memory footprint during export

Related issue(s): #20088

Notes for reviewer: This is now a draft PR to get early feedback. It is recommended to start review with documentation located in /docs folder of core module. How metrics are created and exported in OpenMetrics text format can be seen in OpenMetricsExportComprehensiveTest. Below are two checklists: one for this PR and another for any other items that could be done later with follow-up PRs. There is also a list of open topics to discuss regarding the framework - find comments with the Open Topic header and feel free to join the conversation.

PR Checklist

  • [x] Documented (Code comments, README, etc.)
  • [ ] Unit tests coverage
  • [x] JMH benchmarks (memory leaks and comparison to Prometheus framework)
  • [ ] Gradle configuration is correct for new modules (gradle/aggregation/build.gradle.kts, etc)
  • [ ] Approved by Architecture Team

Follow-up Items

  • Custom statistics in the org.hiero.metrics.api.stat package have to be revisited and improved
  • Remove dependency on log4j and use JUL.
  • Consider using fixed-point instead of floating-point double
  • CSV file exporter can be optimized to write into 3 files for efficiency: metrics metadata, datapoint metadata, datapoints time series
  • OpenMetrics HTTP endpoint has to be protected with authentication
  • Remove old proposal and clean up GitHub-related issues.

akugal avatar Oct 09 '25 20:10 akugal

:white_check_mark: Snyk checks have passed. No issues have been found so far.

Status Scanner Critical High Medium Low Total (0)
:white_check_mark: Open Source Security 0 0 0 0 0 issues

:computer: Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

lfdt-bot avatar Oct 09 '25 20:10 lfdt-bot

Codecov Report

:x: Patch coverage is 89.42037% with 188 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...org/hiero/metrics/api/stat/FrequencyMovingAvg.java 0.00% 35 Missing :warning:
...t/extension/writer/OpenMetricsSnapshotsWriter.java 83.76% 20 Missing and 11 partials :warning:
.../org/hiero/metrics/api/stat/MovingAverageStat.java 0.00% 30 Missing :warning:
...hiero/metrics/internal/GenericDoubleGaugeImpl.java 0.00% 10 Missing :warning:
...hiero/metrics/api/export/MetricsExportManager.java 86.76% 5 Missing and 4 partials :warning:
...rt/extension/writer/CsvMetricsSnapshotsWriter.java 91.30% 4 Missing and 4 partials :warning:
...s/internal/export/DefaultMetricsExportManager.java 91.78% 2 Missing and 4 partials :warning:
...hiero/metrics/internal/JvmMetricsRegistration.java 86.11% 2 Missing and 3 partials :warning:
...o/metrics/openmetrics/OpenMetricsHttpEndpoint.java 90.90% 2 Missing and 3 partials :warning:
...ro/metrics/api/datapoint/DoubleGaugeDataPoint.java 0.00% 4 Missing :warning:
... and 27 more

Impacted file tree graph

@@             Coverage Diff              @@
##               main   #21522      +/-   ##
============================================
+ Coverage     70.95%   71.25%   +0.29%     
- Complexity    24606    25126     +520     
============================================
  Files          2693     2793     +100     
  Lines        104921   106698    +1777     
  Branches      11032    11189     +157     
============================================
+ Hits          74448    76026    +1578     
- Misses        26453    26609     +156     
- Partials       4020     4063      +43     
Files with missing lines Coverage Δ Complexity Δ
.../main/java/org/hiero/metrics/api/BooleanGauge.java 100.00% <100.00%> (ø) 3.00 <3.00> (?)
...main/java/org/hiero/metrics/api/DoubleCounter.java 100.00% <100.00%> (ø) 3.00 <3.00> (?)
.../main/java/org/hiero/metrics/api/GaugeAdapter.java 100.00% <100.00%> (ø) 3.00 <3.00> (?)
.../java/org/hiero/metrics/api/StatsGaugeAdapter.java 100.00% <100.00%> (ø) 2.00 <2.00> (?)
...java/org/hiero/metrics/api/core/ArrayAccessor.java 100.00% <100.00%> (ø) 1.00 <1.00> (?)
...iero/metrics/api/core/IdempotentMetricsBinder.java 100.00% <100.00%> (ø) 4.00 <4.00> (?)
...rc/main/java/org/hiero/metrics/api/core/Label.java 100.00% <100.00%> (ø) 3.00 <3.00> (?)
...ain/java/org/hiero/metrics/api/core/MetricKey.java 100.00% <100.00%> (ø) 13.00 <13.00> (?)
...ava/org/hiero/metrics/api/core/MetricMetadata.java 100.00% <100.00%> (ø) 3.00 <3.00> (?)
...in/java/org/hiero/metrics/api/core/MetricType.java 100.00% <100.00%> (ø) 1.00 <1.00> (?)
... and 90 more

... and 15 files with indirect coverage changes

Impacted file tree graph

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Oct 09 '25 22:10 codecov[bot]

  • I spent a considerable amount of time reviewing the classes and the Javadocs, but I was unable to determine how to use them without asking. And I mean basic stuff like creating a metric and setting a value. In my opinion, the API is anything but intuitive.
  • I do not really understand why we need to start from scratch with zero compatibility. The migration would require a considerable amount of time, because metrics are used everywhere. I am pretty sure that nobody has scheduled time for this.
  • If the goal was to add labels, this could have been done in a fairly non-disruptive way, and the implementation could have been replaced piece by piece underneath without being noticed by other teams. Instead, we are going all-in with a big bang migration.

In my opinion, we should reconsider the whole strategy.

netopyr avatar Nov 03 '25 13:11 netopyr

@netopyr

  • I spent a considerable amount of time reviewing the classes and the Javadocs, but I was unable to determine how to use them without asking. And I mean basic stuff like creating a metric and setting a value. In my opinion, the API is anything but intuitive.

You can find examples of usage in OpenMetricsExportComprehensiveTest. Can you be more specific about API and problems you see?

  • I do not really understand why we need to start from scratch with zero compatibility. The migration would require a considerable amount of time, because metrics are used everywhere.
  • If the goal was to add labels, this could have been done in a fairly non-disruptive way, and the implementation could have been replaced piece by piece underneath without being noticed by other teams. Instead, we are going all-in with a big bang migration.

Yes, it is possible to add the labels API to the existing framework, and even without breaking the existing client code, but since metrics are used everywhere, that could be a risky change, because labels are a pretty significant internal change. And this approach is actually a big bang, because it affects all the modules right away. Labels add another dimension to the metrics, and changing usage from metrics without labels (basically using labels inside their metric names) to using labels anyway requires time, and this is the goal to benefit from labels by using them. Also, migration is not supposed to be all or nothing - we can migrate the metrics of a single module by module, having two frameworks in place until everything is replaced.

Additionally, the new framework reduces dependencies on other modules and shows better memory consumption test results, even compared to Prometheus, while our existing framework has duplicates for all metrics in Prometheus through adapters, which, in my opinion, is excessive memory consumption. Test and results will be committed soon.

akugal avatar Nov 03 '25 15:11 akugal

Below are the results of the HTTP endpoint benchmark test. Currently, we have about 5k metrics in CN mainnet - this is the same 5k datapoints if we had labels. The test is using 10k data points with labels and compares the next frameworks:

  • Old Prometheus - simpleclient library that is currently used by our existing metrics framework
  • New Prometheus - new version of the library to report metrics to Prometheus.
  • New Framework - new metrics framework from this PR

Test runs 3 mins during which we repeatedly update a random datapoint and call the HTTP endpoint to perform export in OpenMetrics text format using GZIP.

Time comparison:

Benchmark                                                   (dataPointsCount)          (frameworkName)  (labelsBound)  (useGzip)  Mode  Cnt   Score   Error  Units
OpenMetricsHttpComparisonBenchmark.callOpenMetricsEndpoint              10000                  default              4       true  avgt  180   8.634 ± 0.076  ms/op
OpenMetricsHttpComparisonBenchmark.callOpenMetricsEndpoint              10000  prometheus-simpleclient              4       true  avgt  180  23.473 ± 0.109  ms/op
OpenMetricsHttpComparisonBenchmark.callOpenMetricsEndpoint              10000               prometheus              4       true  avgt  180  17.507 ± 0.097  ms/op

Memory comparison:

frameworks_memory_comparison

akugal avatar Nov 03 '25 23:11 akugal

Below is the memory footprint of the test, which simulates a real application using the new metrics framework. It runs 10 mins using 8 threads working with 10k datapoints. Each thread randomly updates some datapoints (random number in the range [0, 10) and sleeps for a random amount of time in the range [100ms, 300ms). Every 3 seconds HTTP call is performed to export metrics in OpenMetrics text format with GZIP compression:

10-min-real-app-simulation

akugal avatar Nov 04 '25 00:11 akugal