hedera-services feat: New Metrics Framework supporting Labels

Description:

This PR proposes a new metrics framework that supports the labels API. Adding labels to the existing framework (even if not breaking the API) will affect all places where metrics are used. Here are some other improvements:

Modular architecture (api and external packages) and pluggable exporters
No Prometheus library dependency and direct exposition of metrics in OpenMetrics format
Reusable snapshot objects to reduce memory footprint during export

Related issue(s): #20088

Notes for reviewer: This is now a draft PR to get early feedback. It is recommended to start review with documentation located in /docs folder of core module. How metrics are created and exported in OpenMetrics text format can be seen in OpenMetricsExportComprehensiveTest. Below are two checklists: one for this PR and another for any other items that could be done later with follow-up PRs. There is also a list of open topics to discuss regarding the framework - find comments with the Open Topic header and feel free to join the conversation.

PR Checklist

[x] Documented (Code comments, README, etc.)
[ ] Unit tests coverage
[x] JMH benchmarks (memory leaks and comparison to Prometheus framework)
[ ] Gradle configuration is correct for new modules (gradle/aggregation/build.gradle.kts, etc)
[ ] Approved by Architecture Team

Follow-up Items

Custom statistics in the org.hiero.metrics.api.stat package have to be revisited and improved
Remove dependency on log4j and use JUL.
Consider using fixed-point instead of floating-point double
CSV file exporter can be optimized to write into 3 files for efficiency: metrics metadata, datapoint metadata, datapoints time series
OpenMetrics HTTP endpoint has to be protected with authentication
Remove old proposal and clean up GitHub-related issues.

Oct 09 '25 20:10 akugal

:white_check_mark: Snyk checks have passed. No issues have been found so far.

Status	Scanner	Critical	High	Medium	Low	Total (0)
:white_check_mark:	Open Source Security	0	0	0	0	0 issues

:computer: Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

Oct 09 '25 20:10 lfdt-bot

Codecov Report

:x: Patch coverage is 89.42037% with 188 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...org/hiero/metrics/api/stat/FrequencyMovingAvg.java	0.00%	35 Missing :warning:
...t/extension/writer/OpenMetricsSnapshotsWriter.java	83.76%	20 Missing and 11 partials :warning:
.../org/hiero/metrics/api/stat/MovingAverageStat.java	0.00%	30 Missing :warning:
...hiero/metrics/internal/GenericDoubleGaugeImpl.java	0.00%	10 Missing :warning:
...hiero/metrics/api/export/MetricsExportManager.java	86.76%	5 Missing and 4 partials :warning:
...rt/extension/writer/CsvMetricsSnapshotsWriter.java	91.30%	4 Missing and 4 partials :warning:
...s/internal/export/DefaultMetricsExportManager.java	91.78%	2 Missing and 4 partials :warning:
...hiero/metrics/internal/JvmMetricsRegistration.java	86.11%	2 Missing and 3 partials :warning:
...o/metrics/openmetrics/OpenMetricsHttpEndpoint.java	90.90%	2 Missing and 3 partials :warning:
...ro/metrics/api/datapoint/DoubleGaugeDataPoint.java	0.00%	4 Missing :warning:
... and 27 more

@@             Coverage Diff              @@
##               main   #21522      +/-   ##
============================================
+ Coverage     70.95%   71.25%   +0.29%     
- Complexity    24606    25126     +520     
============================================
  Files          2693     2793     +100     
  Lines        104921   106698    +1777     
  Branches      11032    11189     +157     
============================================
+ Hits          74448    76026    +1578     
- Misses        26453    26609     +156     
- Partials       4020     4063      +43

Files with missing lines	Coverage Δ	Complexity Δ
.../main/java/org/hiero/metrics/api/BooleanGauge.java	`100.00% <100.00%> (ø)`	`3.00 <3.00> (?)`
...main/java/org/hiero/metrics/api/DoubleCounter.java	`100.00% <100.00%> (ø)`	`3.00 <3.00> (?)`
.../main/java/org/hiero/metrics/api/GaugeAdapter.java	`100.00% <100.00%> (ø)`	`3.00 <3.00> (?)`
.../java/org/hiero/metrics/api/StatsGaugeAdapter.java	`100.00% <100.00%> (ø)`	`2.00 <2.00> (?)`
...java/org/hiero/metrics/api/core/ArrayAccessor.java	`100.00% <100.00%> (ø)`	`1.00 <1.00> (?)`
...iero/metrics/api/core/IdempotentMetricsBinder.java	`100.00% <100.00%> (ø)`	`4.00 <4.00> (?)`
...rc/main/java/org/hiero/metrics/api/core/Label.java	`100.00% <100.00%> (ø)`	`3.00 <3.00> (?)`
...ain/java/org/hiero/metrics/api/core/MetricKey.java	`100.00% <100.00%> (ø)`	`13.00 <13.00> (?)`
...ava/org/hiero/metrics/api/core/MetricMetadata.java	`100.00% <100.00%> (ø)`	`3.00 <3.00> (?)`
...in/java/org/hiero/metrics/api/core/MetricType.java	`100.00% <100.00%> (ø)`	`1.00 <1.00> (?)`
... and 90 more

... and 15 files with indirect coverage changes

:rocket: New features to boost your workflow:

:snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Oct 09 '25 22:10 codecov[bot]

I spent a considerable amount of time reviewing the classes and the Javadocs, but I was unable to determine how to use them without asking. And I mean basic stuff like creating a metric and setting a value. In my opinion, the API is anything but intuitive.
I do not really understand why we need to start from scratch with zero compatibility. The migration would require a considerable amount of time, because metrics are used everywhere. I am pretty sure that nobody has scheduled time for this.
If the goal was to add labels, this could have been done in a fairly non-disruptive way, and the implementation could have been replaced piece by piece underneath without being noticed by other teams. Instead, we are going all-in with a big bang migration.

In my opinion, we should reconsider the whole strategy.

Nov 03 '25 13:11 netopyr

@netopyr

I spent a considerable amount of time reviewing the classes and the Javadocs, but I was unable to determine how to use them without asking. And I mean basic stuff like creating a metric and setting a value. In my opinion, the API is anything but intuitive.

You can find examples of usage in OpenMetricsExportComprehensiveTest. Can you be more specific about API and problems you see?

I do not really understand why we need to start from scratch with zero compatibility. The migration would require a considerable amount of time, because metrics are used everywhere.

If the goal was to add labels, this could have been done in a fairly non-disruptive way, and the implementation could have been replaced piece by piece underneath without being noticed by other teams. Instead, we are going all-in with a big bang migration.

Yes, it is possible to add the labels API to the existing framework, and even without breaking the existing client code, but since metrics are used everywhere, that could be a risky change, because labels are a pretty significant internal change. And this approach is actually a big bang, because it affects all the modules right away. Labels add another dimension to the metrics, and changing usage from metrics without labels (basically using labels inside their metric names) to using labels anyway requires time, and this is the goal to benefit from labels by using them. Also, migration is not supposed to be all or nothing - we can migrate the metrics of a single module by module, having two frameworks in place until everything is replaced.

Additionally, the new framework reduces dependencies on other modules and shows better memory consumption test results, even compared to Prometheus, while our existing framework has duplicates for all metrics in Prometheus through adapters, which, in my opinion, is excessive memory consumption. Test and results will be committed soon.

Nov 03 '25 15:11 akugal

Below are the results of the HTTP endpoint benchmark test. Currently, we have about 5k metrics in CN mainnet - this is the same 5k datapoints if we had labels. The test is using 10k data points with labels and compares the next frameworks:

Old Prometheus - simpleclient library that is currently used by our existing metrics framework
New Prometheus - new version of the library to report metrics to Prometheus.
New Framework - new metrics framework from this PR

Test runs 3 mins during which we repeatedly update a random datapoint and call the HTTP endpoint to perform export in OpenMetrics text format using GZIP.

Time comparison:

Benchmark                                                   (dataPointsCount)          (frameworkName)  (labelsBound)  (useGzip)  Mode  Cnt   Score   Error  Units
OpenMetricsHttpComparisonBenchmark.callOpenMetricsEndpoint              10000                  default              4       true  avgt  180   8.634 ± 0.076  ms/op
OpenMetricsHttpComparisonBenchmark.callOpenMetricsEndpoint              10000  prometheus-simpleclient              4       true  avgt  180  23.473 ± 0.109  ms/op
OpenMetricsHttpComparisonBenchmark.callOpenMetricsEndpoint              10000               prometheus              4       true  avgt  180  17.507 ± 0.097  ms/op

Memory comparison:

frameworks_memory_comparison

Nov 03 '25 23:11 akugal

Below is the memory footprint of the test, which simulates a real application using the new metrics framework. It runs 10 mins using 8 threads working with 10k datapoints. Each thread randomly updates some datapoints (random number in the range [0, 10) and sleeps for a random amount of time in the range [100ms, 300ms). Every 3 seconds HTTP call is performed to export metrics in OpenMetrics text format with GZIP compression:

Nov 04 '25 00:11 akugal