prom-client
Add openmetrics and exemplars support

Overview
- Added support for the OpenMetrics standard
- Added support for automatically providing exemplars populated with trace information from OpenTelemetry
The main purpose of this PR is to enable the use of exemplars in Node.js codebases.
Relevant resources:
- Grafana exemplars
- Prometheus exemplars
- OpenMetrics spec
- Prometheus format
Design
The OpenMetrics standard is very close to the original Prometheus format. To keep the library as backwards compatible as possible, the default format is unchanged (Prometheus) and exemplars are disabled by default.
The new features can be toggled:
- The format at the registry level (prometheus/openmetrics)
- The exemplars at the metric level
Each registry instance has an attribute (contentType) that decides the format.
The two possible formats are defined by the constants OPENMETRICS_CONTENT_TYPE and PROMETHEUS_CONTENT_TYPE which contain the HTTP content type.
Future versions of the library should default to the OpenMetrics 1.0.0 format.
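A minimal sketch of the registry-level toggle described above. `SketchRegistry` is a hypothetical stand-in, not the library's registry, and the two content type strings are assumed values for the constants named in this PR.

```javascript
// Hypothetical stand-in for a registry; not the prom-client implementation.
// The constant names come from this PR; the string values are assumptions.
const PROMETHEUS_CONTENT_TYPE = 'text/plain; version=0.0.4; charset=utf-8';
const OPENMETRICS_CONTENT_TYPE =
  'application/openmetrics-text; version=1.0.0; charset=utf-8';

class SketchRegistry {
  // The default stays Prometheus so existing users see no change.
  constructor(contentType = PROMETHEUS_CONTENT_TYPE) {
    this.setContentType(contentType);
  }

  setContentType(contentType) {
    if (
      contentType !== PROMETHEUS_CONTENT_TYPE &&
      contentType !== OPENMETRICS_CONTENT_TYPE
    ) {
      throw new TypeError(`Unsupported content type: ${contentType}`);
    }
    this.contentType = contentType;
  }

  // The output format is derived from the registry, not from each metric.
  get format() {
    return this.contentType === OPENMETRICS_CONTENT_TYPE
      ? 'openmetrics'
      : 'prometheus';
  }
}

const legacy = new SketchRegistry();
const modern = new SketchRegistry(OPENMETRICS_CONTENT_TYPE);
console.log(legacy.format, modern.format);
```

Keeping the choice on the registry means a metric definition does not need to know which format it will be rendered in.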
Each metric has a flag for enabling exemplars. The flag lives on the metrics superclass for simplicity, but of the currently implemented metric types only histograms and counters can have exemplars.
The biggest change to the code is the creation of separate functions for Counter increment and Histogram observe. Because the functions need to support a third, optional parameter (exemplar labels), I have changed the way parameters are passed to them. Instead of plain (labels, value), users provide a single object of the form ({labels, value, exemplarLabels}).
The change should not impact existing users, but users who want to use exemplars will need to use the new call format.
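A rough sketch of how the two call formats could coexist, assuming the implementation is selected once from the per-metric flag. `SketchCounter`, `enableExemplars`, and the method names are illustrative only, not necessarily the names used in the code.

```javascript
// Hypothetical sketch; class, option, and method names are illustrative.
class SketchCounter {
  constructor({ name, enableExemplars = false }) {
    this.name = name;
    this.values = new Map(); // serialized labels -> { value, exemplar }
    // Pick the implementation once, based on the per-metric flag.
    this.inc = enableExemplars
      ? this.incWithExemplar
      : this.incWithoutExemplar;
  }

  // Legacy call format: inc(labels, value)
  incWithoutExemplar(labels = {}, value = 1) {
    this.entryFor(labels).value += value;
  }

  // New call format: inc({ labels, value, exemplarLabels })
  incWithExemplar({ labels = {}, value = 1, exemplarLabels = {} } = {}) {
    const entry = this.entryFor(labels);
    entry.value += value;
    entry.exemplar = { labels: exemplarLabels, value, timestamp: Date.now() };
  }

  // Simplification: the key depends on label insertion order.
  entryFor(labels) {
    const key = JSON.stringify(labels);
    if (!this.values.has(key)) {
      this.values.set(key, { value: 0, exemplar: null });
    }
    return this.values.get(key);
  }
}

// Existing users keep calling inc(labels, value).
const plain = new SketchCounter({ name: 'requests' });
plain.inc({ method: 'GET' }, 2);

// Users who opt in to exemplars pass a single object.
const traced = new SketchCounter({ name: 'requests', enableExemplars: true });
traced.inc({
  labels: { method: 'GET' },
  value: 1,
  exemplarLabels: { traceId: 'abc', spanId: 'def' },
});
```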
Exemplar object
Timestamp: the time when the exemplar was created. Reference from the Go client: https://github.com/prometheus/client_golang/blob/1b145cad6847a692bd07e872d64b7102d33213c6/prometheus/histogram.go#L432.
The OpenMetrics spec sets a hard limit of 128 UTF-8 characters on the combined length of an exemplar's label names and values.
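A small sketch of how the length limit could be checked, assuming it applies to the combined label names and values counted in Unicode code points. The function names are hypothetical.

```javascript
// Hypothetical helpers; the names are illustrative.
const MAX_EXEMPLAR_LENGTH = 128;

// Counts Unicode code points rather than UTF-16 code units, so a
// character outside the Basic Multilingual Plane counts once.
function codePointLength(text) {
  return [...text].length;
}

// Sums the lengths of all label names and label values.
function exemplarLabelSetLength(exemplarLabels) {
  let total = 0;
  for (const [name, value] of Object.entries(exemplarLabels)) {
    total += codePointLength(name) + codePointLength(String(value));
  }
  return total;
}

function assertExemplarFits(exemplarLabels) {
  const length = exemplarLabelSetLength(exemplarLabels);
  if (length > MAX_EXEMPLAR_LENGTH) {
    throw new RangeError(
      `Exemplar label set is ${length} characters; the limit is ${MAX_EXEMPLAR_LENGTH}`,
    );
  }
}

assertExemplarFits({ traceId: 'abc', spanId: 'de' }); // 18 characters, passes
```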
The labels used for the out-of-the-box traces are traceId and spanId. That spelling feels more like JavaScript to me; there is no other reason for the name choice. The Go implementation seems to use traceID and the Java implementation uses trace_id. The label used for exemplars can be changed in Grafana.
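For illustration, a sketch of how an exemplar with these labels might be serialized after a sample in the OpenMetrics text format. `formatExemplar` is a hypothetical helper and the IDs are made-up sample values; in the PR the IDs would come from the active OpenTelemetry span.

```javascript
// Hypothetical formatter; the function names and sample IDs are illustrative.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/\n/g, '\\n')
    .replace(/"/g, '\\"');
}

function formatExemplar({ labels, value, timestampMs }) {
  const labelText = Object.entries(labels)
    .map(([name, v]) => `${name}="${escapeLabelValue(v)}"`)
    .join(',');
  // OpenMetrics timestamps are expressed in seconds.
  const seconds = (timestampMs / 1000).toFixed(3);
  return `# {${labelText}} ${value} ${seconds}`;
}

const exemplar = formatExemplar({
  labels: {
    traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
    spanId: '00f067aa0ba902b7',
  },
  value: 1,
  timestampMs: 1620000000123,
});

// The exemplar follows the sample on the same line.
console.log(`mycounter_total 1 ${exemplar}`);
```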
Counters in OpenMetrics
Counters have a breaking change in the form of an enforced _total suffix; it is no longer just a convention. Examples:
Prometheus
# HELP mycounter help
# TYPE mycounter counter
mycounter 0
# HELP mycounter2_total help
# TYPE mycounter2_total counter
mycounter2_total 0
OpenMetrics
# HELP mycounter help
# TYPE mycounter counter
mycounter_total 0
Prometheus ignores the comment lines carrying the name and type, but the name of the metric itself also changes, which can break dashboards, alerts, etc. The current implementation follows the same approach as the Java implementation.
However, instead of applying the suffix at the level of the Counter object, this implementation applies the change in the Registry object. The disadvantage is that the code is less elegant. The advantage is that the change is not breaking in any way for existing users: the _total suffix is only enforced by OpenMetrics registries, not by Prometheus registries.
In the future, when OpenMetrics becomes more widely adopted, the behaviour can be moved inside the Counter object and made mandatory.
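A sketch of the idea of applying the suffix at render time, in the registry rather than in the counter. `renderCounter` is a hypothetical function, the content type strings are assumed values, and the handling of names that already end in _total is an assumption.

```javascript
// Hypothetical renderer; not the code from this PR.
// The content type string values are assumptions.
const PROMETHEUS_CONTENT_TYPE = 'text/plain; version=0.0.4; charset=utf-8';
const OPENMETRICS_CONTENT_TYPE =
  'application/openmetrics-text; version=1.0.0; charset=utf-8';
const SUFFIX = '_total';

function renderCounter({ name, help, value }, contentType) {
  let familyName = name;
  let sampleName = name;
  if (contentType === OPENMETRICS_CONTENT_TYPE) {
    // Assumption: a name that already ends in _total keeps a single suffix.
    familyName = name.endsWith(SUFFIX)
      ? name.slice(0, -SUFFIX.length)
      : name;
    sampleName = `${familyName}${SUFFIX}`;
  }
  return [
    `# HELP ${familyName} ${help}`,
    `# TYPE ${familyName} counter`,
    `${sampleName} ${value}`,
  ].join('\n');
}

const counter = { name: 'mycounter', help: 'help', value: 0 };
console.log(renderCounter(counter, PROMETHEUS_CONTENT_TYPE));
console.log(renderCounter(counter, OPENMETRICS_CONTENT_TYPE));
```

Because the counter object itself is untouched, the same metric renders under its original name in a Prometheus registry.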
Benchmarks
Benchmark tests were not changed. They use the default registry type (Prometheus) and no exemplars, so they check the impact on current users of the library. I ran 4 tests (results in the gists below). The highest impact was on the registry benchmark, with a ~10-15% performance hit.
⚠ registry ➭ getMetricsAsJSON#1 with 64 is 5.345% acceptably slower.
⚠ registry ➭ getMetricsAsJSON#2 with 8 is 3.076% acceptably slower.
⚠ registry ➭ getMetricsAsJSON#2 with 4 and 2 with 2 is 4.063% acceptably slower.
✓ registry ➭ getMetricsAsJSON#2 with 2 and 2 with 4 is 0.1468% faster.
⚠ registry ➭ getMetricsAsJSON#6 with 2 is 4.137% acceptably slower.
✗ registry ➭ metrics#1 with 64 is 11.50% slower.
✓ registry ➭ metrics#2 with 8 is 0.2174% faster.
⚠ registry ➭ metrics#2 with 4 and 2 with 2 is 1.046% acceptably slower.
⚠ registry ➭ metrics#2 with 2 and 2 with 4 is 3.413% acceptably slower.
✗ registry ➭ metrics#6 with 2 is 15.03% slower.
⚠ histogram ➭ observe#1 with 64 is 0.2735% acceptably slower.
⚠ histogram ➭ observe#2 with 8 is 0.5325% acceptably slower.
⚠ histogram ➭ observe#2 with 4 and 2 with 2 is 0.09087% acceptably slower.
⚠ histogram ➭ observe#2 with 2 and 2 with 4 is 0.3243% acceptably slower.
⚠ histogram ➭ observe#6 with 2 is 0.6302% acceptably slower.
✓ gauge ➭ inc is 16.96% faster.
⚠ gauge ➭ inc with labels is 1.991% acceptably slower.
⚠ summary ➭ observe#1 with 64 is 2.261% acceptably slower.
✓ summary ➭ observe#2 with 8 is 2.166% faster.
✓ summary ➭ observe#2 with 4 and 2 with 2 is 0.4552% faster.
⚠ summary ➭ observe#2 with 2 and 2 with 4 is 1.407% acceptably slower.
⚠ summary ➭ observe#6 with 2 is 1.116% acceptably slower.
- https://gist.github.com/voltbit/55bfdafccb5a0458d0b2aff9703dae43
- https://gist.github.com/voltbit/1e1097e6400638334e11da52fefcd5d4
- https://gist.github.com/voltbit/41828df848a7132c1aad196414ea2d69
- https://gist.github.com/voltbit/70539929453a2b95d4f9ac2df6707a9b
TODO
- [x] Complete test coverage of the new features
- [x] Performance tests
- [x] Add better examples and readme info
- [x] ~~Strategy for registry merge when there are different registry formats~~
  - Decided to treat merging registries of different types as undefined behaviour; users should always merge registries of the same type (Prometheus or OpenMetrics).
- [x] Handling the _total suffix on counters
Not implemented
- Support for the _created suffix on any metrics
Very exciting, thanks for working on this!
Hi @zbjornson could you please trigger the tests again? The PR should be ready for review :).
CI is green! I'll try to review this this weekend and hopefully @SimenB and/or @siimon can also review soon.
thanks @voltbit & @zbjornson! we are excited about this feature!
Hi @zbjornson. Any news on this?
Hi guys. Are there, by chance, any updates on how this is progressing and an estimate, hopefully?
Hi @dnutels, I will start working on this again in a couple of weeks. I will implement the requested changes around mid May, but I can't work on it earlier.
New runs for benchmarks with the latest changes.
Summary:
✗ registry ➭ getMetricsAsJSON#1 with 64 is 11.43% slower.
✗ registry ➭ getMetricsAsJSON#2 with 8 is 155.0% slower.
✓ registry ➭ getMetricsAsJSON#2 with 4 and 2 with 2 is 58.48% faster.
✗ registry ➭ getMetricsAsJSON#2 with 2 and 2 with 4 is 23.80% slower.
⚠ registry ➭ getMetricsAsJSON#6 with 2 is 6.297% acceptably slower.
⚠ registry ➭ metrics#1 with 64 is 5.984% acceptably slower.
⚠ registry ➭ metrics#2 with 8 is 1.545% acceptably slower.
⚠ registry ➭ metrics#2 with 4 and 2 with 2 is 0.09495% acceptably slower.
✓ registry ➭ metrics#2 with 2 and 2 with 4 is 35.38% faster.
⚠ registry ➭ metrics#6 with 2 is 3.828% acceptably slower.
✓ histogram ➭ observe#1 with 64 is 0.2496% faster.
⚠ histogram ➭ observe#2 with 8 is 0.8472% acceptably slower.
⚠ histogram ➭ observe#2 with 4 and 2 with 2 is 0.7701% acceptably slower.
✓ histogram ➭ observe#2 with 2 and 2 with 4 is 3.590% faster.
⚠ histogram ➭ observe#6 with 2 is 3.784% acceptably slower.
✓ gauge ➭ inc is 2.087% faster.
✓ gauge ➭ inc with labels is 0.2869% faster.
✓ summary ➭ observe#1 with 64 is 3.924% faster.
✓ summary ➭ observe#2 with 8 is 0.2697% faster.
✓ summary ➭ observe#2 with 4 and 2 with 2 is 1.130% faster.
✓ summary ➭ observe#2 with 2 and 2 with 4 is 0.4797% faster.
✓ summary ➭ observe#6 with 2 is 1.389% faster.
Summary:
✓ registry ➭ getMetricsAsJSON#1 with 64 is 1.990% faster.
✓ registry ➭ getMetricsAsJSON#2 with 8 is 321.0% faster.
⚠ registry ➭ getMetricsAsJSON#2 with 4 and 2 with 2 is 8.586% acceptably slower.
✓ registry ➭ getMetricsAsJSON#2 with 2 and 2 with 4 is 13.94% faster.
✗ registry ➭ getMetricsAsJSON#6 with 2 is 16.18% slower.
✓ registry ➭ metrics#1 with 64 is 20.45% faster.
⚠ registry ➭ metrics#2 with 8 is 0.8526% acceptably slower.
✓ registry ➭ metrics#2 with 4 and 2 with 2 is 5.139% faster.
⚠ registry ➭ metrics#2 with 2 and 2 with 4 is 6.645% acceptably slower.
✓ registry ➭ metrics#6 with 2 is 0.7693% faster.
⚠ histogram ➭ observe#1 with 64 is 3.102% acceptably slower.
✓ histogram ➭ observe#2 with 8 is 1.538% faster.
✓ histogram ➭ observe#2 with 4 and 2 with 2 is 0.4875% faster.
✓ histogram ➭ observe#2 with 2 and 2 with 4 is 0.7254% faster.
✓ histogram ➭ observe#6 with 2 is 2.165% faster.
✓ gauge ➭ inc is 22.42% faster.
⚠ gauge ➭ inc with labels is 1.116% acceptably slower.
✓ summary ➭ observe#1 with 64 is 2.565% faster.
⚠ summary ➭ observe#2 with 8 is 0.2203% acceptably slower.
⚠ summary ➭ observe#2 with 4 and 2 with 2 is 0.08091% acceptably slower.
✓ summary ➭ observe#2 with 2 and 2 with 4 is 1.113% faster.
✓ summary ➭ observe#6 with 2 is 1.459% faster.
Summary:
✗ registry ➭ getMetricsAsJSON#1 with 64 is 12.81% slower.
✗ registry ➭ getMetricsAsJSON#2 with 8 is 563.2% slower.
✓ registry ➭ getMetricsAsJSON#2 with 4 and 2 with 2 is 7.191% faster.
⚠ registry ➭ getMetricsAsJSON#2 with 2 and 2 with 4 is 8.622% acceptably slower.
✗ registry ➭ getMetricsAsJSON#6 with 2 is 13.17% slower.
✗ registry ➭ metrics#1 with 64 is 25.55% slower.
⚠ registry ➭ metrics#2 with 8 is 2.849% acceptably slower.
⚠ registry ➭ metrics#2 with 4 and 2 with 2 is 3.033% acceptably slower.
⚠ registry ➭ metrics#2 with 2 and 2 with 4 is 2.256% acceptably slower.
⚠ registry ➭ metrics#6 with 2 is 4.078% acceptably slower.
⚠ histogram ➭ observe#1 with 64 is 1.686% acceptably slower.
✓ histogram ➭ observe#2 with 8 is 0.3382% faster.
⚠ histogram ➭ observe#2 with 4 and 2 with 2 is 0.2078% acceptably slower.
⚠ histogram ➭ observe#2 with 2 and 2 with 4 is 6.005% acceptably slower.
⚠ histogram ➭ observe#6 with 2 is 2.634% acceptably slower.
✓ gauge ➭ inc is 12.39% faster.
✓ gauge ➭ inc with labels is 2.485% faster.
✓ summary ➭ observe#1 with 64 is 9374% faster.
✓ summary ➭ observe#2 with 8 is 2.167% faster.
✗ summary ➭ observe#2 with 4 and 2 with 2 is 19.97% slower.
⚠ summary ➭ observe#2 with 2 and 2 with 4 is 1.552% acceptably slower.
⚠ summary ➭ observe#6 with 2 is 0.3165% acceptably slower.
Hi @zbjornson. Any news regarding the PR?
Bumping for interest
Bumping for interest again :)
Bumping for interest as well, this would save my organization so much pain!
Bump for interest. I really need this.
Bump for interest. Really important task.
bumping for interest!
@zbjornson @siimon PTAL 🙂
Thanks for all the feedback and all the interest shown! I'll rebase and check the comments as soon as possible (this weekend most likely).
@voltbit is there anything we can do to help you?
Hi @shyimo, @ejba and other contributors, I am not sure I will get the time to work on this again before the holiday season. If this work is urgent for you and you want to pick up the task, feel free to do so.
Hello team, any update? I am waiting for this feature.
There was literally an update in the post before yours
I see the PR is in draft status. Has it been merged to the main branch and published in a new version?
Superseded by #544
Released in https://github.com/siimon/prom-client/releases/tag/v15.0.0-0