
Add openmetrics and exemplars support

Open voltbit opened this issue 4 years ago • 14 comments

example

Overview

  • Added support for the OpenMetrics standard
  • Added support for automatically providing exemplars populated with trace information from OpenTelemetry

The main purpose of the PR is to enable the use of exemplars in Node.js codebases.

Relevant resources:

  • Grafana exemplars
  • Prometheus exemplars
  • OpenMetrics spec
  • Prometheus format

Design

The OpenMetrics standard is very close to the original Prometheus format. To keep the library as backwards compatible as possible, the default format is unchanged (Prometheus) and exemplars are disabled by default.

The new features can be toggled:

  • The format at the registry level (prometheus/openmetrics)
  • The exemplars at the metric level

Each registry instance has an attribute (contentType) that determines the format. The two possible formats are defined by the constants OPENMETRICS_CONTENT_TYPE and PROMETHEUS_CONTENT_TYPE, which contain the HTTP content type. Future versions should default to the OpenMetrics 1.0.0 format.
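As a rough illustration of the registry-level switch, here is a minimal sketch. SketchRegistry, setContentType and isOpenMetrics are hypothetical names used only for this example and are not taken from the PR; the two constant values are the standard HTTP content types for the Prometheus text format and OpenMetrics 1.0.0.

```javascript
// Hypothetical model of a registry-level format switch.
// This is NOT prom-client's implementation, only an illustration.
const PROMETHEUS_CONTENT_TYPE = 'text/plain; version=0.0.4; charset=utf-8';
const OPENMETRICS_CONTENT_TYPE =
  'application/openmetrics-text; version=1.0.0; charset=utf-8';

class SketchRegistry {
  // Defaults to the Prometheus format to stay backwards compatible.
  constructor(contentType = PROMETHEUS_CONTENT_TYPE) {
    this.setContentType(contentType);
  }

  // Rejects anything other than the two known formats.
  setContentType(contentType) {
    if (
      contentType !== PROMETHEUS_CONTENT_TYPE &&
      contentType !== OPENMETRICS_CONTENT_TYPE
    ) {
      throw new TypeError(`Unsupported content type: ${contentType}`);
    }
    this.contentType = contentType;
  }

  get isOpenMetrics() {
    return this.contentType === OPENMETRICS_CONTENT_TYPE;
  }
}

const legacyRegistry = new SketchRegistry();
const openMetricsRegistry = new SketchRegistry(OPENMETRICS_CONTENT_TYPE);
console.log(legacyRegistry.isOpenMetrics, openMetricsRegistry.isOpenMetrics);
```

Keeping the format as a single attribute means the serializer can branch on it in one place while metric objects stay format-agnostic.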

Each metric has a flag for enabling exemplars. The flag lives on the metric superclass for simplicity, but of the currently implemented metric types only histograms and counters can have exemplars.

The biggest change to the code is the creation of separate functions for Counter increment and Histogram observe. Because these functions need to support a third optional parameter (exemplar labels), I changed the way parameters are passed to them. Instead of plain (labels, value), users provide a single object of the form ({labels, value, exemplarLabels}). The change should not affect existing users, but users who want exemplars will need to use the new call format.
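To make the two call formats concrete, here is a minimal sketch of a counter that picks its increment function from a per-metric flag. SketchCounter, incPlain, incWithExemplar and the enableExemplars option are assumptions made for this example; the PR's actual function names may differ.

```javascript
// Hypothetical counter sketch; names are illustrative, not the PR's code.
class SketchCounter {
  constructor({ name, enableExemplars = false }) {
    this.name = name;
    this.samples = new Map(); // serialized labels -> { labels, value, exemplar? }
    // Pick the call format once, based on the per-metric flag.
    this.inc = enableExemplars ? this.incWithExemplar : this.incPlain;
  }

  // Legacy call format: inc(labels, value)
  incPlain(labels = {}, value = 1) {
    this.add(labels, value, undefined);
  }

  // New call format: inc({ labels, value, exemplarLabels })
  incWithExemplar({ labels = {}, value = 1, exemplarLabels } = {}) {
    this.add(labels, value, exemplarLabels);
  }

  add(labels, value, exemplarLabels) {
    if (value < 0) throw new RangeError('Counters can only increase');
    const key = JSON.stringify(labels);
    const sample = this.samples.get(key) ?? { labels, value: 0 };
    sample.value += value;
    if (exemplarLabels) {
      // Keep only the most recent exemplar for this label set.
      sample.exemplar = {
        labels: exemplarLabels,
        value,
        timestamp: Date.now() / 1000,
      };
    }
    this.samples.set(key, sample);
  }
}

const plainCounter = new SketchCounter({ name: 'http_requests' });
plainCounter.inc({ method: 'GET' }, 2);

const tracedCounter = new SketchCounter({
  name: 'http_requests',
  enableExemplars: true,
});
tracedCounter.inc({
  labels: { method: 'GET' },
  value: 1,
  exemplarLabels: { traceId: 'abc', spanId: 'def' },
});
```

Existing callers keep the positional form untouched; only metrics created with the flag switch to the object form.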

Exemplar object

Timestamp: the time when the exemplar was created. Reference from the Golang client: https://github.com/prometheus/client_golang/blob/1b145cad6847a692bd07e872d64b7102d33213c6/prometheus/histogram.go#L432

There is a hard limit of 128 UTF-8 characters on the combined length of an exemplar's label names and values.

The labels used for out-of-the-box traces are traceId and spanId. The names feel more like JavaScript to me; there is no other reason for the choice. The Golang implementation seems to use traceID and the Java implementation uses trace_id. The label used for exemplars can be changed in Grafana.
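The exemplar rules above can be sketched as follows. The helper names (exemplarLength, createExemplar, formatExemplar) are hypothetical, and the sketch assumes the 128 limit counts Unicode code points across all label names and values. Label values are not escaped here, which a real serializer would need to do.

```javascript
// Assumption: the limit applies to label names plus values, combined.
const EXEMPLAR_MAX_CODE_POINTS = 128;

function exemplarLength(labels) {
  // Spreading a string iterates by Unicode code point, not UTF-16 unit.
  return Object.entries(labels).reduce(
    (total, [name, value]) =>
      total + [...name].length + [...String(value)].length,
    0,
  );
}

function createExemplar(labels, value, timestamp = Date.now() / 1000) {
  const length = exemplarLength(labels);
  if (length > EXEMPLAR_MAX_CODE_POINTS) {
    throw new RangeError(
      `Exemplar labels use ${length} code points; the limit is ${EXEMPLAR_MAX_CODE_POINTS}`,
    );
  }
  return { labels, value, timestamp };
}

// Renders the part that follows a sample line in OpenMetrics text output.
function formatExemplar({ labels, value, timestamp }) {
  const pairs = Object.entries(labels)
    .map(([name, labelValue]) => `${name}="${labelValue}"`)
    .join(',');
  return `# {${pairs}} ${value} ${timestamp}`;
}

const sketchExemplar = createExemplar(
  { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: '00f067aa0ba902b7' },
  0.25,
  1637913600,
);
console.log(formatExemplar(sketchExemplar));
// prints: # {traceId="4bf92f3577b34da6a3ce929d0e0e4736",spanId="00f067aa0ba902b7"} 0.25 1637913600
```

A 32-character trace ID plus a 16-character span ID and their label names use 61 of the 128 available characters, so there is limited room for extra exemplar labels.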

Counters in OpenMetrics

Counters have a breaking change in the form of an enforced _total suffix; it is not just a convention anymore. Examples:

Prometheus

# HELP mycounter help
# TYPE mycounter counter
mycounter 0

# HELP mycounter2_total help
# TYPE mycounter2_total counter
mycounter2_total 0

OpenMetrics

# HELP mycounter help
# TYPE mycounter counter
mycounter_total 0

Prometheus ignores the comments related to name and type (the HELP and TYPE lines), but the name of the metric itself changes too, which has the potential to break dashboards, alerts, and so on. The current implementation follows the same approach as the Java implementation here:

https://github.com/prometheus/client_java/blob/master/simpleclient/src/main/java/io/prometheus/client/Counter.java#L72-L108

However, instead of applying the suffix at the level of the Counter object, this implementation applies the change in the Registry object. The disadvantage is that the code is less elegant. The advantage is that the change is not breaking in any way for existing users: the _total suffix is only enforced by OpenMetrics registries, not Prometheus registries.

In the future, when OpenMetrics becomes more widely adopted, the behaviour can be moved inside Counter object and made mandatory.
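The registry-level suffix handling can be sketched like this. renderCounter is a hypothetical helper, not the PR's code, and stripping an already present _total from the HELP and TYPE lines is an assumption on my part about how such names would be treated in OpenMetrics mode.

```javascript
const TOTAL_SUFFIX = '_total';

// Hypothetical renderer for a single unlabelled counter.
function renderCounter({ name, help, value }, openMetrics) {
  if (!openMetrics) {
    // Prometheus format: the name is emitted exactly as registered.
    return [
      `# HELP ${name} ${help}`,
      `# TYPE ${name} counter`,
      `${name} ${value}`,
    ].join('\n');
  }
  // OpenMetrics: metadata uses the base name, the sample carries _total.
  const base = name.endsWith(TOTAL_SUFFIX)
    ? name.slice(0, -TOTAL_SUFFIX.length)
    : name;
  return [
    `# HELP ${base} ${help}`,
    `# TYPE ${base} counter`,
    `${base}${TOTAL_SUFFIX} ${value}`,
  ].join('\n');
}

console.log(renderCounter({ name: 'mycounter', help: 'help', value: 0 }, false));
console.log(renderCounter({ name: 'mycounter', help: 'help', value: 0 }, true));
```

Because the suffix is applied only at render time, the Counter object keeps its registered name and a Prometheus registry produces the same output as before.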


Benchmarks

Benchmark tests were not changed. They use the default registry type (Prometheus) and no exemplars, so they check the impact on current users of the library. I ran 4 tests (results in the gists below). The highest impact was on the registry benchmark, with a ~10-15% performance hit.

⚠ registry ➭ getMetricsAsJSON#1 with 64 is 5.345% acceptably slower.
⚠ registry ➭ getMetricsAsJSON#2 with 8 is 3.076% acceptably slower.
⚠ registry ➭ getMetricsAsJSON#2 with 4 and 2 with 2 is 4.063% acceptably slower.
✓ registry ➭ getMetricsAsJSON#2 with 2 and 2 with 4 is 0.1468% faster.
⚠ registry ➭ getMetricsAsJSON#6 with 2 is 4.137% acceptably slower.
✗ registry ➭ metrics#1 with 64 is 11.50% slower.
✓ registry ➭ metrics#2 with 8 is 0.2174% faster.
⚠ registry ➭ metrics#2 with 4 and 2 with 2 is 1.046% acceptably slower.
⚠ registry ➭ metrics#2 with 2 and 2 with 4 is 3.413% acceptably slower.
✗ registry ➭ metrics#6 with 2 is 15.03% slower.
⚠ histogram ➭ observe#1 with 64 is 0.2735% acceptably slower.
⚠ histogram ➭ observe#2 with 8 is 0.5325% acceptably slower.
⚠ histogram ➭ observe#2 with 4 and 2 with 2 is 0.09087% acceptably slower.
⚠ histogram ➭ observe#2 with 2 and 2 with 4 is 0.3243% acceptably slower.
⚠ histogram ➭ observe#6 with 2 is 0.6302% acceptably slower.
✓ gauge ➭ inc is 16.96% faster.
⚠ gauge ➭ inc with labels is 1.991% acceptably slower.
⚠ summary ➭ observe#1 with 64 is 2.261% acceptably slower.
✓ summary ➭ observe#2 with 8 is 2.166% faster.
✓ summary ➭ observe#2 with 4 and 2 with 2 is 0.4552% faster.
⚠ summary ➭ observe#2 with 2 and 2 with 4 is 1.407% acceptably slower.
⚠ summary ➭ observe#6 with 2 is 1.116% acceptably slower.

https://gist.github.com/voltbit/55bfdafccb5a0458d0b2aff9703dae43
https://gist.github.com/voltbit/1e1097e6400638334e11da52fefcd5d4
https://gist.github.com/voltbit/41828df848a7132c1aad196414ea2d69
https://gist.github.com/voltbit/70539929453a2b95d4f9ac2df6707a9b

TODO

  • [x] Complete test coverage of the new features
  • [x] Performance tests
  • [x] Add better examples and readme info
  • [x] ~~Strategy for registry merge when there are different registry formats~~
    • decided to consider the merge of two different types of registries undefined behaviour, the users should always use the same type - Prometheus or OpenMetrics - if merging
  • [x] Handling the _total suffix on counters
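Since merging registries of different formats is treated as undefined behaviour, callers may want a guard of their own before merging. The sketch below is one possible caller-side check; assertSameFormat is a hypothetical helper, and registries are modelled as plain objects with a contentType field rather than real registry instances.

```javascript
// Hypothetical caller-side guard; not part of the PR.
function assertSameFormat(registries) {
  const formats = new Set(registries.map((registry) => registry.contentType));
  if (formats.size > 1) {
    throw new Error(
      `Refusing to merge registries with mixed formats: ${[...formats].join(' | ')}`,
    );
  }
}

// Plain objects standing in for registries in this example.
const promLike = { contentType: 'prometheus' };
const omLike = { contentType: 'openmetrics' };

assertSameFormat([promLike, { ...promLike }]); // same format: passes silently
try {
  assertSameFormat([promLike, omLike]);
} catch (error) {
  console.log(error.message);
}
```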

Not implemented

  • Support for the _created suffix on any metrics

voltbit avatar Nov 26 '21 08:11 voltbit

Very exciting, thanks for working on this!

SimenB avatar Nov 26 '21 10:11 SimenB

Hi @zbjornson could you please trigger the tests again? The PR should be ready for review :).

voltbit avatar Jan 13 '22 11:01 voltbit

CI is green! I'll try to review this this weekend and hopefully @SimenB and/or @siimon can also review soon.

zbjornson avatar Jan 13 '22 16:01 zbjornson

thanks @voltbit & @zbjornson! we are excited about this feature!

shyimo avatar Jan 23 '22 08:01 shyimo

Hi @zbjornson. Any news on that?

shyimo avatar Feb 23 '22 10:02 shyimo

Hi guys. Are there, by chance, any updates on how this is progressing and an estimate, hopefully?

dnutels avatar Apr 24 '22 08:04 dnutels

> Hi guys. Are there, by chance, any updates on how this is progressing and an estimate, hopefully?

Hi @dnutels, I will start working on this again in a couple of weeks. I will implement the requested changes around mid-May, but I can't work on it earlier.

voltbit avatar Apr 24 '22 12:04 voltbit

New runs for benchmarks with the latest changes.


Summary:

✗ registry ➭ getMetricsAsJSON#1 with 64 is 11.43% slower.
✗ registry ➭ getMetricsAsJSON#2 with 8 is 155.0% slower.
✓ registry ➭ getMetricsAsJSON#2 with 4 and 2 with 2 is 58.48% faster.
✗ registry ➭ getMetricsAsJSON#2 with 2 and 2 with 4 is 23.80% slower.
⚠ registry ➭ getMetricsAsJSON#6 with 2 is 6.297% acceptably slower.
⚠ registry ➭ metrics#1 with 64 is 5.984% acceptably slower.
⚠ registry ➭ metrics#2 with 8 is 1.545% acceptably slower.
⚠ registry ➭ metrics#2 with 4 and 2 with 2 is 0.09495% acceptably slower.
✓ registry ➭ metrics#2 with 2 and 2 with 4 is 35.38% faster.
⚠ registry ➭ metrics#6 with 2 is 3.828% acceptably slower.
✓ histogram ➭ observe#1 with 64 is 0.2496% faster.
⚠ histogram ➭ observe#2 with 8 is 0.8472% acceptably slower.
⚠ histogram ➭ observe#2 with 4 and 2 with 2 is 0.7701% acceptably slower.
✓ histogram ➭ observe#2 with 2 and 2 with 4 is 3.590% faster.
⚠ histogram ➭ observe#6 with 2 is 3.784% acceptably slower.
✓ gauge ➭ inc is 2.087% faster.
✓ gauge ➭ inc with labels is 0.2869% faster.
✓ summary ➭ observe#1 with 64 is 3.924% faster.
✓ summary ➭ observe#2 with 8 is 0.2697% faster.
✓ summary ➭ observe#2 with 4 and 2 with 2 is 1.130% faster.
✓ summary ➭ observe#2 with 2 and 2 with 4 is 0.4797% faster.
✓ summary ➭ observe#6 with 2 is 1.389% faster.

Summary:

✓ registry ➭ getMetricsAsJSON#1 with 64 is 1.990% faster.
✓ registry ➭ getMetricsAsJSON#2 with 8 is 321.0% faster.
⚠ registry ➭ getMetricsAsJSON#2 with 4 and 2 with 2 is 8.586% acceptably slower.
✓ registry ➭ getMetricsAsJSON#2 with 2 and 2 with 4 is 13.94% faster.
✗ registry ➭ getMetricsAsJSON#6 with 2 is 16.18% slower.
✓ registry ➭ metrics#1 with 64 is 20.45% faster.
⚠ registry ➭ metrics#2 with 8 is 0.8526% acceptably slower.
✓ registry ➭ metrics#2 with 4 and 2 with 2 is 5.139% faster.
⚠ registry ➭ metrics#2 with 2 and 2 with 4 is 6.645% acceptably slower.
✓ registry ➭ metrics#6 with 2 is 0.7693% faster.
⚠ histogram ➭ observe#1 with 64 is 3.102% acceptably slower.
✓ histogram ➭ observe#2 with 8 is 1.538% faster.
✓ histogram ➭ observe#2 with 4 and 2 with 2 is 0.4875% faster.
✓ histogram ➭ observe#2 with 2 and 2 with 4 is 0.7254% faster.
✓ histogram ➭ observe#6 with 2 is 2.165% faster.
✓ gauge ➭ inc is 22.42% faster.
⚠ gauge ➭ inc with labels is 1.116% acceptably slower.
✓ summary ➭ observe#1 with 64 is 2.565% faster.
⚠ summary ➭ observe#2 with 8 is 0.2203% acceptably slower.
⚠ summary ➭ observe#2 with 4 and 2 with 2 is 0.08091% acceptably slower.
✓ summary ➭ observe#2 with 2 and 2 with 4 is 1.113% faster.
✓ summary ➭ observe#6 with 2 is 1.459% faster.

Summary:

✗ registry ➭ getMetricsAsJSON#1 with 64 is 12.81% slower.
✗ registry ➭ getMetricsAsJSON#2 with 8 is 563.2% slower.
✓ registry ➭ getMetricsAsJSON#2 with 4 and 2 with 2 is 7.191% faster.
⚠ registry ➭ getMetricsAsJSON#2 with 2 and 2 with 4 is 8.622% acceptably slower.
✗ registry ➭ getMetricsAsJSON#6 with 2 is 13.17% slower.
✗ registry ➭ metrics#1 with 64 is 25.55% slower.
⚠ registry ➭ metrics#2 with 8 is 2.849% acceptably slower.
⚠ registry ➭ metrics#2 with 4 and 2 with 2 is 3.033% acceptably slower.
⚠ registry ➭ metrics#2 with 2 and 2 with 4 is 2.256% acceptably slower.
⚠ registry ➭ metrics#6 with 2 is 4.078% acceptably slower.
⚠ histogram ➭ observe#1 with 64 is 1.686% acceptably slower.
✓ histogram ➭ observe#2 with 8 is 0.3382% faster.
⚠ histogram ➭ observe#2 with 4 and 2 with 2 is 0.2078% acceptably slower.
⚠ histogram ➭ observe#2 with 2 and 2 with 4 is 6.005% acceptably slower.
⚠ histogram ➭ observe#6 with 2 is 2.634% acceptably slower.
✓ gauge ➭ inc is 12.39% faster.
✓ gauge ➭ inc with labels is 2.485% faster.
✓ summary ➭ observe#1 with 64 is 9374% faster.
✓ summary ➭ observe#2 with 8 is 2.167% faster.
✗ summary ➭ observe#2 with 4 and 2 with 2 is 19.97% slower.
⚠ summary ➭ observe#2 with 2 and 2 with 4 is 1.552% acceptably slower.
⚠ summary ➭ observe#6 with 2 is 0.3165% acceptably slower.

voltbit avatar May 18 '22 09:05 voltbit

Hi @zbjornson. Any news regarding the PR?

shyimo avatar Jun 07 '22 07:06 shyimo

Bumping for interest

skyf0xx avatar Jun 22 '22 03:06 skyf0xx

Bumping for interest again :)

vjsamuel avatar Jul 19 '22 20:07 vjsamuel

Bumping for interest as well, this would save my organization so much pain!

isaac-elvt avatar Aug 15 '22 22:08 isaac-elvt

bump for interest. I really need this

xal3xhx avatar Oct 01 '22 17:10 xal3xhx

bump for interest. really important task

shyimo avatar Oct 12 '22 11:10 shyimo

bumping for interest!

ejba avatar Oct 27 '22 10:10 ejba

@zbjornson @siimon PTAL 🙂

SimenB avatar Oct 27 '22 10:10 SimenB

Thanks for all the feedback and all the interest shown! I'll rebase and check the comments as soon as possible (this weekend most likely).

voltbit avatar Oct 27 '22 11:10 voltbit

@voltbit is there anything we can do to help you?

ejba avatar Oct 31 '22 18:10 ejba

Hi @shyimo, @ejba, and other contributors. I am not sure I will get the time to work on this again before the holiday season. If this work is urgent for you and you want to pick up the task, feel free to do so.

voltbit avatar Nov 14 '22 11:11 voltbit

Hello team, any update? I am waiting for this feature.

vothanhbinhlt avatar Nov 23 '22 04:11 vothanhbinhlt

There was literally an update in the post before yours

SimenB avatar Nov 23 '22 09:11 SimenB

> There was literally an update in the post before yours

I see the PR is in draft status. Has it been merged to the main branch and published in a new version?

vothanhbinhlt avatar Nov 24 '22 03:11 vothanhbinhlt

Superseded by #544

SimenB avatar Mar 06 '23 14:03 SimenB

Released in https://github.com/siimon/prom-client/releases/tag/v15.0.0-0

SimenB avatar Mar 09 '23 12:03 SimenB