Review OpenMetrics spec

Open cyberbit opened this issue 2 months ago • 1 comments

Raw spec here: https://github.com/prometheus/OpenMetrics/blob/main/specification/OpenMetrics.md. I intend to review it to determine whether a compatibility layer is desirable, whether it can be implemented, and whether it could be targeted as a schema overhaul for a 1.0 release.

Oct 21 '25 19:10 cyberbit

Jotting some thoughts down as I read through this.

Implementers MUST expose metrics in the OpenMetrics text format in response to a simple HTTP GET request to a documented URL for a given process or device. This endpoint SHOULD be called "/metrics". Implementers MAY also expose OpenMetrics formatted metrics in other ways, such as by regularly pushing metric sets to an operator-configured endpoint over HTTP.

The specific language here of MUST and MAY (see RFC 2119) implies that exposing metrics in response to an HTTP GET is an "absolute requirement" (the "/metrics" path itself is only a SHOULD). Serving HTTP isn't possible in CC (ComputerCraft) alone, but if an adapter implemented the optional part of that spec (pushing to a platform endpoint), and the platform exposes a metrics URL, the combination could satisfy this requirement.
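
A rough sketch of that push path, written in Python for illustration rather than as Telem code. The endpoint URL, the use of POST, and the names `render_exposition` and `build_push_request` are assumptions; the spec only says metric sets may be pushed to an operator-configured endpoint over HTTP. Nothing is sent here, the request is only built.

```python
from urllib.request import Request

# Content type for the OpenMetrics text format, version 1.0.0.
OPENMETRICS_CONTENT_TYPE = "application/openmetrics-text; version=1.0.0; charset=utf-8"


def render_exposition(gauges: dict[str, float]) -> str:
    """Render a flat name -> value mapping as gauge families, ending with # EOF."""
    lines = []
    for name, value in sorted(gauges.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    lines.append("# EOF")  # the text format requires this terminator
    return "\n".join(lines) + "\n"


def build_push_request(url: str, gauges: dict[str, float]) -> Request:
    """Build (but do not send) a push request. POST is an assumption."""
    body = render_exposition(gauges).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": OPENMETRICS_CONTENT_TYPE},
    )


# Hypothetical platform endpoint and metric name.
req = build_push_request("http://platform.example/push", {"energy_stored": 1200.0})
print(req.data.decode("utf-8"))
```

The platform on the receiving end would still have to serve the collected metrics over HTTP GET for the whole setup to meet the MUST.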

Metric values in OpenMetrics MUST be either floating points or integers. Note that ingestors of the format MAY only support float64. The non-real values NaN, +Inf and -Inf MUST be supported. NaN MUST NOT be considered a missing value, but it MAY be used to signal a division by zero.

There is an internal representation of NaN in the chart outputs, specifically for missing values (lol). I hadn't considered non-real values to be meaningful as measurements but I think this makes sense. I wonder how implementors are intended to handle missing values, then... (Edit: Later in the spec: "There are valid cases when data stops being present. For example a filesystem can be unmounted and thus its Gauge Metric for free disk space no longer exists. There is no special marker or signal for this situation. Subsequent expositions simply do not include this Metric.")
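
A minimal sketch of how an exposer could keep the two cases apart, assuming missing readings are represented internally as `None` (an assumption for this example, not how the chart outputs work today). Non-real values are written out as `NaN`, `+Inf` and `-Inf`, while a missing reading is simply left out of the exposition. The function and metric names are made up.

```python
import math
from typing import Optional


def format_value(value: float) -> str:
    """Format a sample value, spelling out the non-real values by name."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "+Inf" if value > 0 else "-Inf"
    return repr(float(value))


def render_gauges(readings: dict[str, Optional[float]]) -> list[str]:
    """Return sample lines, skipping metrics whose reading is missing (None)."""
    lines = []
    for name, value in readings.items():
        if value is None:
            continue  # missing: leave the metric out of this exposition
        lines.append(f"{name} {format_value(value)}")
    return lines


print(render_gauges({
    "disk_free": None,          # e.g. filesystem unmounted
    "ratio": float("nan"),      # e.g. division by zero
    "limit": float("inf"),
    "temp": 21.5,
}))
```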

Timestamps MUST be Unix Epoch in seconds. Negative timestamps MAY be used.

I have so many questions.

The name of a MetricFamily MUST NOT result in a potential clash for sample metric names as per the ABNF with another MetricFamily in the Text Format within a MetricSet. An example would be a gauge called "foo_created" as a counter called "foo" could create a "foo_created" in the text format.

This is a potential issue with middleware or custom inputs that add metrics. Should there be proactive checks in place to detect these issues? The language implies that this state should be an error, unless the router adjusts these to something like "foo_created_2" (yuck), or silently drops them (double yuck).
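
One possible shape for a proactive check, sketched in Python. The suffix table reflects my reading of the sample names each metric type can produce in the text format and should be verified against the spec; `find_clashes` is a hypothetical helper, not an existing Telem function.

```python
from collections import defaultdict

# Sample-name suffixes each family type may emit ("" means the bare family name).
SUFFIXES = {
    "unknown": [""],
    "gauge": [""],
    "stateset": [""],
    "counter": ["_total", "_created"],
    "info": ["_info"],
    "summary": ["", "_count", "_sum", "_created"],
    "histogram": ["_bucket", "_count", "_sum", "_created"],
    "gaugehistogram": ["_bucket", "_gcount", "_gsum"],
}


def find_clashes(families: dict[str, str]) -> dict[str, list[str]]:
    """Map each contested sample name to the families that could emit it.

    `families` maps a MetricFamily name to its type.
    """
    owners = defaultdict(set)
    for family, kind in families.items():
        for suffix in SUFFIXES[kind]:
            owners[family + suffix].add(family)
    return {name: sorted(fams) for name, fams in owners.items() if len(fams) > 1}


# The example from the spec: a gauge "foo_created" next to a counter "foo".
print(find_clashes({"foo": "counter", "foo_created": "gauge"}))
```

A router could run a check like this whenever middleware or a custom input registers a new metric and raise an error instead of renaming or dropping anything.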

Info metrics are used to expose textual information which SHOULD NOT change during process lifetime. Common examples are an application's version, revision control commit, and the version of a compiler.

One of the foundational principles of Telem was that metrics are numbers, not text. At the time of writing this, my understanding of OpenMetrics as a whole is limited, but if Info Metrics are distinguishable from normal metrics in a meaningful way, I will consider supporting them. IMO, supporting labels gets most of the way there already.

...later...

The Sample MetricName for the value of a MetricPoint for a MetricFamily of type Info MUST have the suffix "_info". The Sample value MUST always be 1.

Ah, so Info metrics always have the value 1, and the text lives in labels that aren't expected to change. I can live with that.
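
For illustration, a small Python sketch of what that looks like in the text format. The family name `telem_build` and the label values are made up, and `render_info` is a hypothetical helper.

```python
def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes and newlines inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_info(family: str, labels: dict[str, str]) -> list[str]:
    """Render an Info family: the text goes in labels, the value is always 1."""
    label_text = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return [
        f"# TYPE {family} info",
        f"{family}_info{{{label_text}}} 1",
    ]


for line in render_info("telem_build", {"version": "0.9.0", "commit": "abc123"}):
    print(line)
```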

Exposers should leave any math or calculation up to ingestors. (...) As an example, you should not expose a gauge with the average rate of increase of a counter over the last 5 minutes. Letting the ingestor calculate the increase over the data points they have consumed across expositions has better mathematical properties and is more resilient to scrape failures.

This makes a lot of sense, but also means middleware is in a tough place. From an observability perspective, Telem is both an exposer (SecureModem, Grafana) and an ingestor (graphical outputs). In the very long term, I intend to split the exposer and ingestor roles into separate tools. When this is being built out, middleware would likely move to the ingestor side, as aggregates like average, delta, etc. are primarily useful from an analytic/graphical perspective.
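
A sketch of why the ingestor side is the better home for these aggregates, in Python with hypothetical names. The exposer publishes only the raw counter; the ingestor derives the increase and average rate from whatever samples it managed to collect, so a missed scrape widens the gap between two samples but does not lose any of the increase. Treating a drop in value as a counter restart is an assumption of this sketch.

```python
from typing import Optional

Sample = tuple[float, float]  # (unix timestamp in seconds, counter value)


def increase(samples: list[Sample]) -> float:
    """Total increase of a counter across samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted; count the new value from zero.
        total += cur - prev if cur >= prev else cur
    return total


def average_rate(samples: list[Sample]) -> Optional[float]:
    """Average per-second rate over the sampled window, or None if undefined."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase(samples) / elapsed


# Scrapes at 0 s and 60 s, a failed scrape at 120 s, then one at 180 s.
scrapes = [(0.0, 10.0), (60.0, 40.0), (180.0, 100.0)]
print(average_rate(scrapes))
```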

Nov 04 '25 02:11 cyberbit