opentelemetry-specification

Use short_name to support instrumentation scope in Prometheus

dashpole opened this issue 2 years ago

Related: https://github.com/open-telemetry/opentelemetry-specification/issues/1906

This is a second attempt at https://github.com/open-telemetry/opentelemetry-specification/pull/2422.

Depends on https://github.com/open-telemetry/opentelemetry-specification/pull/2702

Changes

  1. Prevent Prometheus naming collisions between instrumentation libraries by adding the short_name as a prefix to Prometheus metric names.
  2. Preserve the Instrumentation Scope when exporting Prometheus metrics by adding a <short_name>_otel_scope_info Prometheus metric.
  3. Fully round-trip the Instrumentation Scope through Prometheus exporters to the collector by performing the inverse of 1 and 2 in the Prometheus receiver.

If needed, I can trim this PR back to only the first commit. That would accomplish the primary goal of adding a tool to prevent collisions between instrumentation scopes, even if the rest of the information is dropped.

@open-telemetry/wg-prometheus @jmacd @tigrannajaryan @jsuereth

dashpole avatar Jul 29 '22 13:07 dashpole

A comment from the Prometheus WG: we should consider splitting build_info out of target_info. There is a question about consistency across languages, and about how those conventions are structured and how they overlap with OTel's conventions.

That is orthogonal to this PR, though.

dashpole avatar Aug 03 '22 15:08 dashpole

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Aug 18 '22 03:08 github-actions[bot]

The two ideas from the spec SIG today were:

  1. Use the last word of the instrumentation scope name (and also, maybe allow overriding in the exporter w/ a mapping)
  2. Change guidance for the scope name to be shorter and readable. Deal with scope uniqueness separately.

The preference at the meeting seemed to be for the first option, so I'm going to update this PR to propose that.

dashpole avatar Aug 30 '22 16:08 dashpole

I've updated the description and proposal to use the last word of the instrumentation scope name.

dashpole avatar Aug 31 '22 19:08 dashpole

  1. Use the last word of the instrumentation scope name (and also, maybe allow overriding in the exporter w/ a mapping)

~~I like this. If we allow overriding it in the exporter, we have a complete solution. Name clashes are going to be rare, and for that rare case requiring an override in the exporter is fine. It can even be done via an env variable specific to the Prometheus exporter so that no code changes are necessary. We may want to have some diagnostics to detect the clashes and show them in the logs or otherwise surface the problem.~~

Edit: based on examples @arminru and @joaopgrassi posted in the comment thread below, I am not so sure anymore that name clashes will be rare. So, maybe it is not such a good idea after all.

tigrannajaryan avatar Aug 31 '22 21:08 tigrannajaryan

Depending on the language, it may fail to create the Prometheus exporter, or may fail to send some or all metrics.

@dashpole I have realized that I have no idea why this is a problem, i.e., why it may fail to create the exporter or fail to send the metrics. Can you please clarify this?

tigrannajaryan avatar Sep 13 '22 15:09 tigrannajaryan

A prometheus exporter which receives both of these metrics would not be able to serve both of those histograms.

Is this a problem only for histograms or other metric types as well?

tigrannajaryan avatar Sep 13 '22 17:09 tigrannajaryan

I have realized that I have no idea why this is a problem, why it may fail to create exporter or fail to send the metrics. Can you please clarify this?

@tigrannajaryan Prometheus itself will do some validation on registration (golang implementation), and will do additional validation at collection time to ensure uniqueness of metric points. I'm mostly familiar with the go implementation, but I suspect other languages do similar validation. OpenMetrics requires a unique labelset per-metric-name, and a unique name within metrics on an endpoint. Prometheus has similar requirements for name uniqueness and label-set uniqueness.

Is this a problem only for histograms or other metric types as well?

It is a problem for all metric types.

dashpole avatar Sep 13 '22 19:09 dashpole
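
To make the registration-time validation described above concrete, here is a minimal Go sketch using client_golang; the metric and label names are illustrative, not taken from the PR. Two libraries registering a metric with the same name and label dimensions collide in a single registry.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	newHist := func() *prometheus.HistogramVec {
		return prometheus.NewHistogramVec(prometheus.HistogramOpts{
			Name: "http_client_duration",
			Help: "Duration of HTTP client requests.",
		}, []string{"http_url"})
	}

	// The first instrumentation library registers its histogram.
	if err := reg.Register(newHist()); err != nil {
		fmt.Println("first register:", err)
	}
	// A second library registering the same name, help, and label dimensions
	// is rejected with an AlreadyRegisteredError.
	if err := reg.Register(newHist()); err != nil {
		fmt.Println("second register:", err)
	}
}
```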

I've updated the proposal to change the naming guidance for metrics instead of adding any conversion logic. I've also removed the opentelemetry_scope_info metric from this PR, and will propose that separately.

dashpole avatar Sep 13 '22 20:09 dashpole

if I have two http client libraries in my application, they would each produce a metric named http.client.duration, but with different meters (e.g. otelmux vs otelhttp). A prometheus exporter which receives both of these metrics would not be able to serve both of those histograms.

@dashpole I don't understand this. What is the problem here? Two different libraries produce datapoints for the metric with the same name (http.client.duration). Why does it matter that the datapoints are produced by different Meters? Can't the exporter ignore the fact that they are coming from different Meters?

For example, datapoints produced by 2 libraries:

| Library  | Metric name          | Dimensions    | Value |
|----------|----------------------|---------------|-------|
| otelmux  | http.client.duration | http.url=/foo | 1     |
| otelmux  | http.client.duration | http.url=/bar | 2     |
| otelhttp | http.client.duration | http.url=/baz | 3     |

Is it impossible (or maybe undesirable?) for the Prometheus exporter in the OTel SDK to turn these datapoints into the following?

http_client_duration{http_url="/foo"} 1
http_client_duration{http_url="/bar"} 2
http_client_duration{http_url="/baz"} 3

I am sure I am missing something, I just want to understand the problem fully :-)

tigrannajaryan avatar Sep 15 '22 14:09 tigrannajaryan

@tigrannajaryan It would be problematic if they had the same labels. E.g.

http_client_duration_sum{http_url="/foo"} 1
http_client_duration_sum{http_url="/bar"} 2
http_client_duration_sum{http_url="/bar"} 3

Would be invalid.

I also realized after the spec meeting that I was wrong about Prometheus disallowing different label keys on a single metric. They initially implemented that requirement in Prometheus, but then reversed course and removed it after OpenMetrics decided not to include it: https://github.com/prometheus/client_golang/pull/417

That means the "infeasible alternative" above (adding the namespace as a label to all Prometheus metrics instead of a prefix) is actually a viable option, and one that would be less imposing on the instrumentation ecosystem (per @ahayworth's feedback). The only downside of that approach would be handling potential conflicting descriptions/units from collisions, but that seems like an acceptable tradeoff.

Sorry for the frequent changes. I'm hopeful this will be the last round of significant changes.

dashpole avatar Sep 15 '22 15:09 dashpole
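
A minimal Go sketch of the label approach, assuming client_golang and using the placeholder label name "scope" (the spec may settle on a different label name): the same metric name registers cleanly from two scopes once they are distinguished by a constant label, and both series end up in one MetricFamily.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// Same metric name from two instrumentation scopes, distinguished only by
	// a constant "scope" label (label name is illustrative).
	for _, scope := range []string{"otelmux", "otelhttp"} {
		h := prometheus.NewHistogramVec(prometheus.HistogramOpts{
			Name:        "http_client_duration",
			Help:        "Duration of HTTP client requests.",
			ConstLabels: prometheus.Labels{"scope": scope},
		}, []string{"http_url"})
		if err := reg.Register(h); err != nil {
			fmt.Println("register:", err)
			continue
		}
		h.WithLabelValues("/bar").Observe(2)
	}

	// Both series are gathered into a single MetricFamily, separated by the
	// scope label, with no uniqueness violation.
	mfs, err := reg.Gather()
	fmt.Println(len(mfs), err) // 1 <nil>
}
```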

It would be problematic if they had the same labels

Only if they have the exact same timestamp, right? If the timestamps are different then that's OK, that's just 2 different datapoints. I am not sure what the chances of this happening are; I guess it depends on timestamp precision and on whether the 2 libraries happen to be installed in such a way that they intercept the same http call in this example. If this does happen, then perhaps that's an acceptable situation (i.e. the first to record the datapoint wins)? I assume this won't result in crashes or anything; the duplicates will simply be ignored, which I think is an acceptable limitation.

That means the "infeasible alternative" above (adding the namespace as a label to all Prometheus metrics instead of a prefix) is actually a viable option, and one that would be less imposing on the instrumentation ecosystem

Yes. So, essentially the output would look like this:

http_client_duration_sum{http_url="/foo" scope=otelmux} 1
http_client_duration_sum{http_url="/bar" scope=otelmux} 2
http_client_duration_sum{http_url="/bar" scope=otelhttp} 3

We can also use the full Scope name here, I don't think there is a need to shorten it.

The only downside of that approach would be handling potential conflicting descriptions/units from collisions, but that seems like an acceptable tradeoff.

This should not happen if libraries follow the semantic conventions.

One more thing that may happen is that libraries make different choices for what labelsets to use (if semantic convention allows for variation). This should be fine as well, right? If I understand it correctly in Prometheus it is valid to do this:

http_client_duration_sum{http_url="/foo" http_method=put scope=otelmux} 1
http_client_duration_sum{http_url="/bar" http_method=get scope=otelmux} 2
http_client_duration_sum{http_url="/bar" scope=otelhttp} 3

tigrannajaryan avatar Sep 15 '22 16:09 tigrannajaryan

Only if they have the exact same timestamp, right?

No. Prometheus/OpenMetrics metric points are required to have a unique labelset within the MetricFamily (i.e. metric name + label set must be unique). Prometheus also recommends against adding explicit timestamps unless serving (very) old metrics. This is avoided by the introduction of an additional label.

We can also use the full Scope name here, I don't think there is a need to shorten it.

Correct.

One more thing that may happen is that libraries make different choices for what labelsets to use (if semantic convention allows for variation). This should be fine as well, right?

Yes. Metrics SHOULD (not MUST) have the same label keys: Metrics with the same name for a given MetricFamily SHOULD have the same set of label names in their LabelSet. I think this approach abides by that statement.

dashpole avatar Sep 15 '22 16:09 dashpole
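
As a rough illustration of the collection-time uniqueness check mentioned here (a sketch under the assumption that a hypothetical custom collector emits the duplicates; the names are made up): client_golang rejects two samples that share a metric name and label values at Gather time.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// dupCollector is a hypothetical collector that emits two samples with the
// same metric name and identical label values, violating the uniqueness
// requirement discussed above.
type dupCollector struct{ desc *prometheus.Desc }

func (c *dupCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *dupCollector) Collect(ch chan<- prometheus.Metric) {
	ch <- prometheus.MustNewConstMetric(c.desc, prometheus.UntypedValue, 2, "/bar")
	ch <- prometheus.MustNewConstMetric(c.desc, prometheus.UntypedValue, 3, "/bar")
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&dupCollector{
		desc: prometheus.NewDesc("http_client_duration_sum", "example", []string{"http_url"}, nil),
	})

	// Gather performs collection-time validation and reports the duplicate
	// name + label-value combination as an error.
	if _, err := reg.Gather(); err != nil {
		fmt.Println("gather:", err)
	}
}
```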

We can also use the full Scope name here, I don't think there is a need to shorten it.

Correct.

One more thing that may happen is that libraries make different choices for what labelsets to use (if semantic convention allows for variation). This should be fine as well, right?

Yes. Metrics SHOULD (not MUST) have the same label keys: Metrics with the same name for a given MetricFamily SHOULD have the same set of label names in their LabelSet. I think this approach abides by that statement.

So, if this is the case then we don't need to do any prefixing and don't need a short name or similar, right? We add the scope name as a label, we require all libraries to properly honour the semantic conventions such that the description, units, and data point types are the same regardless of which library emits them, and this should be exportable to Prometheus.

Does this solve the problem this PR tries to address, or is there more to it?

I wonder if there are any downsides to adding the scope name as a label? Supposedly it is a low-cardinality field, so it should be fine. Users may not care to see query results broken down by libraries, so they may have to always aggregate to eliminate the scope name in their queries. I don't know how easy this is to do in PromQL; it may or may not be a problem.

tigrannajaryan avatar Sep 15 '22 18:09 tigrannajaryan

Does this solve the problem this PR tries to address, or is there more to it?

This solves it. Nothing more to it.

I wonder if there are any downsides to adding the scope name as a label? Supposedly it is a low-cardinality field, so it should be fine. Users may not care to see query results broken down by libraries, so they may have to always aggregate to eliminate the scope name in their queries. I don't know how easy this is to do in PromQL; it may or may not be a problem.

Aggregating away labels is easy to do in PromQL. For example, sum by (job, handler) (http_requests_total) would aggregate away all labels other than job and handler.

As for cardinality, it should maintain the overall cardinality of the OTel metrics, since those are already separated by scope.

Downsides I can think of are:

  • Increased size of the (text) scrape result means higher memory usage for both the application and scraper.
  • Users likely won't often want to break down metrics by instrumentation scope. The extra labels may be seen as a nuisance.
  • Metric descriptions can still (theoretically) conflict.
  • We are intentionally not following Prometheus/OpenMetrics guidance around metric naming: Metric names SHOULD indicate what piece of code they come from.

dashpole avatar Sep 15 '22 18:09 dashpole

  • We are intentionally not following Prometheus/OpenMetrics guidance around metric naming: Metric names SHOULD indicate what piece of code they come from.

This is probably off topic, but I am not sure this is a good recommendation. The ability to aggregate http request time across libraries seems very useful to me. If I want to know the response time of a particular destination, I do not necessarily care which source is making the request (what application or what library). Recommending that metrics be different for each source makes aggregation impossible or difficult.

This guidance also seems to contradict what Prometheus recommends about what to make a label:

as a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics. For example, having the capacity of various queues in one metric is good

This seems to be one of the cases when aggregation is meaningful.

tigrannajaryan avatar Sep 15 '22 20:09 tigrannajaryan

This is probably off topic, but I am not sure this is a good recommendation. The ability to aggregate http request time across libraries seems very useful to me.

Prometheus doesn't have semantic conventions to align instrumentation in that way. Prometheus assumes that users would often name different things the same. I think we would be making a reasonable decision to not add prefixes, given our conventions. I agree with the utility of being able to aggregate across sources.

The Prometheus guidance on the total being meaningful just tells you when you should split a metric, not when you should combine metrics. It is intended to prevent someone from adding, for example, a memory metric with user, system, and total dimensions, as the sum across labels wouldn't be meaningful. From the OM spec: There are situations in which both using multiple Metrics within a MetricFamily or multiple MetricFamilies seem to make sense. It then gives an example of success vs failure being a case where separate metrics are better than a single one with labels.

dashpole avatar Sep 15 '22 20:09 dashpole

Hi @dashpole Given the latest (great) discussions and the last update you did, I think a rename of the PR title is in order? Right now it's not aligned with the chosen approach.

joaopgrassi avatar Sep 20 '22 08:09 joaopgrassi

Looks good! Last thing: Not sure if we need a changelog entry for this case?

I think we do.

tigrannajaryan avatar Sep 22 '22 13:09 tigrannajaryan

Added a changelog

dashpole avatar Sep 22 '22 14:09 dashpole

@dashpole I anticipate users might react to the new metric attributes in their Prometheus data and come to request an option to disable these attributes, in exchange for the risk of exporter errors caused by duplicates. Do you see any potential problems with offering such an option?

jmacd avatar Sep 26 '22 18:09 jmacd

@jmacd I don't see a problem with that, and think it is a good idea. I added that to this change.

dashpole avatar Sep 26 '22 20:09 dashpole

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Oct 07 '22 03:10 github-actions[bot]

@ahayworth Hey, could you take a look again? You had requested changes on this PR, but it has changed since then ;)

carlosalberto avatar Oct 11 '22 14:10 carlosalberto

I believe @ahayworth's feedback, which was that we should not recommend adding a prefix to all metrics, has been addressed.

I'm happy with the current state, and would like to see this merged unless others have feedback.

dashpole avatar Oct 11 '22 15:10 dashpole

@open-telemetry/specs-metrics-approvers @open-telemetry/technical-committee I will merge this tomorrow unless I see objections.

tigrannajaryan avatar Oct 11 '22 18:10 tigrannajaryan

@carlosalberto Thank you for poking me: I agree with @dashpole that my feedback has indeed been addressed. ❤️

ahayworth avatar Oct 11 '22 19:10 ahayworth

Changelog updated. No new changes for several days. No objections from anyone. Merging.

tigrannajaryan avatar Oct 12 '22 16:10 tigrannajaryan