opentelemetry-specification icon indicating copy to clipboard operation
opentelemetry-specification copied to clipboard

Stabilize Overflow attribute section under Cardinality Limits

Open utpilla opened this issue 11 months ago • 35 comments

What are you trying to achieve?

Mark the Overflow attribute section under Cardinality limits as Stable.

The spec was updated with the overflow attribute and cardinality limits section in #2960.

It looks like we already have more than three languages that have added support for this:

  • Go https://github.com/open-telemetry/opentelemetry-go/pull/4457
  • C# https://github.com/open-telemetry/opentelemetry-dotnet/pull/4737
  • Java https://github.com/open-telemetry/opentelemetry-java/pull/5560
  • Rust https://github.com/open-telemetry/opentelemetry-rust/pull/1066

utpilla avatar Feb 26 '24 22:02 utpilla

https://github.com/open-telemetry/opentelemetry-specification/issues/3798 This seems like a pre-req.

cijothomas avatar Feb 26 '24 22:02 cijothomas

Rust implementation is pretty basic - just enforced a hardcoded 2000 limit, not configured in anyway - neither via Views nor at Provider/Reader level. I hope other implementations are more complete?

cijothomas avatar Feb 26 '24 23:02 cijothomas

Rust implementation is pretty basic - just enforced a hardcoded 2000 limit, not configured in anyway - neither via Views nor at Provider/Reader level. I hope other implementations are more complete?

@cijothomas I believe you're referring to the Configuration section under Cardinality limits. This issue only targets the Overflow attribute section.

I think as long as the SDKs have imposed some cardinality limit (be it hardcoded or configurable), the stability of metric overflow attribute can be tracked independently.

utpilla avatar Feb 26 '24 23:02 utpilla

Rust implementation is pretty basic - just enforced a hardcoded 2000 limit, not configured in anyway - neither via Views nor at Provider/Reader level. I hope other implementations are more complete?

@cijothomas I believe you're referring to the Configuration section under Cardinality limits. This issue only targets the Overflow attribute section.

I think as long as the SDKs have imposed some cardinality limit (be it hardcoded or configurable), the stability of metric overflow attribute can be tracked independently.

Oh this was just for that subsection, got it.

What is the benefit of stabilizing that part alone?

cijothomas avatar Feb 27 '24 00:02 cijothomas

What is the benefit of stabilizing that part alone?

Even as a stand-alone feature, this is very useful. Here are two major benefits of using this feature:

  • It retains the correct gross summaries (such as Sum, Count, etc.) for every metric which would have otherwise been lost when the application exceeds the cardinality limit set by the SDK. This ensures that for any metric, data aggregated across all dimension combinations is always reliable.
  • It helps the user calibrate the cardinality limit to their needs. The presence of this attribute in the exported data is an indication that the application has exceeded the cardinality limit. Users can use this information to fine tune the cardinality limit when configuring the SDK.

utpilla avatar Feb 27 '24 00:02 utpilla

@utpilla I agree with your reasoning -- there can be disagreement over how the behavior is implemented, but the attribute meaning is well defined either way, I think, and I agree that we can stabilize it.

jmacd avatar Feb 27 '24 20:02 jmacd

The TC discussed this issue during the specification issue triage meeting. Here are the two things that need to be done before we mark the issue as triage-accepted:

  1. The wording needs to be updated - @reyang will send a PR. Current wording: The maximum number of distinct, non-overflow attributes is one less than the limit, as a result. Proposal: make it formal (SHOULD/MUST), clarify that this is a lower limit, and the SDK can choose to support more (e.g. one more than the limit)

  2. Provide an end-to-end story (how to use Prom/Grafana) @utpilla I will reach out to you offline.

reyang avatar Feb 28 '24 17:02 reyang

The TC discussed this issue during the specification issue triage meeting. Here are the two things that need to be done before we mark the issue as triage-accepted:

  1. The wording needs to be updated - @reyang will send a PR. Current wording: The maximum number of distinct, non-overflow attributes is one less than the limit, as a result. Proposal: make it formal (SHOULD/MUST), clarify that this is a lower limit, and the SDK can choose to support more (e.g. one more than the limit)

FYI @cijothomas @jack-berg @jmacd @jsuereth - I sent a PR #3912 trying to address 1).

reyang avatar Feb 28 '24 20:02 reyang

the SDK can choose to support more (e.g. one more than the limit)

what does it mean to support more than one limit? thanks

trask avatar Feb 28 '24 22:02 trask

the SDK can choose to support more (e.g. one more than the limit)

what does it mean to support more than one limit? thanks

It is not about more than one limit, it is about "the SDK can offer more room if there is a good reason", for example, the user can set the limit to 10 and the SDK can decide to offer 12 (or whatever number as long as it is greater than or equal to 10).

reyang avatar Feb 28 '24 23:02 reyang

if there is a good reason

do we have a reason in mind already? it may be worth mentioning it in the PR, since there's a downside to not respecting the users requested limit (it could be a confusing user experience if they set the cardinality limit to one value but they see higher cardinality being reported to their backend)

trask avatar Feb 29 '24 00:02 trask

if there is a good reason

do we have a reason in mind already? it may be worth mentioning it in the PR, since there's a downside to not respecting the users requested limit (it could be a confusing user experience if they set the cardinality limit to one value but they see higher cardinality being reported to their backend)

I can share some of the thinking here (the PR description already links to this issue):

  1. SDK cardinality limit is more focusing on preventing indefinite amount of memory and DDOS attack.
  2. In most cases, the user should not expect to use the SDK cardinality limit as the upper limit for their backend. The SDK cardinality limit is a limit which governs the metrics points during a collection cycle, across multiple collection cycles there could be higher cardinality (e.g. in DELTA Aggregation Temporality mode, one collection cycle could give totally different attribute sets than previous collection cycles).
  3. For high perf scenario, certain implementations (e.g. Rust) might want to reserve some metric points for special cases (e.g. when a measurement is being reported without any attribute key-value pair), so the output might have +1 or +2 situation.
  4. For high perf scenario, certain implementations (e.g. C/C++) might want to allocate 2^N metric points for faster lookup / memory manipulation.

And more importantly, removing the "is one less than the limit, as a result" wording makes it easier for the user. Imagine the users just want to report a value without any attribute, they have to understand and set the limit to 2 in order to get the correct result (the aggregated value, without any attribute). If they set the limit to 1, they will always end up getting the overflow point.

reyang avatar Feb 29 '24 03:02 reyang

The TC discussed this issue during the specification issue triage meeting. Here are the two things that need to be done before we mark the issue as triage-accepted:

  1. The wording needs to be updated - @reyang will send a PR. Current wording: The maximum number of distinct, non-overflow attributes is one less than the limit, as a result. Proposal: make it formal (SHOULD/MUST), clarify that this is a lower limit, and the SDK can choose to support more (e.g. one more than the limit)
  2. Provide an end-to-end story (how to use Prom/Grafana) @utpilla I will reach out to you offline.
  1. is done via #3912.
  2. is in progress per my offline discussion with @utpilla.

reyang avatar Mar 01 '24 20:03 reyang

The TC discussed this issue during the specification issue triage meeting. Here are the two things that need to be done before we mark the issue as triage-accepted:

  1. The wording needs to be updated - @reyang will send a PR. Current wording: The maximum number of distinct, non-overflow attributes is one less than the limit, as a result. Proposal: make it formal (SHOULD/MUST), clarify that this is a lower limit, and the SDK can choose to support more (e.g. one more than the limit)
  2. Provide an end-to-end story (how to use Prom/Grafana) @utpilla I will reach out to you offline.

@jack-berg @jmacd @jsuereth @reyang @trask

For the second point, please check this README file. I have put up some text explaining how the overflow attribute plays out with the Prometheus/Grafana workflow and how it adds value for the users. Let me know if you have any questions or suggestions.

utpilla avatar Mar 02 '24 02:03 utpilla

Since this is a stabilization tracking issue, there are a few open issues related to cardinality limits:

  • https://github.com/open-telemetry/opentelemetry-specification/issues/3866
  • https://github.com/open-telemetry/opentelemetry-specification/issues/3813
  • https://github.com/open-telemetry/opentelemetry-specification/issues/3798
  • https://github.com/open-telemetry/opentelemetry-specification/issues/3578

#3578 in particular seems important to resolve before marking this stable.

dashpole avatar Mar 04 '24 14:03 dashpole

I saw the TC notes. Is there any particular feedback related to Prometheus we want?

The current spec has drawbacks for prometheus users, but these likely apply to non-prometheus users as well:

  • Overflow doesn't appear in queries when summing by labels, which is very common
  • ~It isn't possible to make a generic alert for overflow for any metric. You would need to make one alert per-metric.~
  • Overflow metrics will satisfy negative filter statements (e.g. status_code != 200), so it may result in strange error rates when overflow occurs.

@fstab and @gouthamve have provided a helpful feedback in the past, and may know scenarios I haven't thought of, or have suggestions for how to handle cardinality overflow.

dashpole avatar Mar 04 '24 18:03 dashpole

It isn't possible to make a generic alert for overflow for any metric. You would need to make one alert per-metric.

Can you elaborate on this? It seems possible to write a prometheus query that searches across all metric names with a particular label. Is it prohibitively expensive to run this query for alerting purposes? Or something else?

jack-berg avatar Mar 13 '24 15:03 jack-berg

Thanks @jack-berg. You are correct, and I didn't realize that was possible.

dashpole avatar Mar 13 '24 15:03 dashpole

I was assuming this was prohibitively expensive. If not, I think we have a way forward here for stabilzation.

jsuereth avatar Mar 13 '24 15:03 jsuereth

I'm not sure if it would be prohibitively expensive or not.

dashpole avatar Mar 13 '24 16:03 dashpole

Sorry I have not studied the complete back-story, but an idea came to me:

In this text:

An overflow attribute set is defined, containing a single attribute otel.metric.overflow having (boolean) value true, which is used to report a synthetic aggregation of the Measurements that could not be independently aggregated because of the limit.

if the word "single" is deleted, the other attributes could be replaced with an overflow value and Prometheus queries/joins/etc would now work.

In other words, instead of:

mycounter{a="hello" b="world"} 2
mycounter{a="bar" b="foo"} 3
mycounter{a="" b="foo"} 3
mycounter{otel.metric.overflow="true"} 100

Prometheus would see:

mycounter{a="hello" b="world"} 2
mycounter{a="bar" b="foo"} 3
mycounter{a="" b="foo"} 3
mycounter{otel.metric.overflow="true" a="overflow", b="overflow"} 100

bboreham avatar Mar 14 '24 08:03 bboreham

Yup, querying across all metric names can be tricky/impossible/expensive for some backends, but Prometheus should be generally fine if queried correctly (e.g. just {otel.metric.overflow="true"}).

However, what @bboreham figured out would be small but amazing modification. It should mitigate the big challenges with the overflow logic @dashpole mentioned:

  • Overflow doesn't appear in queries when summing by labels, which is very common
  • Overflow metrics will satisfy negative filter statements (e.g. status_code != 200), so it may result in strange error rates when overflow occurs.

The slight disadvantage is question what to do if the target has multiple set of different label names on single metric, but perhaps including all labels (dimensions) that overflows on single overflow series might do the trick.

Also "..." could be a great overflow label value here as well 🤔 mycounter{otel.metric.overflow="true" a="...", b="..."} 100

bwplotka avatar Mar 14 '24 09:03 bwplotka

Please do not invent new conventions that do not follow Prometheus best practices.

In the event of a data error in a collector, best practice is to fail the whole collection with a 5xx error.

  • https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics
  • https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes

This follows the "Fail Fast" best practice.

SuperQ avatar Mar 14 '24 11:03 SuperQ

@SuperQ I believe what happens today in Prometheus (with client lirbaries) is the process itself crashes (eventually) on high cardanility. OTEL deemed this unacceptable, which is why we suppress/limit cardinality client side.

If you have links to other ways prometheus deals with things client-side happy to follow the best practice. But I would not say there is one today.

jsuereth avatar Mar 14 '24 18:03 jsuereth

@bboreham @bwplotka I do like what you're suggesting with setting the value of labels to "oveflow". It's not entirely feasible in OTEL because we don't have string labels, we have typed attributes (so a label could be an int, e.g.). I think this is why we just add the new label here. I do really like your suggestion though.

In this case, I think our assumption is that in prometheus the label will have an empty value (e..g overflow="true", a="", b=""). We assume (via best practice) the a/b will have real values in most cases, so an empty label will already stand out as much as a "..." would.

@jack-berg / @reyang do you remember all the rationale we had when we decided on just one new label? Is it just the typed-attribute issue?

jsuereth avatar Mar 14 '24 18:03 jsuereth

One option would be for prometheus exporters to fail the scrape if the overflow attribute is set on any metrics. But unlike with the prometheus server's scrape limits, you wouldn't be able to mitigate it by changing a server-side config. You would need to restart your client with a higher overflow limit.

An alternative would be to recommend to users that they set their scrape limits on the prometheus server lower than the client-side limits so they still get the scrape failures, but also have the ability to mitigate via server config.

dashpole avatar Mar 14 '24 18:03 dashpole

@dashpole, yup, exactly my thoughts. If you handle sample limits in the library you want to fail the scrape. But as you point out, it's now the client's issue to deal with it. But that sounds like it's working as intended.

In Prometheus, we leave it up to users to decide what cardinality they want from the client libraries. Like you say, Prometheus can handle cardinality limits on the server side with admin configured scrape limits.

In practice we've found users tend to notice on the server side long before it becomes a client-side issue. Since the monitoring system has to deal with the total cardinality of all instances, it tends to show up there first.

The Prometheus client libraries use very few bytes of memory per metric exposed. I can get some benchmark tests of this if it helps.

SuperQ avatar Mar 14 '24 19:03 SuperQ

@dashpole This doesn't fix the fact that they will consume memory storing metrics client side.

jsuereth avatar Mar 14 '24 19:03 jsuereth

@jsuereth sorry for the confusion. I'm not suggesting we remove client-side limits, and I think those are useful. I meant to say that we could keep the current client-side limits and attribute, and tell prometheus users to set scrape limits below our default value as a way to get the prometheus-recommended behavior. That way, we would not need to introduce scrape failures in our exporter, which could cause the problem described above for users.

@SuperQ I trust your assertiib regarding low memory usage of prometheus client libraries. I would read this proposal as adding an additional layer of protection for users to prevent DoS, not as a replacement for server-specified limits. I agree that Prometheus users will generally notice cardinality problems in the server before the client, assuming they are scraping the endpoint.

dashpole avatar Mar 14 '24 19:03 dashpole

Nerd-sniped myself into writing a quick test. I generated 100k and 1M tests with metrics like test_cardinality_total{test_label="zzy9RevCcI6lnBy4gzX02Antnt9ur9Ux"}.

  • 100k = 15MiB = ~157 bytes/metric pprof.me
  • 1M = 198MiB = ~207 bytes/metric pprof.me

Code:

package main

import (
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"github.com/thanhpk/randstr"
)

func main() {
	fmt.Println("Generating metrics")
	metricTest := promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "test_cardinality_total",
			Help: "Test for cardinality",
		},
		[]string{"test_label"},
	)

	for i := 0; i < 100_000; i++ {
		label := randstr.String(32)
		fmt.Printf("Generating metric %d %s\n", i, label)
		metricTest.WithLabelValues(label).Inc()
	}

	http.Handle("/metrics", promhttp.Handler())
	fmt.Println("Listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}

SuperQ avatar Mar 14 '24 20:03 SuperQ