opentelemetry-go Performance vs. Prometheus SDK

Description

Jaeger is in the process of migrating from the Prometheus SDK to the OTEL SDK. We're currently blocked by a massive performance degradation, as illustrated by this benchmark: https://github.com/jaegertracing/jaeger/pull/5676. Are we not using the OTEL SDK correctly? We're seeing a 10-25x slowdown compared to the Prometheus SDK.

$ go test -benchmem -benchtime=2s -bench=Benchmark ./internal/metrics/
BenchmarkPrometheusCounter-10       	342003924	         6.984 ns/op	       0 B/op	       0 allocs/op
BenchmarkOTELCounter-10             	33299455	        71.73 ns/op	       0 B/op	       0 allocs/op
BenchmarkOTELCounterWithLabel-10    	12442818	       190.6 ns/op	      16 B/op	       1 allocs/op

Environment

OS: macOS
Architecture: arm64
Go Version: 1.22
opentelemetry-go version: 1.27.0

Steps To Reproduce

https://github.com/jaegertracing/jaeger/pull/5676

Expected behavior

Expecting counter increments to be in the same ballpark as Prometheus counters.

Jun 25 '24 00:06 yurishkuro

To me, this looks like correct usage of the OTel Metrics SDK.

Jun 25 '24 12:06 pellared

To summarize my findings in https://github.com/open-telemetry/opentelemetry-go/pull/5544:

We should take a close look at the exemplar reservoir performance when exemplars are disabled. It currently makes up a substantial (~50%) portion of the overhead for the no-attributes case.
OTel could potentially consider bound instruments if we want to achieve performance similar to the Prometheus client. Bound instruments would potentially eliminate ~95% of the current overhead (excluding overhead from exemplar recording) for the with-attributes case.
For a very small improvement, we could consider optimizing the code which increments our counter values, similar to what Prometheus has done. This would be more noticeable if we implemented the other optimizations above.

Jun 25 '24 17:06 dashpole

Bound instruments OTEP https://github.com/open-telemetry/oteps/blob/main/text/metrics/0070-metric-bound-instrument.md

Jun 25 '24 17:06 yurishkuro

> We should take a close look at the exemplar reservoir performance when exemplars are disabled. It currently makes up a substantial (~50%) portion of the overhead for the no-attributes case.

This appears to be because of the time.Now() call for each measurement. We should at least consider moving the time.Now call into the exemplar reservoir so that it is only invoked when we are actually recording an exemplar.

Jun 25 '24 19:06 dashpole

I also found that the benchmark did not change if I swapped out the OTel prometheus exporter with a manual reader (which is expected). I'm removing the prometheus exporter label.

Jun 25 '24 19:06 dashpole

https://github.com/open-telemetry/opentelemetry-go/pull/5545 is a ~45% performance improvement for the zero-attributes case, and a ~20% performance improvement for the single-attribute case.

Jun 25 '24 21:06 dashpole
