opentelemetry-go
Performance vs. Prometheus SDK
Description
Jaeger is in the process of migrating away from the Prometheus SDK towards the OTEL SDK. We're currently blocked by a massive performance degradation, as illustrated by this benchmark: https://github.com/jaegertracing/jaeger/pull/5676. Are we using the OTEL SDK incorrectly? We're seeing a 10-25x slowdown compared to the Prometheus SDK.
$ go test -benchmem -benchtime=2s -bench=Benchmark ./internal/metrics/
BenchmarkPrometheusCounter-10 342003924 6.984 ns/op 0 B/op 0 allocs/op
BenchmarkOTELCounter-10 33299455 71.73 ns/op 0 B/op 0 allocs/op
BenchmarkOTELCounterWithLabel-10 12442818 190.6 ns/op 16 B/op 1 allocs/op
Environment
- OS: macOS
- Architecture: arm64
- Go Version: 1.22
- opentelemetry-go version: 1.27.0
Steps To Reproduce
https://github.com/jaegertracing/jaeger/pull/5676
Expected behavior
Counter increments should be in the same ballpark as Prometheus counters.
To me this looks like correct usage of the OTel Metrics SDK.
To summarize my findings in https://github.com/open-telemetry/opentelemetry-go/pull/5544:
- We should take a close look at the exemplar reservoir performance when exemplars are disabled. It currently makes up a substantial (~50%) portion of the overhead for the no-attributes case.
- OTel could potentially consider bound instruments if we want to achieve performance similar to the Prometheus client. Bound instruments would potentially eliminate ~95% of the current overhead (excluding overhead from exemplar recording) for the with-attributes case.
- For a very small improvement, we could consider optimizing the code that increments our counter values, similar to what Prometheus has done. This would be more noticeable if we implemented the other optimizations above.
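To make the last point concrete, here is a minimal sketch of an increment fast path in the style of the Prometheus Go client, using only the standard library. The `fastCounter` type and its field names are hypothetical and are not taken from either SDK. The assumption is that most increments are whole numbers, which can be handled with a single atomic add instead of a compare-and-swap loop on float bits.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// fastCounter is a hypothetical counter that stores whole-number
// increments and fractional increments in separate words.
type fastCounter struct {
	intPart   atomic.Uint64 // sum of whole-number increments
	floatBits atomic.Uint64 // math.Float64bits of the sum of all other increments
}

// Add records a non-negative increment; negative values are ignored.
func (c *fastCounter) Add(v float64) {
	if v < 0 {
		return
	}
	// Fast path: v is exactly representable as a uint64, so one atomic add suffices.
	if u := uint64(v); float64(u) == v {
		c.intPart.Add(u)
		return
	}
	// Slow path: compare-and-swap loop on the float64 bit pattern.
	for {
		old := c.floatBits.Load()
		next := math.Float64bits(math.Float64frombits(old) + v)
		if c.floatBits.CompareAndSwap(old, next) {
			return
		}
	}
}

// Value returns the combined total of both parts.
func (c *fastCounter) Value() float64 {
	return float64(c.intPart.Load()) + math.Float64frombits(c.floatBits.Load())
}

func main() {
	var c fastCounter
	c.Add(1)
	c.Add(2)
	c.Add(0.5)
	fmt.Println(c.Value()) // 3.5
}
```

Whether this is worth it for OTel depends on how much of the remaining per-call cost is the increment itself, which is why it only becomes noticeable after the other two items are addressed.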
Bound instruments OTEP https://github.com/open-telemetry/oteps/blob/main/text/metrics/0070-metric-bound-instrument.md
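For readers unfamiliar with the idea, the following is a rough standard-library sketch of what "bound" means here. It is not the API from the OTEP or from either SDK; `counterVec`, `Add`, and `Bind` are hypothetical names. The unbound path resolves the attribute set on every call, while the bound path resolves it once and then only pays for an atomic add.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counterVec is a hypothetical counter keyed by a single label value.
type counterVec struct {
	mu     sync.Mutex
	series map[string]*atomic.Int64
}

func newCounterVec() *counterVec {
	return &counterVec{series: make(map[string]*atomic.Int64)}
}

// lookup returns the series for label, creating it on first use.
func (v *counterVec) lookup(label string) *atomic.Int64 {
	v.mu.Lock()
	defer v.mu.Unlock()
	s, ok := v.series[label]
	if !ok {
		s = new(atomic.Int64)
		v.series[label] = s
	}
	return s
}

// Add is the unbound path: every call pays for the lock and the map lookup.
func (v *counterVec) Add(label string, n int64) {
	v.lookup(label).Add(n)
}

// Bind is the bound path: resolve the series once and hand back a handle
// whose Add is a single atomic operation.
func (v *counterVec) Bind(label string) *atomic.Int64 {
	return v.lookup(label)
}

func main() {
	v := newCounterVec()
	v.Add("GET", 1) // unbound: lookup on every call

	bound := v.Bind("GET") // bound: lookup once
	bound.Add(2)

	fmt.Println(bound.Load()) // 3, both paths update the same series
}
```

In a hot loop the caller would keep `bound` around and never touch the map again, which is where the bulk of the with-attributes savings would come from.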
> We should take a close look at the exemplar reservoir performance when exemplars are disabled. It currently makes up a substantial (~50%) portion of the overhead for the no-attributes case.
This appears to be caused by the time.Now() call made for each measurement. We should at least consider moving the time.Now() call into the exemplar reservoir so that it is only invoked when we are actually recording an exemplar.
I also found that the benchmark results did not change when I swapped out the OTel Prometheus exporter for a manual reader (which is expected). I'm removing the Prometheus exporter label.
https://github.com/open-telemetry/opentelemetry-go/pull/5545 is a ~45% performance improvement for the zero-attributes case, and a ~20% performance improvement for the single-attribute case.