opentelemetry-collector-contrib

receiver/prometheus: Populate scope attributes from labels instead of otel_scope_info

Open ArthurSens opened this issue 6 months ago • 7 comments

This PR is a work in progress: it cannot yet handle multiple metrics that should be treated as having the same Scope.

Description

This PR is a PoC for the spec change described at https://github.com/open-telemetry/opentelemetry-specification/issues/4223.

The specification change proposes that Prometheus exporters stop generating the otel_scope_info metric carrying the scope attributes and instead attach all scope information as labels on each series, making the scope information identifying.
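
For illustration only (metric and label names here are hypothetical, not taken from the PR), an SDK that previously exposed

otel_scope_info{otel_scope_name="my.library",otel_scope_version="1.2.3",tier="gold"} 1
my_counter_total{otel_scope_name="my.library",otel_scope_version="1.2.3"} 5

would, under the proposed behavior, carry the scope attribute on every series as a prefixed label:

my_counter_total{otel_scope_name="my.library",otel_scope_version="1.2.3",otel_scope_tier="gold"} 5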

For the receiver, this means the scope should no longer be populated from otel_scope_info; instead, it should be derived from each metric's labels prefixed with otel_scope_.
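
A minimal Go sketch of that receiver-side split, assuming labels arrive as a plain map (all names here are illustrative, not the PR's actual code):

package main

import (
    "fmt"
    "strings"
)

// splitScopeLabels separates scope-carrying labels (prefix otel_scope_)
// from ordinary metric labels. Hypothetical helper for illustration only.
func splitScopeLabels(labels map[string]string) (scopeAttrs, metricLabels map[string]string) {
    const prefix = "otel_scope_"
    scopeAttrs = map[string]string{}
    metricLabels = map[string]string{}
    for k, v := range labels {
        if strings.HasPrefix(k, prefix) {
            scopeAttrs[strings.TrimPrefix(k, prefix)] = v
            continue
        }
        metricLabels[k] = v
    }
    return scopeAttrs, metricLabels
}

func main() {
    scope, rest := splitScopeLabels(map[string]string{
        "otel_scope_name":    "my.library",
        "otel_scope_version": "1.2.3",
        "otel_scope_tier":    "gold",
        "method":             "GET",
    })
    fmt.Println(scope) // map[name:my.library tier:gold version:1.2.3]
    fmt.Println(rest)  // map[method:GET]
}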

For users who haven't yet updated to newer SDK versions that adhere to the spec change, ignoring the otel_scope_info metric would be a breaking change from this receiver's perspective. For that reason, the change is guarded by a feature gate.

If an exporter exposes otel_scope_info and the feature gate receiver.prometheusreceiver.RemoteScopeInfo is enabled, then otel_scope_info will be transformed into a metric like any other instead of being cached and used to generate ScopeMetrics information.
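
For context, collector feature gates are registered and checked through the go.opentelemetry.io/collector/featuregate package. A minimal sketch (the gate ID is copied from the comment above; the stage and description are assumptions, not the PR's actual registration):

package main

import (
    "fmt"

    "go.opentelemetry.io/collector/featuregate"
)

// Assumed registration for illustration; the real stage/description may differ.
var scopeInfoGate = featuregate.GlobalRegistry().MustRegister(
    "receiver.prometheusreceiver.RemoteScopeInfo",
    featuregate.StageAlpha,
    featuregate.WithRegisterDescription(
        "When enabled, otel_scope_info is treated as a regular metric "+
            "instead of being used to populate scope attributes."),
)

func main() {
    if scopeInfoGate.IsEnabled() {
        fmt.Println("otel_scope_info passes through as a normal metric")
    }
}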

ArthurSens avatar May 13 '25 22:05 ArthurSens

PoC ready for review!

ArthurSens avatar May 25 '25 12:05 ArthurSens

Has it been tested together with https://github.com/open-telemetry/opentelemetry-go/pull/5947?

This is what I'm guessing based on: https://github.com/open-telemetry/opentelemetry-go/pull/5947#issuecomment-2904114141

CC @jade-guiton-dd

pellared avatar May 25 '25 19:05 pellared

@pellared What I tested was that the Go SDK's Prometheus exporter runs properly when the Collector emits internal metrics that only differ in their scope attributes. I haven't tested Contrib's receiver.

jade-guiton-dd avatar May 26 '25 10:05 jade-guiton-dd

I'll run some tests today with the full pipeline: Go-SDK -> Prometheus Receiver -> Prometheus Exporter/ Debug Exporter

ArthurSens avatar May 26 '25 10:05 ArthurSens

Manual tests are showing that I broke the current behavior that extracts attributes from otel_scope_info metrics, and when the feature gate is enabled I'm getting nil pointers xD

Back to draft

ArthurSens avatar May 26 '25 15:05 ArthurSens

Still working on this and still struggling to fix it. I've changed the test strategy to something more end-to-end, and this triggers the nil pointers I'm seeing when doing manual tests with the whole pipeline go-sdk -> prom-receiver -> debug exporter.

ArthurSens avatar May 30 '25 00:05 ArthurSens

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Jun 13 '25 05:06 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Jun 28 '25 05:06 github-actions[bot]

I was traveling for a few weeks. I'm reopening since this is still important.

ArthurSens avatar Jun 30 '25 13:06 ArthurSens

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Jul 15 '25 05:07 github-actions[bot]

I'm going to adopt this difficult PR as a way to learn more about the codebase. I do not promise swift results :)

ywwg avatar Jul 16 '25 20:07 ywwg

Some initial questions on the design as I take over this PR, as we discussed. Let's chat about them, maybe tomorrow?

absolutely!

ArthurSens avatar Jul 17 '25 18:07 ArthurSens

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Aug 01 '25 05:08 github-actions[bot]

Ok, I finally found some time to get back to this. I ended up fixing the previous problems by introducing an interface. I was afraid this would introduce a performance penalty, so I implemented a benchmark before proceeding. CPU profiles show an increase in runtime from ~20s to ~22s between benchmarks. Most of the new allocations go to the new function getScopeIdentifier and to hashing the scope key.

goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver/internal
cpu: Apple M2 Pro
                                       │ internal/bench-main.txt │         benchmark-new.txt          │
                                       │         sec/op          │   sec/op     vs base               │
TransactionAppendAndCommit/Floats-2                  2.668µ ± 2%   2.872µ ± 2%   +7.63% (p=0.002 n=6)
TransactionAppendAndCommit/Histogram-2               6.550µ ± 1%   7.263µ ± 2%  +10.88% (p=0.002 n=6)
geomean                                              4.181µ        4.567µ        +9.24%

                                       │ internal/bench-main.txt │         benchmark-new.txt          │
                                       │          B/op           │     B/op      vs base              │
TransactionAppendAndCommit/Floats-2                 2.410Ki ± 0%   2.555Ki ± 0%  +6.00% (p=0.002 n=6)
TransactionAppendAndCommit/Histogram-2              5.465Ki ± 0%   5.918Ki ± 0%  +8.29% (p=0.002 n=6)
geomean                                             3.629Ki        3.888Ki       +7.14%

                                       │ internal/bench-main.txt │         benchmark-new.txt         │
                                       │        allocs/op        │ allocs/op   vs base               │
TransactionAppendAndCommit/Floats-2                   22.00 ± 0%   26.00 ± 0%  +18.18% (p=0.002 n=6)
TransactionAppendAndCommit/Histogram-2                48.00 ± 0%   64.00 ± 0%  +33.33% (p=0.002 n=6)
geomean                                               32.50        40.79       +25.53%
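
For reference, the scope-key hashing attributed to getScopeIdentifier above might look roughly like this (a hypothetical sketch, not the PR's actual code):

package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

// getScopeIdentifier hashes the identifying scope fields (name, version,
// sorted attributes) into one key so that series sharing a scope can be
// grouped under a single ScopeMetrics entry. Illustrative only.
func getScopeIdentifier(name, version string, attrs map[string]string) uint64 {
    h := fnv.New64a()
    h.Write([]byte(name))
    h.Write([]byte(version))
    keys := make([]string, 0, len(attrs))
    for k := range attrs {
        keys = append(keys, k)
    }
    sort.Strings(keys) // stable order, independent of map iteration
    for _, k := range keys {
        h.Write([]byte(k))
        h.Write([]byte(attrs[k]))
    }
    return h.Sum64()
}

func main() {
    fmt.Println(getScopeIdentifier("my.library", "1.2.3", map[string]string{"tier": "gold"}))
}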

[CPU profile screenshots comparing the two benchmark runs]

Not sure if we're in a position to care too much about performance, but please let me know if I could be taking another approach!

ArthurSens avatar Aug 12 '25 00:08 ArthurSens

I think we should proceed with whatever you have working, and open an issue with your profiles, etc to look into optimizing that code path.

dashpole avatar Aug 12 '25 14:08 dashpole

I think we should proceed with whatever you have working, and open an issue with your profiles, etc to look into optimizing that code path.

Once the otel_scope_info functionality is fully removed, we could remove the interface again, and the compiler will be able to inline a lot of that code path again 🤔

ArthurSens avatar Aug 13 '25 18:08 ArthurSens

I imagine people will have a mix of clients for some time. Some will use the new scope attributes on metrics, and others will still be using otel_scope_info. I think we should always add otel_scope_foo attributes as scope attributes, and use the feature gate to remove special handling of otel_scope_info.

Hmmm, good point. Let me change that

ArthurSens avatar Sep 01 '25 18:09 ArthurSens

Okay, I've added a commit changing the behavior of the feature gate. Now we always parse attributes from labels prefixed with otel_scope_, and the feature gate controls whether or not we merge in attributes from the otel_scope_info metric.
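
A minimal sketch of that control flow (names and merge precedence are assumptions, not the PR's actual code):

package main

import "fmt"

// buildScopeAttrs always applies attributes parsed from otel_scope_-prefixed
// labels; attributes cached from an otel_scope_info metric are merged in only
// while the feature gate is disabled (legacy behavior). Illustrative only.
func buildScopeAttrs(labelAttrs, scopeInfoAttrs map[string]string, gateEnabled bool) map[string]string {
    out := make(map[string]string, len(labelAttrs)+len(scopeInfoAttrs))
    if !gateEnabled {
        for k, v := range scopeInfoAttrs {
            out[k] = v
        }
    }
    // Assumption: label-derived attributes win on conflict, since they are identifying.
    for k, v := range labelAttrs {
        out[k] = v
    }
    return out
}

func main() {
    fmt.Println(buildScopeAttrs(
        map[string]string{"tier": "gold"},
        map[string]string{"team": "obs", "tier": "silver"},
        false, // gate disabled: otel_scope_info attributes still merged
    ))
}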

Could we first review this part, and if necessary, we can refactor the duplicate map (https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/40060#discussion_r2304136797) after this review?

ArthurSens avatar Sep 01 '25 22:09 ArthurSens

Could we first review this part, and if necessary, we can refactor the duplicate map (https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/40060#discussion_r2304136797) after this review?

Sure

dashpole avatar Sep 02 '25 14:09 dashpole

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Sep 17 '25 05:09 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Oct 02 '25 05:10 github-actions[bot]