sentry-java Java: POTel Research

Notion POTEL Doc
- Talked to Francesco to learn from Node/JS changes

### Statement on OTel support - tech feasibility and quality
- [ ] Java (and frameworks)
- [ ] Kotlin
- [ ] Android

Currently working on a backend POC
- trying to translate JS to Java then go from there and check for blockers, required API changes etc.
- [x] increasing internal list of spans
  - [x] requests to sentry are not filtered properly
- [x] spans are reported mulitple times with increasing duration
- [ ] Check differences in context handling
Also need to do a POC for Android + OTEL to check for blockers
Check how good auto instrumentation for Spring (Boot) is without the Agent

Context Handling

Attributes vs Context

Attributes can only hold few types (String, Boolean, Long, Double) and arrays thereof (see AttributeType)
Context can also hold complex types
[ ] Need to test whether Context.current() works in SpanProcessor
- [x] Initial test looks promising
- [ ] but there might be more complex scenarios where this does not work
Attributes can be retrieved from ReadableSpan in SpanProcessor.onEnd()
Context can not be retrieved from ReadableSpan
Context has an internal split, when AgentContextWrapper is used (presumably this is always the case when using the OTEL Java Agent)
- agentContext vs applicationContext

ContextStorageProvider

We could implement a custom ContextStorageProvider to intercept whenever context is replaced and fork scope
We'll also have to figure out when to fork isolation scope

Hubs and Scopes merge

https://develop.sentry.dev/sdk/hub_and_scope_refactoring/
Hub will be removed long term (future major release)
Hub can simply forward to scope under the hood so we may be able to change the code without requiring a major

Execution context

[ ] Should we explicitly have ExecutionContext in code to wrap isolation scope, current scope and maybe also current span?
[ ] Would be the replacement thread local variable for currentHub

Scopes

There will be multiple scopes
- Global scope
  - Truely global scope (process wide), not per client
  - Can be written to even before Sentry.init
  - Can be used e.g. to add tags that will be added to all events etc.
    - before this basically required users to add tags immediately after Sentry.init was called
    - this can now be customized even from asynchronously running code, e.g. after some lib initializes or after some API call was made
  - Holds a reference to the default global client
  - to add tags, set user etc. users need to explicitly perform Sentry.getGlobalScope().setTag() etc.
- Isolation scope
  - similar to a global scope, but tied to a specific section of your app/code that should be isolated
    - e.g. for the duration of handling a request server side
  - Users don't have to worry about writing to a scope that doesn't apply to the whole request by accident, they can now use isolation scope e.g. to add tags or set a user that is applied to the whole request
  - Global methods to set data, like Sentry.setTag() or Sentry.setUser(), will set on the isolation scope.
- Current scope
  - Global methods that capture like Sentry.captureException(e) will capture on the current scope
  - While capturing global scope, isolation scope and current scope are merged and applied to events

Scope forking

See https://miro.com/app/board/uXjVNtPiOfI=/

When forking current scope (with new_scope)
- forks current scope and sets the fork as active current scope
- does not fork isolation scope (isolation scope remains untouched)
- [ ] this is the only of these methods that's part of public API, others can be usable by users but not part of public API directly
When setting current scope (with use_scope(scope))
- uses passed in scope as active current scope
- does not fork current scope
- isolation scope remains untouched
When forking isolation scope (with isolation_scope)
- forks isolation scope and sets the fork as active isolation scope
- forks current scope and sets the fork as active current scope
When setting isolation scope (with use_isolation_scope(isolation_scope))
- uses passed in isolation scope as active isolation scope
- [ ] do we even need this?
When setting isolation scope and current scope
- uses passed in isolation scope as active isolation scope
- uses passed in current scope as active current scope
- does not fork either scope
- [ ] maybe this should be called with execution_context(execution_context)

Applying scopes to events

  public captureEvent(event: Event, additionalScope?: Scope) {
    // Global scope is always applied first
    const scopeData = getGlobalScope().getScopeData();

    // Apply isolations cope next
    const isolationScope = getIsolationScope();
    merge(scopeData, isolationScope.getScopeData());

    // Now the scope data itself is added
    merge(scopeData, this.getScopeData());

    // If defined, add the captureContext/scope
    // This is e.g. what you pass to Sentry.captureException(error, { tags: [] })
    if (additionalScope) {
      merge(scopeData, additionalScope.getScopeData());
    }

    // Finally, this is merged with event data, where event data takes precedence!
    mergeIntoEvent(event, scopeData);
  }
}

Transferring execution context

When moving execution from one thread to another, we should set isolation scope and current scope to the same ones on the new thread
For reactive programming libraries we'll have to backup and restore isolation scope, current scope and maybe also the current span (if it's not set on the scope but directly on execution context)
We should probably not fork isolation scope likely to not diminish its usefulness
see https://miro.com/app/board/uXjVNtPiOfI=/

Tracing without Performance (TwP) / `PropagationContext`

Should this be put on isolation scope?
Should we have a more abstract API for isolation that
- forks isolation scope
- forks current scope
- creates a new PropagationContext

Copy on write

[ ] Scopes should only be actually copied when there's a change.
- For current scope there should be a very limited number of actual changes but lots of forks
- [ ] Is the most likely change happening setting a new active span?

Locking

[ ] we usually lock write access to scope via withPropagationContext, withSession, setTransaction/clearTransaction, startSession/endSession, how will this behave in the future?

Lifecycle management

[ ] Things like pushScope and pushIsolationScope could return an AutoClosable where on close we restore the previous scope. This would allow users to simply write try(x = pushScope()) { ... } instead of manually having to popScope. It'd also allow for restoring previous state without risking an imbalance between pushScope and popScope calls. OTEL does this, e.g. in Context.makeCurrent().

Migration docs

[ ] May need to make it extra clear in changelog, migration guide etc. that top level API (e.g. Sentry.setTag()) now goes to isolation scope whereas e.g. Sentry.configureScope(s -> s.setTag()) goes to the current scope. This may lead to unexpected scope leaks for our users.
[ ] scope
- [ ] pushScope+popScope -> ?
- [ ] withScope -> ?
- [ ] configureScope -> ?
[ ] direct hub usage
- [ ] hub.captureX -> scope.captureX ?

Document common use cases

[ ] setting tags, breadcrumbs, user, ... for the current request -> Sentry.addTag()
[ ] setting tags, breadcrumbs, user, ... globally -> Sentry.getGlobalScope().setX etc.

Feb 05 '24 12:02 smeubank

@adinauer i converted this to an issue, so we can make more proper task lists and potentially just convert it to an issue/epic later

Feb 05 '24 12:02 smeubank

After daily discussion about OTEL Context forking with @sl0thentr0py , @antonpirker, @sentrivana , @szokeasaurusrex

Core question

Q: In case there is no hook for any given SDK, should we fall back to a global storage plus OTEL Context usage combination or try and get a proper hook by contributing upstream (OTEL) to completely rely on OTEL Context for storing Sentry Scopes?
A: Try and get the hook upstream, see how that goes and only if it doesn't get merged consider the fallback solution.

Side note: Sentry Scopes is just a collection object for an isolation Scope as well as a current Scope.

Miro board showcasing pure Context usage and combined approach

How does this work in JS SDK?

JS SDK uses a hook provided by OTEL called ContextManager.with(...) which allows them to return a modified copy of OTEL Context containing Sentry Scopes.

While there also is a global storage for OTEL spans, this should only be needed for building Sentry transactions. Once span streaming lands, we should be able to remove that global storage.

While for JS SDK there also is a distinction between Context forking and it being sent to ContextManager.with(...) API should make sure, Context forks go through ContextManager.with(...).

Why do we care about OTEL `Context`?

OTEL takes care of propagating that Context e.g. from thread A to thread B, stashes Context and restores it whenever needed. We want to rely on this mechanism as otherwise we'd have to provide our own propagation mechanism for each library supported by OTEL. This means we wouldn't be able to simply leverage available OTEL auto instrumentation but have to implement a propagation mechanism that's already there in OTEL via Context.

Can't we simply propagate our `Scopes` ourselves and ignore OTEL `Context`?

We could but we'd have to provide a counterpart for every OTEL Context propagation mechanism. We would no longer be able to just rely on OTEL for instrumentations but have to mirror some of what they do. This would also have to be kept up to date. Whenever OTEL changes something about how they propagate Context we'd have to adapt as well. Keeping Context and Scopes in sync would be hard and possibly also hard to detect.

Examples of where we'd have to provide our own propagation:

Reactive libraries (current Sentry implementation vs OTEL)
Executor libraries where you can schedule a task to be run on another thread (current Sentry implementation

Fallback solution

For Java SDK SpanProcessor seems like the only place where we can intercept Context forking - at least for cases where Context is forked due to a new OTEL span being created.

There is no available mechanism for manipulating Context and have it used. While a parent Context is handed into SpanProcessor, Context is immutable so we could only create a modified copy of it but there's no way of e.g. returning it and having it used by OTEL.

This means we have to keep a global storage, where OTEL span is used as a weakly referenced key and Sentry Scopes is the value. Now if we combine this with Sentry.withScope API - which can be called at any time - there's no easy way to add those to global storage anymore. Instead we could rely on ~ Context.with(SentryScopes.fork()).makeCurrent() inside Sentry.withScope and similar API. This means for cases where we create a new scope, we add it to the OTEL Context and manage lifecycle to restore the previous Context once e.g. Sentry.withScope ends.

Due to now having two places where we can store Sentry Scopes, whenever the current Scope is retrieved, e.g. via Sentry.getCurrentScope(), we have to look at both places. We could look at Scope hierarchy and check if any of the parents of the Scope stored in Context is the Scope we created for an OTEL span so we know whether it's nested inside the span Scope (disclaimer: didn't spend too much time thinking through how this would actually have to work). Where this falls short is when Sentry.setCurrentScope(new Scope()) is called because then hierarchy isn't available. There may also be other cases where this gets complicated - ideally we don't go down this road and not have to worry about it.

Intercepting `Context` storage

In Java SDK it's possible to provide a custom ContextStorage. While this makes it possible to modify Context before storing it, there's some pitfalls:

Context forking isn't the same thing as Context storing. A Context may be forked without ever being stored in this ContextStorage.
ContextStorage returns an OTEL Scope, which is a lifecycle management object to restore the previous Context. It does not allow to return a modified Context so if a user calls .makeCurrent() on a context, the Context that ends up in storage isn't the same one the user references but a modified copy of that.
It's currently unclear if these problems matter for us, but in Java SDK, API makes it easy to fork a Context without ever sending it to store via .makeCurrent().
~~We couldn't get our custom ContextStorage to be used by both the application and the agent via ContextStorageProvider. Instead we'd have to use non public API ContextStorageWrappers.addWrapper to register our custom storage. This could easily break in future versions.~~ We now use ContextStorage.addWrapper to register our wrapper around ContextStorage, which is public API.

Some more alternatives and why they aren't helping can be found in https://github.com/open-telemetry/opentelemetry-java-instrumentation/discussions/10691

Side topic: Two way reference between OTEL `Context` and Sentry `Scope`

When implementing POTEL for any SDK we should try and avoid this as it could easily get out of sync and lead to bugs.

Feb 28 '24 11:02 adinauer

Why we need to globally store Sentry `Scopes` even if we have a proper `Context` forking hook anyways

In SpanProcessor.onEnd and by extension also SpanExporter.export, Context.current() returns the parent Context and not the Context that was active during span execution. This seems to be by design in OTEL. This means even if we managed to add our Scopes to Context, we can't access it in SpanProcessor.onEnd().

To work around this, we're planning to use a global storage that weakly references OTEL span -> Scopes. Whenever SpanProcessor.onStart is called, we stash away Scopes referenced weakly by the OTEL span.

We'd like to fork isolation scope in Propagator.extract as that should be called for all server / consumer use cases we we likely want a new isolation scope.

We do not yet know how we want to handle current scope in combination with SpanProcessor.

In JS, there's a shared Scopes reference between global storage and Context which is updated in a custom copy of the OTEL http integration. The http integration forks isolation scope and sets it on the shared Scopes instance.

I gave wrapping the OTEL Span a quick shot but didn't go through (yet).

Wrapping Span in ContextStorage / ContextWrapper:

we're not guaranteed to see all Spans there
while this may work for auto instrumentation, users performing manual instrumentation can simply call span.end() on the reference they got from spanBuilder.startSpan() instead of calling Span.current().end() (where our interception could do something) see OTEL docs.
To wrap Span creation in SpanBuilder we'd have to open up a bunch of non public API in opentelemetry-java (SentrySdkTracerProviderBuilder, SdkTracerProvider, SdkSpan) and extract some interfaces (SdkTracerProvider SdkSpan). Even then, we'd have to wrap all the way from SentrySdkTracerProviderBuilder through SdkTracerProvider, TracerBuilder, Tracer, SpanBuilder to Span/SdkSpan.
There's some places in OTEL SDK where they cast to SdkSpan meaning we'd have to extract an interface.
With this much wrapping, and delegating to original instances, we'd have to maintain this delegation between releases of OTEL SDK since new methods could be added to interfaces with default imlementation. This wouldn't be a breaking change but our wrappers would also have to be updated to delegate those new methods, otherwise behaviour could break.

For other SDKs it might be possible to attach the Sentry Scopes to the OTEL span in SpanProcessor.onStart(), so we have access to it when forking/storing Context and in SpanExporter.export().

Mar 05 '24 11:03 adinauer

Why do we need the Hubs/Scopes merge for POTEL?

We do want to go through with the Hubs/Scopes merge in all of our SDKs regardless of POTEL. We want to remove the hub concept for users and also provide a "copy-on-write" that prevents footguns like popScope in an async execution context where it affects previous execution context because the user forgot to clone the hub. With POTEL the problem of having to write to the correct scope becomes even more complex since we'd fork scope much more often than in the current version of the SDK. Isolation scope aims to solve that.

We want to have a scope forking system that is consistent with how OTEL does forking.

Some of the limitations with current Hubs/Scopes can be found here.

Here's some reasons I found but so far I'm not sure any of those actually apply to the Java SDK in a way where it wouldn't be possible to make POTEL work with hubs.

Context is immutable in OTEL, and works via copy-on-write - so each time you want to change anything on the context, you fork it. This means that you have much more forking than you have in Sentry currently (source: RFC and Dev Docs)
- copy-on-write here is referring to avoiding a problem with current SDK where unless you clone the hub, any async execution can manipulate the previous execution, e.g. by calling popScope
since you always fork a context when you change the active span (as the span is set on the context, and updating it means forking a context), you can quickly get into deeply nested context situations due to auto instrumentation. (source: RFC)
- This is the reason for adding isolation scope which again is part of providing the copy-on-write mentioned above
- Hub/Scope forking can already be an issue today in SDKs but it is exagerated by switching to POTEL
OTEL auto instrumentation must attach correctly to whatever is the “active span”. This means if I create an active span in Sentry via Sentry.startActiveSpan this must also become the active span on the active context in OTEL. (source: Notion)
- ❓ If we use OTEL API to create a span from the Sentry SDK and make that span the current one, shouldn't that write to the OTEL Context and solve this?
If I use scope.getSpan() I must get the active Sentry span. This means I always need to know what the active Sentry span is, even when we are auto instrumenting with OTEL. (source: Notion)
- ❓ Shouldn't we simply be able to retrieve the current span from OTEL Context.current().getSpan()
- ❓ Is there even a concept of a Sentry span anymore or is this simply a wrapper around an OTEL span?
Hub splitting must match context creation in OTEL. This means that whenever we create a separate hub for an execution context in Sentry, this must be synced with the context in OTEL, otherwise errors/spans/breadcrumbs etc. may mismatch. (source: Notion)
- ❓ For Java SDK this should be as simple as Context.current().with(hub).makeCurrent()

Mar 06 '24 11:03 adinauer

How can `Scopes` storage work?

See miro top left which shows a flow through some OTEL auto instrumentation and Sentry API calls.

In Propagator.extract() we fork both current and isolation Scope and add that to the Context
In SpanProcessor.onStart() we fork current scope but we can only store it in global storage
- because we can't be sure execution went through Propagator but we always want a forked Scope for any OTEL span.
At some point OTEL auto instrumentation calls context.makeCurrent() which we intercept in SentryContextStorage. Here we wrap Context with a SentryContextWrapper if it hasn't yet. Since this Context may already have an OTEL span attached, we fork current scope again. For forking we have to check if the Scopes in the Context is an ancestor of the Scopes stored in global storage for the OTEL span in Context. If that's the case we instead fork the Scopes in global storage and set that as Scopes in Context.
Sentry.withScope API has to also fork current scope. Then it sets it on Context and stores that: Context.current().with(forkedScopes).makeCurrent(). It needs to hold on to the OTEL Scope that is returned as a lifecycle management object and call close (or rely on auto close) to restore the previous Context in storage.
In SpanProcessor.onEnd() we do not have access to Context.current(). While it may sometimes return the parent Context (i.e. the Context that was active, when the span was started) it may also return an unrelated Context in some cases.
... child spans go through steps 2,3 and 5 as well
Eventually the OTEL spans make it into SpanExporter.export which only has access to Scopes in the global storage. It can use those for additional data when creating transactions and spans.

Pitfalls:

The scope forking performed by SpanProcessor.onStart() is stored in global storage but Context also has a Scopes instance which is the parent, likely created by Propagator or a previous span / .withScope. This is why we have to do the ancestry check and prefer the Scopes from global storage.
We can only intercept Context forking via SentryContextWrapper once that Context has gone through SentryContextStorage at least once. Before that point we can only intercept Context storage.

Mar 07 '24 14:03 adinauer

POTEL Android

OTEL Android SDK

https://github.com/open-telemetry/opentelemetry-android Still in experimental stage. Provides features similar to Sentry-Android, however, in a very early stage and without any documentation on how to get this running. Bytecode instrumentation is done via bytebuddy (currently only available for okhttp). Tested with RumInitializer based on this medium post: https://medium.com/lumigo/exploring-the-new-frontiers-of-opentelemetry-android-b55979097538.

Running our instrumentation sample with the OTEL Android SDK and a Jaeger Backend, starting the app yields the following results:

Overview

Detail

Combining Sentry with OTEL Android SDK

The work done by @adinauer for OTEL on the backend can be used to initialize the OTEL Android SDK in a way that the Spans created by OTEL are sent to Sentry. By using our SentryContextStorage, PotelSentrySpanProcessor and SentrySpanExporter. Additional information that is written to the spans by the OTEL Android SDK (as seen in the jaeger screenshot above) would need to be extracted by the SentrySpanExporter to make them usable by Sentry (e.g. ScreenName, os, etc.)

With Manual init (Sentry.init) we get the following results in Sentry:

Overview

Detail

Taking all OtelSpan attributes and setting them as tags on the SentryTransaction in the SentrySpanExporter yields the following (these would need to be mapped to the correct keys we expect in sentry): Screenshot 2024-03-13 at 08 49 41

Performance using OTEL Java SDK combined with Sentry-Android incl. gradle plugin

This is probably the most the solution we should try to achieve. Starting a transaction using sentry (Sentry.startTransaction) would create an OTEL root span and creating child spans for transactions would also create OTEL spans. Thus we could keep the Java and Android SDK codebases in sync.

Open Questions for this approach:

How are Metrics/Profilers related to Performance product (are they bound to spans/transactions?) How can we handle this when using otel spans?

Mar 13 '24 07:03 lbloder

POTEL Android size comparison

SentryAndroid with SAGP + all instrumentations enabled

Bundle: 4.6 MB

SentryAndroid with SAGP + all instrumentations enabled+ OTEL Java SDK with SAGP

Bundle: 4.7 MB
Additional Dependencies compared to vanilla:
- io.sentry:sentry-opentelemetry-core:7.6.0 (local snapshot with new otel impl)
- io.opentelemetry:opentelemetry-sdk:1.36.0
- io.opentelemetry.semconv:opentelemetry-semconv:1.23.1-alpha

SentryAndroid with SAGP + all instrumentations enabled + OTEL Android SDK

Bundle: 5.3 MB
Additional Dependencies compared to vanilla:
- io.sentry:sentry-opentelemetry-core:7.6.0 (local snapshot with new otel impl)
- io.opentelemetry.android:instrumentation:0.4.0-alpha
- io.opentelemetry:opentelemetry-sdk:1.36.0
- io.opentelemetry.instrumentation:opentelemetry-instrumentation-api:2.1.0

SentryAndroid with SAGP + all instrumentations enabled + OTEL Android SDK

Bundle: 5.4 MB
Additional Dependencies compared to vanilla:
- io.sentry:sentry-opentelemetry-core:7.6.0 (local snapshot with new otel impl)
- io.opentelemetry.android:instrumentation:0.4.0-alpha
- io.opentelemetry:opentelemetry-sdk:1.36.0
- io.opentelemetry.instrumentation:opentelemetry-instrumentation-api:1.32.1

Mar 18 '24 09:03 lbloder

sentry-java sentry-java copied to clipboard

Java: POTel Research

Context Handling

Attributes vs Context

ContextStorageProvider

Hubs and Scopes merge

Execution context

Scopes

Scope forking

Applying scopes to events

Transferring execution context

Tracing without Performance (TwP) / PropagationContext

Copy on write

Locking

Lifecycle management

Migration docs

Document common use cases

Core question

How does this work in JS SDK?

Why do we care about OTEL Context?

Can't we simply propagate our Scopes ourselves and ignore OTEL Context?

Fallback solution

Intercepting Context storage

Side topic: Two way reference between OTEL Context and Sentry Scope

Why we need to globally store Sentry Scopes even if we have a proper Context forking hook anyways

Why do we need the Hubs/Scopes merge for POTEL?

How can Scopes storage work?

POTEL Android

OTEL Android SDK

Overview

Detail

Combining Sentry with OTEL Android SDK

Overview

Detail

Performance using OTEL Java SDK combined with Sentry-Android incl. gradle plugin

Open Questions for this approach:

POTEL Android size comparison

SentryAndroid with SAGP + all instrumentations enabled

SentryAndroid with SAGP + all instrumentations enabled+ OTEL Java SDK with SAGP

SentryAndroid with SAGP + all instrumentations enabled + OTEL Android SDK

SentryAndroid with SAGP + all instrumentations enabled + OTEL Android SDK

sentry-java
sentry-java copied to clipboard

Tracing without Performance (TwP) / `PropagationContext`

Why do we care about OTEL `Context`?

Can't we simply propagate our `Scopes` ourselves and ignore OTEL `Context`?

Intercepting `Context` storage

Side topic: Two way reference between OTEL `Context` and Sentry `Scope`

Why we need to globally store Sentry `Scopes` even if we have a proper `Context` forking hook anyways

How can `Scopes` storage work?