sentry-java
sentry-java copied to clipboard
Java: POTel Research
- Notion POTEL Doc
- Talked to Francesco to learn from Node/JS changes
### Statement on OTel support - tech feasibility and quality
- [ ] Java (and frameworks)
- [ ] Kotlin
- [ ] Android
- Currently working on a backend POC
- trying to translate JS to Java then go from there and check for blockers, required API changes etc.
- [x] increasing internal list of spans
- [x] requests to sentry are not filtered properly
- [x] spans are reported mulitple times with increasing duration
- [ ] Check differences in context handling
- Also need to do a POC for Android + OTEL to check for blockers
- Check how good auto instrumentation for Spring (Boot) is without the Agent
Context Handling
Attributes vs Context
- Attributes can only hold few types (
String,Boolean,Long,Double) and arrays thereof (seeAttributeType) - Context can also hold complex types
- [ ] Need to test whether
Context.current()works inSpanProcessor- [x] Initial test looks promising
- [ ] but there might be more complex scenarios where this does not work
- Attributes can be retrieved from
ReadableSpaninSpanProcessor.onEnd() - Context can not be retrieved from
ReadableSpan - Context has an internal split, when
AgentContextWrapperis used (presumably this is always the case when using the OTEL Java Agent)agentContextvsapplicationContext
ContextStorageProvider
- We could implement a custom
ContextStorageProviderto intercept whenever context is replaced and fork scope - We'll also have to figure out when to fork isolation scope
Hubs and Scopes merge
- https://develop.sentry.dev/sdk/hub_and_scope_refactoring/
- Hub will be removed long term (future major release)
- Hub can simply forward to scope under the hood so we may be able to change the code without requiring a major
Execution context
- [ ] Should we explicitly have
ExecutionContextin code to wrap isolation scope, current scope and maybe also current span? - [ ] Would be the replacement thread local variable for
currentHub
Scopes
- There will be multiple scopes
- Global scope
- Truely global scope (process wide), not per client
- Can be written to even before
Sentry.init - Can be used e.g. to add tags that will be added to all events etc.
- before this basically required users to add tags immediately after
Sentry.initwas called - this can now be customized even from asynchronously running code, e.g. after some lib initializes or after some API call was made
- before this basically required users to add tags immediately after
- Holds a reference to the default global client
- to add tags, set user etc. users need to explicitly perform
Sentry.getGlobalScope().setTag()etc.
- Isolation scope
- similar to a global scope, but tied to a specific section of your app/code that should be isolated
- e.g. for the duration of handling a request server side
- Users don't have to worry about writing to a scope that doesn't apply to the whole request by accident, they can now use isolation scope e.g. to add tags or set a user that is applied to the whole request
- Global methods to set data, like
Sentry.setTag()orSentry.setUser(), will set on the isolation scope.
- similar to a global scope, but tied to a specific section of your app/code that should be isolated
- Current scope
- Global methods that capture like
Sentry.captureException(e)will capture on the current scope - While capturing global scope, isolation scope and current scope are merged and applied to events
- Global methods that capture like
- Global scope
Scope forking
See https://miro.com/app/board/uXjVNtPiOfI=/
- When forking current scope (
with new_scope)- forks current scope and sets the fork as active current scope
- does not fork isolation scope (isolation scope remains untouched)
- [ ] this is the only of these methods that's part of public API, others can be usable by users but not part of public API directly
- When setting current scope (
with use_scope(scope))- uses passed in scope as active current scope
- does not fork current scope
- isolation scope remains untouched
- When forking isolation scope (
with isolation_scope)- forks isolation scope and sets the fork as active isolation scope
- forks current scope and sets the fork as active current scope
- When setting isolation scope (
with use_isolation_scope(isolation_scope))- uses passed in isolation scope as active isolation scope
- [ ] do we even need this?
- When setting isolation scope and current scope
- uses passed in isolation scope as active isolation scope
- uses passed in current scope as active current scope
- does not fork either scope
- [ ] maybe this should be called
with execution_context(execution_context)
Applying scopes to events
public captureEvent(event: Event, additionalScope?: Scope) {
// Global scope is always applied first
const scopeData = getGlobalScope().getScopeData();
// Apply isolations cope next
const isolationScope = getIsolationScope();
merge(scopeData, isolationScope.getScopeData());
// Now the scope data itself is added
merge(scopeData, this.getScopeData());
// If defined, add the captureContext/scope
// This is e.g. what you pass to Sentry.captureException(error, { tags: [] })
if (additionalScope) {
merge(scopeData, additionalScope.getScopeData());
}
// Finally, this is merged with event data, where event data takes precedence!
mergeIntoEvent(event, scopeData);
}
}
Transferring execution context
- When moving execution from one thread to another, we should set isolation scope and current scope to the same ones on the new thread
- For reactive programming libraries we'll have to backup and restore isolation scope, current scope and maybe also the current span (if it's not set on the scope but directly on execution context)
- We should probably not fork isolation scope likely to not diminish its usefulness
- see https://miro.com/app/board/uXjVNtPiOfI=/
Tracing without Performance (TwP) / PropagationContext
- Should this be put on isolation scope?
- Should we have a more abstract API for isolation that
- forks isolation scope
- forks current scope
- creates a new PropagationContext
Copy on write
- [ ] Scopes should only be actually copied when there's a change.
- For current scope there should be a very limited number of actual changes but lots of forks
- [ ] Is the most likely change happening setting a new active span?
Locking
- [ ] we usually lock write access to scope via withPropagationContext, withSession, setTransaction/clearTransaction, startSession/endSession, how will this behave in the future?
Lifecycle management
- [ ] Things like
pushScopeandpushIsolationScopecould return anAutoClosablewhere onclosewe restore the previous scope. This would allow users to simply writetry(x = pushScope()) { ... }instead of manually having to popScope. It'd also allow for restoring previous state without risking an imbalance betweenpushScopeandpopScopecalls. OTEL does this, e.g. inContext.makeCurrent().
Migration docs
- [ ] May need to make it extra clear in changelog, migration guide etc. that top level API (e.g.
Sentry.setTag()) now goes to isolation scope whereas e.g.Sentry.configureScope(s -> s.setTag())goes to the current scope. This may lead to unexpected scope leaks for our users. - [ ] scope
- [ ] pushScope+popScope -> ?
- [ ] withScope -> ?
- [ ] configureScope -> ?
- [ ] direct hub usage
- [ ] hub.captureX -> scope.captureX ?
Document common use cases
- [ ] setting tags, breadcrumbs, user, ... for the current request ->
Sentry.addTag() - [ ] setting tags, breadcrumbs, user, ... globally ->
Sentry.getGlobalScope().setXetc.
@adinauer i converted this to an issue, so we can make more proper task lists and potentially just convert it to an issue/epic later
After daily discussion about OTEL Context forking with @sl0thentr0py , @antonpirker, @sentrivana , @szokeasaurusrex
Core question
- Q: In case there is no hook for any given SDK, should we fall back to a global storage plus OTEL
Contextusage combination or try and get a proper hook by contributing upstream (OTEL) to completely rely on OTELContextfor storing SentryScopes? - A: Try and get the hook upstream, see how that goes and only if it doesn't get merged consider the fallback solution.
Side note: Sentry Scopes is just a collection object for an isolation Scope as well as a current Scope.
Miro board showcasing pure Context usage and combined approach
How does this work in JS SDK?
JS SDK uses a hook provided by OTEL called ContextManager.with(...) which allows them to return a modified copy of OTEL Context containing Sentry Scopes.
While there also is a global storage for OTEL spans, this should only be needed for building Sentry transactions. Once span streaming lands, we should be able to remove that global storage.
While for JS SDK there also is a distinction between Context forking and it being sent to ContextManager.with(...) API should make sure, Context forks go through ContextManager.with(...).
Why do we care about OTEL Context?
OTEL takes care of propagating that Context e.g. from thread A to thread B, stashes Context and restores it whenever needed. We want to rely on this mechanism as otherwise we'd have to provide our own propagation mechanism for each library supported by OTEL. This means we wouldn't be able to simply leverage available OTEL auto instrumentation but have to implement a propagation mechanism that's already there in OTEL via Context.
Can't we simply propagate our Scopes ourselves and ignore OTEL Context?
We could but we'd have to provide a counterpart for every OTEL Context propagation mechanism. We would no longer be able to just rely on OTEL for instrumentations but have to mirror some of what they do. This would also have to be kept up to date. Whenever OTEL changes something about how they propagate Context we'd have to adapt as well. Keeping Context and Scopes in sync would be hard and possibly also hard to detect.
Examples of where we'd have to provide our own propagation:
- Reactive libraries (current Sentry implementation vs OTEL)
- Executor libraries where you can schedule a task to be run on another thread (current Sentry implementation
Fallback solution
For Java SDK SpanProcessor seems like the only place where we can intercept Context forking - at least for cases where Context is forked due to a new OTEL span being created.
There is no available mechanism for manipulating Context and have it used. While a parent Context is handed into SpanProcessor, Context is immutable so we could only create a modified copy of it but there's no way of e.g. returning it and having it used by OTEL.
This means we have to keep a global storage, where OTEL span is used as a weakly referenced key and Sentry Scopes is the value. Now if we combine this with Sentry.withScope API - which can be called at any time - there's no easy way to add those to global storage anymore. Instead we could rely on ~ Context.with(SentryScopes.fork()).makeCurrent() inside Sentry.withScope and similar API. This means for cases where we create a new scope, we add it to the OTEL Context and manage lifecycle to restore the previous Context once e.g. Sentry.withScope ends.
Due to now having two places where we can store Sentry Scopes, whenever the current Scope is retrieved, e.g. via Sentry.getCurrentScope(), we have to look at both places.
We could look at Scope hierarchy and check if any of the parents of the Scope stored in Context is the Scope we created for an OTEL span so we know whether it's nested inside the span Scope (disclaimer: didn't spend too much time thinking through how this would actually have to work). Where this falls short is when Sentry.setCurrentScope(new Scope()) is called because then hierarchy isn't available. There may also be other cases where this gets complicated - ideally we don't go down this road and not have to worry about it.
Intercepting Context storage
In Java SDK it's possible to provide a custom ContextStorage. While this makes it possible to modify Context before storing it, there's some pitfalls:
Contextforking isn't the same thing asContextstoring. AContextmay be forked without ever being stored in thisContextStorage.ContextStoragereturns an OTELScope, which is a lifecycle management object to restore the previousContext. It does not allow to return a modifiedContextso if a user calls.makeCurrent()on a context, theContextthat ends up in storage isn't the same one the user references but a modified copy of that.- It's currently unclear if these problems matter for us, but in Java SDK, API makes it easy to fork a
Contextwithout ever sending it to store via.makeCurrent(). - ~~We couldn't get our custom
ContextStorageto be used by both the application and the agent viaContextStorageProvider. Instead we'd have to use non public APIContextStorageWrappers.addWrapperto register our custom storage. This could easily break in future versions.~~ We now useContextStorage.addWrapperto register our wrapper aroundContextStorage, which is public API.
Some more alternatives and why they aren't helping can be found in https://github.com/open-telemetry/opentelemetry-java-instrumentation/discussions/10691
Side topic: Two way reference between OTEL Context and Sentry Scope
When implementing POTEL for any SDK we should try and avoid this as it could easily get out of sync and lead to bugs.
Why we need to globally store Sentry Scopes even if we have a proper Context forking hook anyways
In SpanProcessor.onEnd and by extension also SpanExporter.export, Context.current() returns the parent Context and not the Context that was active during span execution. This seems to be by design in OTEL. This means even if we managed to add our Scopes to Context, we can't access it in SpanProcessor.onEnd().
To work around this, we're planning to use a global storage that weakly references OTEL span -> Scopes. Whenever SpanProcessor.onStart is called, we stash away Scopes referenced weakly by the OTEL span.
We'd like to fork isolation scope in Propagator.extract as that should be called for all server / consumer use cases we we likely want a new isolation scope.
We do not yet know how we want to handle current scope in combination with SpanProcessor.
In JS, there's a shared Scopes reference between global storage and Context which is updated in a custom copy of the OTEL http integration. The http integration forks isolation scope and sets it on the shared Scopes instance.
I gave wrapping the OTEL Span a quick shot but didn't go through (yet).
Wrapping Span in ContextStorage / ContextWrapper:
- we're not guaranteed to see all
Spans there - while this may work for auto instrumentation, users performing manual instrumentation can simply call
span.end()on the reference they got fromspanBuilder.startSpan()instead of callingSpan.current().end()(where our interception could do something) see OTEL docs. - To wrap
Spancreation inSpanBuilderwe'd have to open up a bunch of non public API inopentelemetry-java(SentrySdkTracerProviderBuilder,SdkTracerProvider,SdkSpan) and extract some interfaces (SdkTracerProviderSdkSpan). Even then, we'd have to wrap all the way fromSentrySdkTracerProviderBuilderthroughSdkTracerProvider,TracerBuilder,Tracer,SpanBuildertoSpan/SdkSpan. - There's some places in OTEL SDK where they cast to
SdkSpanmeaning we'd have to extract an interface. - With this much wrapping, and delegating to original instances, we'd have to maintain this delegation between releases of OTEL SDK since new methods could be added to interfaces with default imlementation. This wouldn't be a breaking change but our wrappers would also have to be updated to delegate those new methods, otherwise behaviour could break.
For other SDKs it might be possible to attach the Sentry Scopes to the OTEL span in SpanProcessor.onStart(), so we have access to it when forking/storing Context and in SpanExporter.export().
Why do we need the Hubs/Scopes merge for POTEL?
We do want to go through with the Hubs/Scopes merge in all of our SDKs regardless of POTEL. We want to remove the hub concept for users and also provide a "copy-on-write" that prevents footguns like popScope in an async execution context where it affects previous execution context because the user forgot to clone the hub. With POTEL the problem of having to write to the correct scope becomes even more complex since we'd fork scope much more often than in the current version of the SDK. Isolation scope aims to solve that.
We want to have a scope forking system that is consistent with how OTEL does forking.
Some of the limitations with current Hubs/Scopes can be found here.
Here's some reasons I found but so far I'm not sure any of those actually apply to the Java SDK in a way where it wouldn't be possible to make POTEL work with hubs.
- Context is immutable in OTEL, and works via copy-on-write - so each time you want to change anything on the context, you fork it. This means that you have much more forking than you have in Sentry currently (source: RFC and Dev Docs)
- copy-on-write here is referring to avoiding a problem with current SDK where unless you clone the hub, any async execution can manipulate the previous execution, e.g. by calling
popScope
- copy-on-write here is referring to avoiding a problem with current SDK where unless you clone the hub, any async execution can manipulate the previous execution, e.g. by calling
- since you always fork a context when you change the active span (as the span is set on the context, and updating it means forking a context), you can quickly get into deeply nested context situations due to auto instrumentation. (source: RFC)
- This is the reason for adding isolation scope which again is part of providing the copy-on-write mentioned above
- Hub/Scope forking can already be an issue today in SDKs but it is exagerated by switching to POTEL
- OTEL auto instrumentation must attach correctly to whatever is the “active span”. This means if I create an active span in Sentry via Sentry.startActiveSpan this must also become the active span on the active context in OTEL. (source: Notion)
- ❓ If we use OTEL API to create a span from the Sentry SDK and make that span the current one, shouldn't that write to the OTEL Context and solve this?
- If I use scope.getSpan() I must get the active Sentry span. This means I always need to know what the active Sentry span is, even when we are auto instrumenting with OTEL. (source: Notion)
- ❓ Shouldn't we simply be able to retrieve the current span from OTEL Context.current().getSpan()
- ❓ Is there even a concept of a Sentry span anymore or is this simply a wrapper around an OTEL span?
- Hub splitting must match context creation in OTEL. This means that whenever we create a separate hub for an execution context in Sentry, this must be synced with the context in OTEL, otherwise errors/spans/breadcrumbs etc. may mismatch. (source: Notion)
- ❓ For Java SDK this should be as simple as Context.current().with(hub).makeCurrent()
How can Scopes storage work?
See miro top left which shows a flow through some OTEL auto instrumentation and Sentry API calls.
- In
Propagator.extract()we fork both current and isolationScopeand add that to theContext - In
SpanProcessor.onStart()we fork current scope but we can only store it in global storage- because we can't be sure execution went through
Propagatorbut we always want a forkedScopefor any OTEL span.
- because we can't be sure execution went through
- At some point OTEL auto instrumentation calls
context.makeCurrent()which we intercept inSentryContextStorage. Here we wrapContextwith aSentryContextWrapperif it hasn't yet. Since thisContextmay already have an OTEL span attached, we fork current scope again. For forking we have to check if theScopesin theContextis an ancestor of theScopesstored in global storage for the OTEL span inContext. If that's the case we instead fork theScopesin global storage and set that asScopesinContext. Sentry.withScopeAPI has to also fork current scope. Then it sets it onContextand stores that:Context.current().with(forkedScopes).makeCurrent(). It needs to hold on to the OTELScopethat is returned as a lifecycle management object and callclose(or rely on auto close) to restore the previousContextin storage.- In
SpanProcessor.onEnd()we do not have access toContext.current(). While it may sometimes return the parentContext(i.e. theContextthat was active, when the span was started) it may also return an unrelatedContextin some cases. - ... child spans go through steps 2,3 and 5 as well
- Eventually the OTEL spans make it into
SpanExporter.exportwhich only has access toScopesin the global storage. It can use those for additional data when creating transactions and spans.
Pitfalls:
- The scope forking performed by
SpanProcessor.onStart()is stored in global storage butContextalso has aScopesinstance which is the parent, likely created byPropagatoror a previous span /.withScope. This is why we have to do the ancestry check and prefer theScopesfrom global storage. - We can only intercept
Contextforking viaSentryContextWrapperonce thatContexthas gone throughSentryContextStorageat least once. Before that point we can only interceptContextstorage.
POTEL Android
OTEL Android SDK
- https://github.com/open-telemetry/opentelemetry-android
Still in experimental stage. Provides features similar to Sentry-Android, however, in a very early stage and without any documentation on how to get this running.
Bytecode instrumentation is done via bytebuddy (currently only available for okhttp).
Tested with
RumInitializerbased on this medium post: https://medium.com/lumigo/exploring-the-new-frontiers-of-opentelemetry-android-b55979097538.
Running our instrumentation sample with the OTEL Android SDK and a Jaeger Backend, starting the app yields the following results:
Overview
Detail
Combining Sentry with OTEL Android SDK
The work done by @adinauer for OTEL on the backend can be used to initialize the OTEL Android SDK in a way that the Spans created by OTEL are sent to Sentry. By using our SentryContextStorage, PotelSentrySpanProcessor and SentrySpanExporter. Additional information that is written to the spans by the OTEL Android SDK (as seen in the jaeger screenshot above) would need to be extracted by the SentrySpanExporter to make them usable by Sentry (e.g. ScreenName, os, etc.)
With Manual init (Sentry.init) we get the following results in Sentry:
Overview
Detail
Taking all OtelSpan attributes and setting them as tags on the SentryTransaction in the SentrySpanExporter yields the following (these would need to be mapped to the correct keys we expect in sentry):
Performance using OTEL Java SDK combined with Sentry-Android incl. gradle plugin
This is probably the most the solution we should try to achieve.
Starting a transaction using sentry (Sentry.startTransaction) would create an OTEL root span and creating child spans for transactions would also create OTEL spans. Thus we could keep the Java and Android SDK codebases in sync.
Open Questions for this approach:
How are Metrics/Profilers related to Performance product (are they bound to spans/transactions?) How can we handle this when using otel spans?
POTEL Android size comparison
SentryAndroid with SAGP + all instrumentations enabled
- Bundle: 4.6 MB
SentryAndroid with SAGP + all instrumentations enabled+ OTEL Java SDK with SAGP
- Bundle: 4.7 MB
- Additional Dependencies compared to vanilla:
- io.sentry:sentry-opentelemetry-core:7.6.0 (local snapshot with new otel impl)
- io.opentelemetry:opentelemetry-sdk:1.36.0
- io.opentelemetry.semconv:opentelemetry-semconv:1.23.1-alpha
SentryAndroid with SAGP + all instrumentations enabled + OTEL Android SDK
- Bundle: 5.3 MB
- Additional Dependencies compared to vanilla:
- io.sentry:sentry-opentelemetry-core:7.6.0 (local snapshot with new otel impl)
- io.opentelemetry.android:instrumentation:0.4.0-alpha
- io.opentelemetry:opentelemetry-sdk:1.36.0
- io.opentelemetry.instrumentation:opentelemetry-instrumentation-api:2.1.0
SentryAndroid with SAGP + all instrumentations enabled + OTEL Android SDK
- Bundle: 5.4 MB
- Additional Dependencies compared to vanilla:
- io.sentry:sentry-opentelemetry-core:7.6.0 (local snapshot with new otel impl)
- io.opentelemetry.android:instrumentation:0.4.0-alpha
- io.opentelemetry:opentelemetry-sdk:1.36.0
- io.opentelemetry.instrumentation:opentelemetry-instrumentation-api:1.32.1