Turn off spans from a scope while retaining downstream spans and without breaking traces
Suppose you have a span hierarchy of ({tracerName}:{spanKind}:{spanName}):
tracerA:server:parent -> tracerB:internal:child -> tracerC:client:grandchild
And you decide that internal spans from tracerB are too noisy. You want to turn them off while:
- Not losing downstream spans. You want to keep
tracerC:client:grandchildbecause it contains important information about calls to downstream systems. - Not producing broken traces.
tracerC:client:grandchild's parent should betracerA:server:parent, nottracerB:internal:child.
How can you accomplish this? Well, you pretty much can't. A couple of problems prevent it:
- Samplers dont have access to scope, so there's no SDK configuration mechanism which can do this.
- Even if Samplers could make decisions based on scope, the sampling decision options don't make the desired behavior easy.
- If a sampler returns
DROPfortracerBand delegates toParentBased{root=AlwaysOn}for other spans: Then all descendants oftracerB:internal:childwill be dropped as well. Not what we want. - If a sampler returns
DROPfortracerBand delegates toParentBased{root=AlwaysOn,localParentNotSampled=AlwaysOn}for other spans: Then we end up with a broken trace sincetracerC:client:grandchildwill assign its parent to betracerB:internal:child, which is dropped. Not what we want. - If a sampler returns
DROPfortracerBand delegates toParentBased{root=AlwaysOn,localParentNotSampled=AlwaysOn}and instrumentation is aware and only sets spans in context ifSpan.isRecording()=true: Then we achieve the desired affect.tracerB:internal:childis dropped andtracerC:client:grandchildwill assign its parent to betracerA:server:parent, so we have a complete trace.
- If a sampler returns
- Even if Samplers could make decisions based on scope, its cumbersome to configure a sampler to have special rules for a scope. It's simpler to think of a sampling policy as orthogonal to the decision of which scopes / instrumentation libraries should be enabled.
So there is no way to get the behavior we want today. And even if samplers were given access to scope, we'd still need: a complex / non-standard sampling policy and coordination with instrumentation (i.e. checking Span.isRecording()==true). Ouch.
This is related to #530 and #1588, but the problem framing here is a little different so I think it warrants a separate issue.
Originally opened as https://github.com/open-telemetry/opentelemetry-java/issues/6197 but think we should provide a spec level solution for this.
@jack-berg, do you think this issue has some relationship with this one? https://github.com/open-telemetry/opentelemetry-specification/issues/3205 -- both issues seem to look for a way to remove noise intermediate spans without loosing the span tree.
I propose we solve this problem by starting to more fully realize the potential of scope as a concept. Today, scope is just a piece of metadata attached to records because there is no SDK level configuration surface area involving scopes (besides views) which allow you to select or do anything special with a scope.
I propose we introduce the the notion of scope level configuration for all three signals:
- Give each provider (TracerProvider, MeterProvider, LoggerProvider) the ability to register configuration which applies scope configuration to some selection of matching scopes. Let scopes be selected using a combination of scope name, version, schema, and attributes.
- When evaluating the scope config for a Tracer / Meter / Logger, use the first matching scope config.
- Define each signal with a unique definition of the types of things which are configurable at the scope level for the respective Tracer, Meter, Logger, but introduce uniformity where it makes sense. This reflects that there will be a mixture of overlapping and signal specific scope configuration options.
- To start, give each signal's scope config the ability specify that the scope is disabled. In the future, we might consider introducing scope specific samplers for tracing, or scope specific log level thresholds for logs, etc.
- When a scope is disabled, the tracer / meter / logger uses the behavior of the relevant noop tracer / meter / logger.
- For tracing, the noop tracer returns
Span.wrap(parentContext). This means that disabling a tracer is an effective solution to the broken trace problem discussed in this issue.
The more I think about #1588, the more I think that extending sampler to include access to resource and scope is a leaky abstraction. Sure a sampler could be used to change the behavior of scopes, but if you play that out, we'll end up with samplers with very complicated rules. It seems much cleaner to separate concerns, with a global sampler policy applicable to all scopes, and separate scope level configuration.
We really need to start to realize the potential of the scope concept. This proposal would lay the groundwork for that by introducing the simplest possible scope level configuration (enabled | disabled) while solving a very concrete use case.
@jack-berg, do you think this issue has some relationship with this one? https://github.com/open-telemetry/opentelemetry-specification/issues/3205 -- both issues seem to look for a way to remove noise intermediate spans without loosing the span tree.
@svrnm yes I do. I think right now we only have a hammer (i.e. sampler) so everything looks like a nail. If we had a scope level configuration, it would be a no brainer to add a severity level threshold option for logs, and we probably wouldn't nearly as interested in log sampling.
Thanks, @jack-berg! Utilizing scope to accomplish that kind of "span-level sampling" (for a lack of a better word) is really an interesting concept. It would solve the issue(s) discussed in #3205 and probably a few more.
I just try to wrap my head around with an example:
- Service A uses HTTP library H to call service B.
- In Service A there are 2 scopes, one for the "app" and one for the http library, in service B there is one scope.
- So similar to your example above there are now five spans
Server Span 1 in scope A -> Internal Span 2 in scope A -> Internal Span 3 in scope H -> Client Span 4 in scope H -> Server Span 5 in scope B - I do not care about span 2 + 3 (which may be multiple internal spans)
- If I follow what you have outlined above the current scoping A, H, B would not work so I would need to put those internal spans into their own scope (e.g. "internal A scope" and "internal H scope").
Is this a correct understanding?
Yeah you're correct. In my experience most instrumentation produces a single tier of spans. I know the idea of a single instrumentation scope producing both server + internal spans, or internal + client spans exists, but maybe those instrumentations would rethink that position if the SDK gave them tools that gave them an incentive to do so. Or maybe there is a need for additional scope level configuration beyond just the enabled or disabled.
I think adding the ability to enable / disable scopes is doing the simplest possible thing while addressing a glaring omission in SDK configuration. Its the low hanging fruit. After that, we can take stock and determine what else is missing.
Or maybe there is a need for additional scope level configuration beyond just the enabled or disabled.
I think disable/enable is a great starting point, and I see another usecase: there are plenty of questions around disabling individual instrumentation libraries. Of course one could argue that not loading them in the first place is what you should be doing, but if you start with a Java Agent, or a Node JS metapackage that has ALL the instrumentation libraries in it, it can be easier to disable those you don't care about. Now, if an instrumentation library has a well-defined scope, you could simply turn that off, and you are done.
The screenshot in https://github.com/open-telemetry/opentelemetry-demo/issues/659 is a good example of this use case. Of course boiling this down to the 3 spans you "really" need is a good solution, but what if I suspect an issue in Slim... or Psr... or I am a library developer and want to get that level of insight, if I can turn those scopes on/off, I have that level of flexibility without compromising on span creation.
It would be worth getting @lmolkova's input, she has been (and is) thinking about this problems. The problems she highlights are mostly focusing on nested client spans (a client span parenting another client span), but I think the solution proposed here could also apply to those cases.
- open-telemetry/oteps#172
- open-telemetry/semantic-conventions#674
Using the scope mechanism is a much more promising approach then inventing a new mechanism, as proposed in the OTEP linked above.
I'd like to add that .NET OTel has an existing mechanism for it.
Let's say instrumentation creates a scope (via ActivitySource aka tracer) called My.Source.
When you want to enable this instrumentation, you enable a source via traceProvider.AddSource("My.Source").
Let's say library also creates a scope called My.Source.LowLevelVerboseStuff. Users may (but don't need to) enable it as well, but separately.
If I want to enable all sources from a library, I can do it with traceProvider.AddSource("My.Source*") (which introduces some small but solvable problems like this one)
The same mechanism is used for meters.
Effectively source name acts like a logger name where you can control loggers by their namespaces.
The changes in the spec that I'd like to see to generalize this approach
- OTel SDK should allow enabling/disabling specific instrumentation scopes (by name?) for traces, metrics, and logs
- Instrumentation scope within one library/instrumentation should be granular enough to allow controlling scopes independently
- Instrumentations MUST document the scopes they emit
- Instrumentations SHOULD use the same scopes names across signals for the same instrumentation points (e.g. http spans and http metrics)
- [Update] It could be a good idea for semconv to recommend a specific set of scopes
- If instrumentation scope is disabled
- OTel SDK MUST return non-recording span (
null, or other language-specific convention can be used) - MUST NOT create new span context
- MUST NOT change active/current context
- OTel SDK MUST return non-recording span (
- OTel API SHOULD provide a way for instrumentations to check if scope is enabled before collecting pre-sampling attributes. This is necessary to allow writing verbose layers that also affect performance.
Nice-to-haves
- I'd be great to introduce
SamplingResult.Dropdecision as well (plus pass instrumentation scope to samplers) - it should result in the same behavior as if the whole scope was disabled but enable more granular per-span suppression
Per-scope enablement vs verbosity
As I mentioned before per-scope enablement is like a logger configuration, but there are two modes: on and off. I think we can add more modes later by introducing a new property on the source (verbosity or similar) with some default value and adding config options to enable all scopes >= x verbosity. But this would be an incremental improvement.
In the meantime we can just ask instrumentations to document their stuff and users to read docs. Distros would also need to pick good defaults.
One more example where this could be extremely useful is messaging where we have
- per-message span to inject unique context and make each message traceable independently
- send span tracing send operation for a batch of messages.
This creates problems for some users:
- Many applications always send one message in a batch even though internally SDK treats everything as a batch.
- Or they don't really need per-message tracing (when doing stream processing/aggregation)
- Or their batches are so big that per-message tracing becomes prohibitively expensive (perf and volume of spans)
The suppression mechanism could allow users to decide if they want per-message tracing and enable it then. Or if they are fine if all messages share the same span context (in which they were sent).
This brings one more small requirement (which is probably already covered, but maybe worth mentioning anyway):
- instrumentations SHOULD be able to access and inject the context (theirs or their parent's) even if the span they created is suppressed/disabled
Thank you @lmolkova for this detailed plan on how to tackle this issue. A question, it looks like this supersedes https://github.com/open-telemetry/opentelemetry-specification/issues/3205? For the sake of having less issues, I am happy to close it then.
I think disable/enable is a great starting point, and I see another usecase: there are plenty of questions around disabling individual instrumentation libraries.
This is a use case I'm dealing with right now where it could be very useful to disable/enable a specific library instrumentation dynamically. My use case involves remote configuration for Android applications where users can decide what to disable/enable after they've deployed their app, and since library instrumentations are added at compile time in Android, unloading an instrumentation at runtime is not an option, so I think this proposal can help addressing that use case in a clean way too.
Thank you @lmolkova for this detailed plan on how to tackle this issue. A question, it looks like this supersedes #3205? For the sake of having less issues, I am happy to close it then.
@svrnm I believe this issue only covers on/off, but not the formal definition of verbosity per se. If this one is resolved, #3205 would become much more straightforward, but still not covered. So I'd rather keep #3205 open (unless you feel that on/off is good enough).
I think this is also related to some stuff I'd been kicking around for years now,
https://github.com/open-telemetry/opentelemetry-specification/pull/2555 https://github.com/open-telemetry/opentelemetry-specification/pull/3103 https://github.com/open-telemetry/oteps/pull/186
I planned to attempt again to find a solution in a rebooted Convenience SIG/WG. At least I think it would fall under that SIG/WG.
Is there any interest for tackling this there?
This approach sounds good to me. Note that this approach does not exclude Sampler-related mechanisms that might solve the problem of broken traces due to sampled-out intermediate spans.
@tsloughter my immediate interest is to add the simplest possible scope level config mechanism across signals, by just providing a way to enable / disable scopes. I think that once a scope level config mechanism exists, its more straight forward to pursue something like #2555, but also, the need diminishes because at least part of the motivation is already accommodated by disabling scopes.
I think that #3103 is somewhat related, but should be separately pursued.
As for https://github.com/open-telemetry/oteps/pull/186, there's definitely overlapping motivation, but I question the value of having all of the tracerprovider configuration options configurable at the tracer level. I think it makes sense to consider each property on a case by case basis. This will be easier to do once a scope level config mechanism exists.