opentelemetry-specification Add ability for SpanProcessor to mutate spans on end

Fixes #1089.

In addition to the comments on the issue, this was discussed in the spec SIG Meeting on 2024/23/04:

The filtering use-case explained in this comment should rather be solved by the upcoming samplerV2 instead of SpanProcessors due to better conceptual fit
The buffering use-case also explained in this comment seems to be not relevant enough to influence the design decision
Apparently there was a discussion around building the SpanProcessors in a chaining fashion during the initial SDK spec design and it was actively decided against it. However, no one could recall the reason why.

Based on this input, I decided to not move the chaining-based solution forward and stay with the original proposal of adding a new callback to be invoked just before a span is ended.

I also decided to name the new callback OnEnding instead of BeforeEnd as suggested in this comment. The name OnEnding is more precise about when the callback is invoked.

A big discussion point in the SIG Meeting on 2024/23/04 also was the problem of evolving SDK plugin interfaces without breaking backwards compatibility. This PR contains a proposal to clarify how this should be handled: If the language allows it, interfaces should be directly extended. If not possible, implementations will need to introduce new interfaces and accept them in addition to the existing ones for backwards compatibility. I feel like this allow every language to implement changes to interfaces in the best way possible. Of course, changes to interfaces should still be kept to a necessary minimum.

I also wasn't sure whether this change warrants an addition to the spec-compliance-matrix, so I've left it out for now. Please leave a comment if you think this is required, then I'll add it.

Changes

Adds a new OnEnding callback to SpanProcessor
Add a paragraph on clarifying how languages should deal with interface extensions
[x] Related issues #1089
[ ] ~Related OTEP(s) #~
[x] Links to the prototypes (when adding or changing features)
- Java (Callback is still called beforeEnd in that PoC)
[x] CHANGELOG.md file updated for non-trivial changes
[ ] spec-compliance-matrix.md updated if necessary

Apr 30 '24 08:04 JonasKunz

What is the benefit of onEnding() compared to pass the read/write span to onEnd()? Both are sync, both are called at the same time (within the Span.End() api).

To my understanding both changes are breaking for SpanProcessor implementers. Either they have to add a method or change the type.

May 08 '24 14:05 Flarna

What is the benefit of onEnding() compared to pass the read/write span to onEnd()?

One benefit is that while order of span processor registration matters for onEnding(), it doesn't today and wouldn't with this change with onEnd(). Implementers of onEnd() can continue to rely on semantics where the they get an immutable span which cannot be impacted by other span processors.

To my understanding both changes are breaking for SpanProcessor implementers. Either they have to add a method or change the type.

We're working out the details of that in #4030, but I disagree that adding a new method to an SDK plugin interface is a breaking change. Languages vary in terms of how they expose these plugin interfaces and the language features for evolving interfaces, but there it should be possible for all SDKs to accommodate this type of change, albeit with some creativity in some cases. For example, its simple in java to add a new default method to an interface which existing implementers don't need to implement. This doesn't exist in some language like go, but there you can offer a new ExtendedSpanProcessor interface which includes the new method, and provide a new mechanism for registering that new span processor interface with TracerProvider.

May 08 '24 14:05 jack-berg

@JonasKunz I am curious to know whether you think it is essential for the on-end handler being discussed in this PR to make changes that are visible to other processors. In my opinion, it would be better if the changes made on-end applied to only the exporter associated with the processor making the change.

In order to satisfy the discussion in #4010, I feel that we should not call the object passed to the on-end handler a "read write span". The span object which is newly live when passed to the OnStart handler is a real span ("writable") and it is a span data object ("readable"). I do not think that processors should receive a real span on end (with which they could use real span APIs, for example, e.g., place the span's context back into a context and invoke a new child span).

I think it makes sense for span processors to receive span data objects on-end (i.e., not real spans, but data objects). If at that point a processor wants to modify the span, they can modify the span data object which they pass to their own exporter, but their changes will not impact other processor/exporters.

May 20 '24 21:05 jmacd

In my opinion, it would be better if the changes made on-end applied to only the exporter associated with the processor making the change.

I agree that, in the spirit of keeping consistency around the expected behavior of the SDK, it does make sense to follow a similar approach for whether to make processors affect other processors vs only their associated exporter. And I think #4010 has good points on it, to be honest, the existing behavior is quite strange so I think it's great that it's being revisited there.

However, I'm wondering if this PR is the right place to have this discussion, as it seems to affect SpanProcessors in a broader way, given that, should we decide in #4010 to switch to processors with independent pipelines, wouldn't it mean that (for SpanProcessors) we should address OnStart() as well? At least if we wanted to keep consistency in the overall SDK behavior anyway.

My understanding is that this PR is focusing on adding some extra functionality while keeping consistency with the existing patterns used in the rest of SpanProcessor, hence the "read write span" and the fact that a processor affects others down in the chain, even if it's not ideal, I think it's important to keep a consistent behavior. Now, this would also mean that, if in the future we decide to go with the changes proposed in #4010, then both OnStart() and OnEnding() should adapt to it imo.

May 22 '24 07:05 LikeTheSalad

I think there are usecases for both variants

I might have two backends and enrich spans depending on the backend
I might want to enrich spans in general

I guess something like a processing pipeline would be more flexible to allow putting one processor after the other. This would avoid that meed to implement a BatchEnrichingProcessor next to a SimpleEnrichingProcessor just to get enrichment and batching/non batching. If there are more orthogonal functionalities this can get quite cumbersome.

There are already SpanProcessors which mutate in onStart without exporting, e.g. here.

May 22 '24 08:05 Flarna

@JonasKunz I am curious to know whether you think it is essential for the on-end handler being discussed in this PR to make changes that are visible to other processors. In my opinion, it would be better if the changes made on-end applied to only the exporter associated with the processor making the change.

@jmacd In my opinion, this would remove the benefits this PR is intending to add: Making it easier for users to enrich spans in span processors before they are exported.

There is already a workaround for doing this with the existing onEnd() callback: Users can decorate/wrap the SimpleSpanProcessor / BatchSpanProcessor and wrap the incoming ReadableSpans in order to pass the mutated ones to the span processor invoking the exporter. We actually implemented this exact mechanism to provide a span processor which collects stacktraces as attributes for long running spans here. The problem with this approach that it requires a big amount of boilerplate code and doesn't play well with SDK autoconfiguration. This PR was an example use-case for why we wanted to drive #1089 forward for providing a simpler, SDK-native approach for enriching spans on end.

So the problem I see with making changes in onEnding not visible to other users is that it again pushes the responsibility to setup bootstrap pipelines to the user.

My take here is that there are two viable classes of solutions for allowing easy enrichment of spans via the SDK:

Adding a callback which mutates the actual span (like onStart)
Adding the capability to build pipelines which pass around immutable (but wrappable) span data where the terminal stage of the pipeline is some consumer (e.g. a span exporter)

This PR specs out solution category 1. In this comment I discussed both solution classes and also provided backwards-compatible PoC implementations for Java for both. However, the arguments for 2.) seemed to not be convincing enough (this was also discussed in a spec-SIG Meeting), I went for 1. in this PR.

We can definitely revisit this decision! However, if we do go for a pipelining approach, I'd propose to rather enhance the onEnd callback for it instead of introducing a new onEnding callback.

However, if we go for the pipeline approach, I'd propose to separate the construction of processors from the assembly of the pipeline by either:

Passing the next pipeline stage to call on invocation as a Function to onEnd
Introducing a separate init callback which receives the next pipeline stage as a function

The reason why I'm suggesting is that if we leave the pipeline assembly to the user via decoration/wrapping, this makes the pipeline structure a blackbox for the SDK.

If instead the pipeline assembly is managed by the SDK, it has more control over it: e.g. it could insert logging or telemetry between the pipeline stages if necessary. It also plays much better with autoconfiguration.

However, I'm wondering if this PR is the right place to have this discussion, as it seems to affect SpanProcessors in a broader way, given that, should we decide in #4010 to switch to processors with independent pipelines, wouldn't it mean that (for SpanProcessors) we should address OnStart() as well? At least if we wanted to keep consistency in the overall SDK behavior anyway.

I'm not sure whether onStart is a good candidate for pipelining. Pipelines make sense if you have some consumer at the end. While onStart might have such consumers (e.g. creating a metric about started spans), it usually is used to mutate spans before they are visible to the user/application code. I don't see how the second aspect fits nicely into a pipeline model vs just invoking onStart on all SpanProcessors in a loop.

May 22 '24 11:05 JonasKunz

Thank you @JonasKunz. I understand the motivation and agree with this change. Also, you've helped me understand how the change I'm looking for can be made orthogonally

@jmacd so if my understanding is correct, you are planning to add some kind of pipelining / chaining capabilities to OnEnd (or renamed OnExport), right?

I wonder if the addition of such pipelines would render the OnEnding callback proposed here useless and whether we should flesh out the pipelining first before moving ahead with this PR.

The main differences in terms of capabilities between OnEnding and the pipelining I can think of are the following:

OnEnding mutates the original span, these changes are therefore visible to anyone holding a reference to that span. I don't know if this has any effect though: The user application might be holding onto API-Spans, but those are write-only anyway to my understanding. So there wouldn't be any way for the application to actually observe those changes made in OnEnding
OnEnding is guaranteed to be executed on the application thread ending the span. We might not guarantee this for the pipeline stages
OnEnding is guaranteed to be invoked for every span (unless there is a bug in the instrumentation with dangling spans of course), this might not be the case for pipeline stages if we allow filtering there

So what do you think? Should we move ahead with this PR and get it merged or should we wait and check the overlap with the pipeline approach you are coming up with?

May 24 '24 11:05 JonasKunz

@JonasKunz I assumed that you mean for OnEnding() to receive a "read/write span object" the same as is passed to OnStart(). I think it means that processors can read (the data object) and write (the span object).

OnEnding mutates the original span, these changes are therefore visible to anyone holding a reference to that span.

The requirements, as written, I think ensure that callers are not permitted to use the span reference after End() is called, though the processors are still permitted to via direct access from OnEnding(). There's a bit of conflict with:

It SHOULD be possible to keep a reference to this span object and updates to the span SHOULD be reflected in it.

I'd probably tack on "SHOULD be reflected in it until End() is called on the Span".

For the last two bullets, I don't see a problem. I expect the OnEnd() callbacks all to execute before the export begins (a.k.a. OnEnd()). Nothing is stated about when the OnEnd happens, but after your change it should be clear that pipelining effects (whatever they are) begin with the OnEnd call. I think OnEnding() makes sense the way you have it -- and there are real use cases so we don't need to block for future design work.

May 24 '24 15:05 jmacd

Discussed this in the context of #4062

May 28 '24 15:05 jmacd

@pellared Would you say that if #4030 is accepted, this PR should be approached differently? (To me, it seems like the answer is yes.) I've speculated about what the solution might look like in https://github.com/open-telemetry/opentelemetry-specification/pull/4062#pullrequestreview-2086466483, which is briefly for SDKs to support a "FanoutProcessor" that gives users control over whether mutations are private to their export pipeline and/or visible to the next processor.

May 29 '24 22:05 jmacd

@pellared Would you say that if #4030 is accepted, this PR should be approached differently?

I think this PR is fine and it is a logical extension. We should only highlight that processor does not need to implement the newly added method for backwards compatibility.

May 31 '24 16:05 pellared

I just saw the Java prototype. It would be great if we also have a prototype in Go (currently I have no cycles to work on it).

Jul 04 '24 11:07 pellared

I just saw the https://github.com/open-telemetry/opentelemetry-java/pull/6367. It would be great if we also have a prototype in Go (currently I have no cycles to work on it).

@pellared Note that this prototype is outdated does not have the protection agains concurrent modifications we discussed. It will actually make the implementation even simpler. I'll update it to have the intended protection.

Jul 04 '24 11:07 JonasKunz

@pellared Nevermind my previous comment, my prototype already had the protection even though it wasn't there intentionally. I didn't notice we were already holding the lock while invoking onEnding. I just updated the prototype to reflect the naming of the callback suggested in this spec-PR.

Jul 04 '24 12:07 JonasKunz

This PR was marked stale due to lack of activity. It will be closed in 7 days.

Jul 12 '24 03:07 github-actions[bot]

@pellared I'd say we are ready to merge as this is in-development - but yes, once we want to go stable, we will need a few prototypes actually. WDYT?

Jul 12 '24 10:07 carlosalberto

This has enough approvals and as it is experimental, merging it will help us motivate SIGs to prototype this. Will merge by EOD if there are no concerns.

Jul 24 '24 13:07 carlosalberto