Clarify usage of Distributed Tracing Extension by OpenTelemetry
Background
We have a working group in the OpenTelemetry project that is making progress towards defining conventions around tracing for messaging systems.
As part of this effort, the group is defining how to trace applications that use CloudEvents. There is a PR open in the OpenTelemetry specification (Introduce new semantic conventions for CloudEvents) where you can see what the proposal is.
The PR addresses when to create spans, how the context will "flow" between event "hops", and what such spans should look like (name, attributes, type, etc.). The goal is also to define a standard span structure: in theory, people looking at a trace for a "CloudEvents based system" should have the same experience everywhere.
In a nutshell, OpenTelemetry is going to rely heavily on the Distributed Tracing Extension defined in this spec. Via this extension, CloudEvents libraries (SDKs, etc.) that support OpenTelemetry can inject the span context into the event, and later on extract this context to continue the trace.
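To make the "inject the span context into the event" step concrete, here is a minimal, language-agnostic sketch in Python. The `traceparent` attribute name comes from the Distributed Tracing Extension (W3C Trace-Context format); the dict-based event shape and the `inject_trace_context` helper are illustrative, not a real SDK API.

```python
def inject_trace_context(event, trace_id, span_id, sampled=True):
    """Attach the current span context to a CloudEvent as an extension attribute."""
    flags = "01" if sampled else "00"
    # W3C Trace-Context format: version-traceid-spanid-flags
    event["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return event

event = {
    "specversion": "1.0",
    "type": "com.example.order.created",
    "source": "/orders",
    "id": "A234-1234-1234",
}
inject_trace_context(event, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(event["traceparent"])
# 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

On the consuming side, an instrumented library would read `event["traceparent"]` back and use it to continue (or link to) the trace.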
Proposed Changes
During one of the meetings, @clemensv suggested that it would make sense for the Distributed Tracing Extension specification to "link" to the OpenTelemetry semantic conventions. The idea is that people (SDK authors or users) coming to read this spec will see the link to OpenTelemetry and will be able to read there to get a full picture of why this extension exists and how it should be used.
As we all know, this was a cause of confusion in the past (Remove distributed tracing extension), as its purpose, and how it related to tracing frameworks such as OpenTelemetry, was not clear. Now that OpenTelemetry is defining its usage, CloudEvents users will benefit from it and hopefully no more confusion will arise.
Summary of changes
- Make it clear that this extension is what OpenTelemetry will use to trace CloudEvents applications
- Remove instructions on how/why this is used, as it's already defined by the OpenTelemetry specification
- Add a direct link to the relevant OpenTelemetry specification in the "Using the Distributed Tracing Extension" section
Questions for this group
- Should we rename this to "OpenTelemetry Distributed Tracing Extension"? Would this be a breaking change?
- The OpenTelemetry semantic conventions for CloudEvents mention that SDKs/libraries should offer ways to automatically "manage" span creation for users (auto-instrumentation). But since it doesn't make much sense to define in the OTel spec what behavior CloudEvents SDKs should have, I think we should consider adding this to the CloudEvents spec somehow.
As an example, the Go SDK offers such a capability: OpenTelemetry ObservabilityService. Users only have to opt in to use this "service" and all their CloudEvents operations (Send, Receive) are traced automatically. It might not be possible to offer the same behavior in all SDKs, but at least we should aim to have something that makes users' lives easier, using the language constructs available (middlewares, wrappers, handlers, etc.).
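The "opt in once, traced everywhere" idea can be sketched with a simple wrapper. This is not the Go SDK's ObservabilityService API, just a hypothetical Python middleware showing the shape such auto-instrumentation could take: wrap the send function so every outgoing event gets a trace context injected without the user doing anything per call.

```python
import secrets

def traced_sender(send):
    """Wrap a send(event) callable so it injects a traceparent before sending."""
    def wrapper(event):
        trace_id = secrets.token_hex(16)  # 32 hex chars (illustrative: a real
        span_id = secrets.token_hex(8)    # 16 hex chars  tracer supplies these)
        event["traceparent"] = f"00-{trace_id}-{span_id}-01"
        return send(event)
    return wrapper

sent = []
send = traced_sender(sent.append)  # opt in once...
send({"specversion": "1.0", "type": "com.example.ping", "source": "/demo", "id": "1"})
print("traceparent" in sent[0])  # True -- ...and every send is traced
```

A real implementation would start a span and take the IDs from the active span context; the wrapper/middleware mechanics are the transferable part.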
Release Note
Clarify usage of Distributed Tracing Extension by OpenTelemetry
Note: Please take a moment to read through the PR open in the OpenTelemetry spec. If you have questions or concerns, or don't agree with something there, leave a comment and consider joining the public meeting we have every Thursday (Instrumentation: Messaging).
@joaopgrassi thanks a ton for doing this! Hopefully this will address some of the concerns that have come up in the past.
Should we rename this to OpenTelemetry Distributed Tracing Extension? Would this be a breaking change?
IMO, I think it would be fine to do this rename if that's what the group wants.
We also use Distributed Tracing Extension in Azure SDKs and it would be great to see clarification on how it should be used in the CloudEvents spec.
I have a tiny concern about renaming it to OpenTelemetry Distributed Tracing Extension. It currently relies on the W3C Trace-Context standard only and there is nothing OpenTelemetry-specific about it.
E.g. in Azure SDKs, we support legacy tracing libraries along with OpenTelemetry, and if someone ever wants to plug in another tracing solution, they would still be able to do so and leverage the Distributed Tracing extension on CloudEvents.
I have a tiny concern about renaming it to OpenTelemetry Distributed Tracing Extension. It currently relies on the W3C Trace-Context standard only and there is nothing OpenTelemetry-specific about it.
Yes, I agree. I also have mixed feelings about renaming it. I think keeping it as-is is fine, as long as this document is clear on how it is used and people coming to this can find clear guidance. It can still be used by other means, but that's the user's choice, so there's not much we can do at that point, I'm afraid.
On the 12/02 call we approved the overall direction of this PR but we'll hold off merging it until the referenced doc is ready. Then we'll revisit it to make sure that between this PR and that doc everything looks good from an end-user's perspective.
@duglin thanks for this update!
One thing that I should bring up is that if the PR in OpenTelemetry gets merged, it will still be marked experimental. The reason is that we are also working on the more general messaging conventions and finding things out "as we go". Since CloudEvents and messaging overlap in some cases, it's likely that we'll need to make changes in the CloudEvents semantic conventions again as well. That could then also affect this document in the CloudEvents spec, but it's hard to say for now.
So we might consider only merging this once the CloudEvents semantic conventions in OpenTelemetry are marked as stable. What do you think?
@joaopgrassi it's ok with me if you'd prefer to wait - just let us know when things are ready. The only reason I can think of for merging this sooner rather than later is if the intermediate version would be a significant improvement over what's there today. Totally up to you...
Great, will wait for the OT PR: https://github.com/open-telemetry/opentelemetry-specification/pull/1978 and then look at this PR.
I hope we include the HTTP header names in the CloudEvents spec here for clarity.
@duglin
it's ok with me if you'd prefer to wait - just let us know when things are ready. The only reason I can think of for merging this sooner rather than later is if the intermediate version would be a significant improvement over what's there today. Totally up to you...
If it adds improvement to what we have today it's hard to say 😅 but it will def have to be adapted as we work on the OTel side. So let's leave it open for now, and I'll keep working on it.
I just found out that we actually have code that produces this ce-traceparent, so I also have to look at that. It is actually interesting, because it just came up today during our OTel messaging meeting that we'll need some sort of header to propagate the initial creation context for messaging in general, and this seems like a good candidate for CloudEvents. This header helps avoid having to unpack the event to read the value of the extension, which could be costly.
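The cost difference being described can be sketched as follows. With a dedicated transport header, an intermediary can read the trace context without deserializing the event; otherwise it has to unpack the structured-mode JSON body first. The `ce-traceparent` name follows the binary-mode `ce-` prefix convention; the dict-based header/body shapes and the `extract_traceparent` helper are illustrative assumptions, not a real SDK API.

```python
import json

def extract_traceparent(headers, body):
    # Fast path: dedicated header, no body parsing needed.
    tp = headers.get("ce-traceparent")
    if tp is not None:
        return tp
    # Slow path: deserialize the structured-mode JSON event to read the extension.
    return json.loads(body).get("traceparent")

tp = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(extract_traceparent({"ce-traceparent": tp}, b"{}"))                 # header hit
print(extract_traceparent({}, json.dumps({"traceparent": tp}).encode()))  # body fallback
```

For a proxy handling thousands of events per second, skipping the JSON parse on the hot path is exactly the saving the header buys.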
Just to make it official... we're putting this PR "on hold" until the OT side of things is ready. People should feel free to still comment on it, though.
Wanted to share my view: in the last ~6-7 years I've seen a fair share of clients and projects going for stronger choreography, often misusing Kafka in particular as a message queue, expecting "reliable processing" (whatever that means) and experiencing incredible difficulty tracking down anything leaving the happy path. Band-aids are usually hacked together and stuffed on top to somehow gain visibility and "control", often falling back to classic message queue metaphors (self-implemented dead-letter queues with manual copy-back jobs because processing was not implemented in a properly idempotent way, etc.). An enormous amount of effort and time goes down the drain for this; and it all starts with visibility.
Out of that I want to say: this is incredibly useful, so please embrace it / improve on it. Writing down a clear convention for how to transport (CloudEvents Extension) and how to link (OpenTelemetry) things to the original event-creating span is excellent and the right way to do it.
There's a range of use-cases that are just not properly solved otherwise:
- (most common) suboptimal auto-instrumentation - and this can happen easily. In many situations it can be a good idea to use an outbox pattern; you don't want to send out an event proclaiming a new state of the world and then fail internally. Here the event is logically created in the flow and within an RDBMS transaction, and then sent out by some work queue / "smart" in-house "best-practice" framework... so the technical tracing context, threads, etc. are suddenly different, and in most cases the context is not transported correctly
- Intermediaries / proxies / converters -> this is used as a headline example by @joaopgrassi; just check out Knative Eventing and Dapr. I think it's safe to assume that this will grow to funnel data between clouds / products, etc.
- Batch send which is even supported by CloudEvents HTTP Binding and employed by many of those bespoke in-house "best-practice" frameworks trying to be "smart"
- Batch consume (obvious) - so you cannot correlate back at this low level
The only solution I've seen doing it kind-of right (and this here is similar, with links, but of course tackling it at the right level) has been Azure EventHub with the corresponding libraries. They also link to the original span, showing it then in Azure Monitor. However, it's incredibly badly documented (the only good explanation I found back then was a Medium article) and inconsistent between e.g. the C#, Java and Node.js libraries... (or at least it was)
So I'd actually (once all is finalised and clarified; judging from the last fine-adjustments I saw from @lmolkova I have a positive feeling) love to (optionally) see this support in the CloudEvents libraries, or a good hook that the OpenTelemetry libs can hook into, because:
- encoding an object into the CloudEvent standard is actually the moment to create the CloudEvents Create <event_type> span and get that trace context into the CloudEvent
- decoding a CloudEvent is actually the moment at which, well, processing starts, so create the CloudEvents Process <event_type> span and link your processing to the original CloudEvent Create span
And yes, in the most simple scenario (HTTP direct call, no batching, no magic) you'd also see the chain in the normal HTTP W3C trace-headers and auto-instrumentation. But I think also in this scenario the additional link wouldn't hurt...
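The Create/Process relationship above can be modeled in a toy sketch (no real tracing SDK; the span-dict shape, `encode_event`, and `process_event` are all hypothetical): the Process span is not a child of the Create span but carries a link back to it, recovered from the traceparent carried by the event.

```python
import secrets

def create_span(name, links=()):
    return {"name": name, "span_id": secrets.token_hex(8), "links": list(links)}

def encode_event(payload, event_type):
    # Encoding is the moment the "Create" span is made and its context captured.
    span = create_span(f"CloudEvents Create {event_type}")
    event = {"type": event_type, "data": payload,
             "traceparent": f"00-{secrets.token_hex(16)}-{span['span_id']}-01"}
    return event, span

def process_event(event):
    # Decoding/processing links back to the span that created the event.
    origin_span_id = event["traceparent"].split("-")[2]
    return create_span(f"CloudEvents Process {event['type']}", links=[origin_span_id])

event, create = encode_event({"x": 1}, "com.example.order.created")
proc = process_event(event)
print(proc["links"] == [create["span_id"]])  # True
```

The link (rather than parent/child) is what keeps batch consume and out-of-band delivery correlatable: the Process span can belong to a completely different trace and still point at the originating Create span.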
just my (very long) 5 cents...
@joaopgrassi what's the status of things? It appears the OT PR might be merged.
Hi @duglin, yes, the initial PR for adding CloudEvents semantic conventions in OTel was merged. The current status is: there is a group working on messaging in general, trying to bring the messaging conventions to a stable 1.0 release. Because a lot of the things related to this overlap with CloudEvents, I expect some things might change, especially around the trace structure. I'm part of the group, so I'm keeping tabs on it.
@joaopgrassi cool - thanks for the update!
This Pull Request is stale because it has been open for 30 days with no activity. Mark as fresh by updating it, e.g., adding the comment /remove-lifecycle stale.
Not stale, we are still working on bringing messaging conventions in OTel to stable.
/remove-lifecycle stale
/remove-lifecycle stale
@joaopgrassi just a periodic ping to see if anything has changed
Hi @duglin! So something did happen actually! We have been working on an OTEP that re-defines the general messaging-systems semantic conventions, and that was merged last week. Now the work is to port the text from the OTEP proposal to the OTel specification and hopefully mark it stable.
Since CloudEvents has a lot of overlap with it, we have a tracking issue https://github.com/open-telemetry/semantic-conventions/issues/654 to investigate how the new conventions overlap/conflict and how we should proceed.
So, some movement, but nothing concrete for this yet. I will keep you/all posted on the progress.
@joaopgrassi great - thanks for the update
/remove-lifecycle stale
I didn't forget about this. We are moving forward with the stabilization of the messaging conventions in OTel, so we're getting closer to digging into this :)
@joaopgrassi thanks for the update!