opentelemetry-specification Standardize Server-Timing: traceparent "propagator" across vendors

What are you trying to achieve?

Multiple otel vendors have used HTTP Server-Timing headers to propagate server-side instrumentation context back to client instrumentation. I would like the otel specification to canonicalize the names, formats, and configuration options for this, and for the various otel implementations to accept donated implementations of this concept.

Additional context.

Client-side instrumentation (in the sense of web or mobile apps) may set outbound context via http headers which may be received by server-side instrumentation. However, there are a few cases where this breaks down:

users that want to keep a trust boundary between these domains and don't want untrusted clients to influence the way their server-side instrumentation behaves
initial page loads and resource loads in browsers (which JavaScript instrumentation can't influence)
users that don't want the added complexity/overhead of CORS preflight from browsers caused by adding headers to fetch/xhr requests

Multiple otel vendors have landed on a solution to the second point above, by using Server-Timing response headers generated by server-side instrumentation and received by client-side instrumentation.

A few links for your reference:

https://www.w3.org/TR/server-timing/ which is the spec for the server-timing header
https://caniuse.com/?search=server-timing (showing that this header is well-supported by browsers)
https://www.w3.org/TR/trace-context/#traceparent-header (the otel default for propagation)

Server-Timing response headers are keyed to a name (conceptually it could be used like "app=400, db=300, env=prod3"). Several otel vendors/contributors have independently used this in the fairly obvious way, where the key used is traceparent and the value is the full traceparent-format string. A complete example would be:

Server-Timing: traceparent;desc="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
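
As a rough illustration of the server side, here is a minimal sketch that builds the header value shown above. The function name server_timing_traceparent is hypothetical and not taken from any of the implementations linked below; the only assumptions are the traceparent layout from the W3C Trace Context spec and the traceparent;desc="..." convention described in this issue.

```python
import re

# W3C traceparent layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}")


def server_timing_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Return a Server-Timing metric that carries the given trace context.

    Hypothetical helper for illustration only.
    """
    flags = "01" if sampled else "00"
    value = f"00-{trace_id}-{span_id}-{flags}"
    if _TRACEPARENT.fullmatch(value) is None:
        raise ValueError(f"not a valid traceparent: {value!r}")
    # The metric name is "traceparent"; the context travels in "desc".
    return f'traceparent;desc="{value}"'


header_value = server_timing_traceparent(
    "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", sampled=True
)
print(f"Server-Timing: {header_value}")
```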

Some existing examples from around the otel universe:

Grafana donated PHP contrib code treating the server-timing header as a "propagator": https://github.com/open-telemetry/opentelemetry-php-contrib/tree/main/src/Propagation/ServerTiming
Splunk code from our java otel distribution: https://github.com/signalfx/splunk-otel-java/blob/c73b94575488458c1d267af3514fb0db25e48935/custom/src/main/java/com/splunk/opentelemetry/servertiming/ServerTimingHeaderCustomizer.java#L45
Microsoft client-side code looking for the header: https://github.com/microsoft/ApplicationInsights-JS/blob/7f804d81e3036d5115c0c8e859dec5c4ce08b269/shared/AppInsightsCore/src/JavaScriptSDK/W3cTraceParent.ts#L191-L204

Each one uses the exact same "propagation" concept (traceparent value format, traceparent is the key name in the server-timing header). They do differ in configuration/setup, and they also differ in client usage - for example, Microsoft's product (to my knowledge) uses it on browser page load to set the actual trace context for the page load, while Splunk clients add the server-side trace context as a trace link to the appropriate client-side http client span. In my opinion the specification can sidestep this issue (directed usage of the propagated context) for now, or recommend making it configurable.
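
The receiving side of the shared concept could be sketched roughly as follows. This is not the code of any of the vendors above; parse_server_timing_traceparent is a hypothetical name, and the parser is deliberately simplified (it only looks for a traceparent metric whose desc is a well-formed traceparent value, quoted or unquoted). What the client then does with the result (use it as a parent, or attach it as a span link) is left open, as discussed above.

```python
import re
from typing import Dict, Optional, Union

# Matches a "traceparent" metric at the start of the header or after a comma.
_METRIC = re.compile(
    r'(?:^|,)\s*traceparent\s*;\s*desc\s*=\s*"?'
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
    r'"?\s*(?:[;,]|$)'
)


def parse_server_timing_traceparent(
    header: str,
) -> Optional[Dict[str, Union[str, bool]]]:
    """Extract the trace context from a Server-Timing header value.

    Hypothetical helper; returns None when no traceparent metric is present.
    """
    match = _METRIC.search(header)
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        # Bit 0 of trace-flags is the "sampled" flag.
        "sampled": bool(int(flags, 16) & 0x01),
    }


example = 'db;dur=53, traceparent;desc="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"'
print(parse_server_timing_traceparent(example))
```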

Questions towards a specification

What is this? A "propagator"? In this sense a user could set their configured propagators for server-side instrumentation to be tracecontext,baggage,servertiming and the servertiming propagator would propagate back to the client?
If it's not a propagator, how would it fit into the spec and how would it be configured on/off (e.g., environment variables)?

Jan 10 '24 18:01 johnbley

FYI @cedricziel as you expressed interest in writing a proposal for this in slack before

Jan 10 '24 18:01 t2t2

Absolutely in favor of this. Pity ServerTiming is still not supported in Safari, though.

Jan 10 '24 18:01 mmanciop

Absolutely in favor of this. Pity ServerTiming is still not supported in Safari, though.

In safari js has access to it for xhr and fetch (possibly with Access-Control-Expose-Headers: Server-Timing) but not for docloads and resource fetches, yes. 😞

Jan 10 '24 20:01 johnbley

Seconding @mmanciop here. We would love to see a sustainable and supported way to communicate server context to client side technology.

Server-Timing is widely used even beyond the implementations mentioned and I think OTel would benefit a lot from a specification of using it for the purpose of forwarding context to clients.

Jan 10 '24 20:01 cedricziel

Talk about timing! I was just discussing this yesterday with a few other folks and even opened this here: https://github.com/w3c/trace-context/issues/556

Here are a few notes based on my investigation so far:

we might want to use traceresponse instead of traceparent, which is defined in a draft of Trace Context. It's almost the same payload, except that it uses child-id instead of parent-id in one of the fields
on the OTel SDK side, we might need to split the notion of propagators that are sending data forward (request propagators), and propagators that are sending data back to the callers (response propagators). Otherwise, we'll send traceparent back to clients and traceresponse to servers. I'm working on a draft proposal for this and will be opening an OTEP soon.
a second OTEP would be a change to the SDK spec, so that all SDKs can implement a "backward propagator" (response propagator) for Server-Timing + traceresponse.
I believe that the W3C Trace Context WG should revisit the decision about the header name for traceresponse before stabilizing the draft, favoring the use of Server-Timing with a metric name "traceresponse" instead of a new "traceresponse" header. I gathered some evidence of usages of Server-Timing for our purposes in the linked W3C Trace Context issue, and I'm happy to see that @johnbley's research corroborates it.

Jan 11 '24 08:01 jpkrohling

I like all of what @jpkrohling has to say. I like the idea of using Server-Timing: traceresponse since it is semantically clearer (again, if the client wants to use this as a parent, more power to it). I also really like the idea of a spec-level differentiation between request propagators and response propagators (though maybe not a config-level differentiation - we have enough environment variables as it is). It seems like it would enable other areas of innovation around sampling approaches or overall coordination among instrumentation code.

Jan 11 '24 16:01 johnbley

In safari js has access to it for xhr and fetch (possibly with Access-Control-Expose-Headers: Server-Timing) but not for docloads and resource fetches, yes.

To explain the impact to others landing here: ServerTiming is the only known way (AFAICT) in the RUM industry to reliably correlate document (page) loads or resource (pre)fetches with distributed traces, as the distributed tracer can inject the ServerTiming header in outgoing responses with values that represent the active trace context. And Safari is not playing ball as of Jan 2024 :-)

Conversely, SPAs sending XHR requests have no problem correlating their requests with the corresponding distributed traces server-side, as the JS in the browser can inject the traceparent header in the outgoing XHR request, which is picked up and used by the server-side tracer.

Jan 12 '24 14:01 mmanciop

Great point, @mmanciop. I was having problems understanding why we couldn't do it with our current solutions until @cedricziel showed me this diagram he created:

Jan 15 '24 09:01 jpkrohling

As mentioned in a previous comment, I was getting ready to propose a spec change related to this, and here's the draft I had. Note that I was breaking the task down into smaller chunks, the first being to expand the notion of propagators so that we define what a "client propagator" is. The next one, based on the outcome of the W3C Trace Context issue I listed earlier, would be to define the first client propagator based on traceresponse (either its own header, or as a metric of Server-Timing).

When working with client-side instrumentation, such as the ones being developed under the Client Instrumentation SIG, there’s currently no reliable way to obtain the trace context or any references to the trace generated by the backend during the initial document request on the client. While the client (browser, mobile app, …) might generate their trace IDs and send them via regular trace propagation mechanisms for correlation at the backend (like span links, or as the parent span), other scenarios might still be hard to implement. For instance, the response of a backend might cause a re-render of a UI component, and currently, it’s not possible to link the trace related to that re-render to the root span of the backend trace unless the trace ID has been created by the frontend and reused in the backend.

This spec change proposal enhances the concept of propagators to differentiate between “backward propagators” (or response propagators) and “forward propagators” (or request propagators):

Our current propagators are what will then be called “forward propagators”, given that they are intended to propagate the context to the next steps in the call chain. The context is typically added to request headers, but we want to avoid the term “request”, as this is not bound to HTTP requests.
Backward propagators are a new concept, propagating the context back to the caller, to the previous step in the call chain. This is typically added to response headers in an HTTP scenario.

Without this differentiation, implementing context propagation to clients would put headers in both requests and responses where they are not intended, which might cause ambiguity, conflicts, and increased payload size. For example, an application configured with the TraceContext propagator and a new hypothetical ClientPropagator might end up sending both of the following headers on all of its outgoing requests to downstream services and on its responses to callers:

Server-Timing: traceresponse;desc=00-123-456-01
traceparent: 00-123-789-01

This spec change is agnostic to the payload and relates only to enhancing the definition of propagators. The payload that would be used in the first recommended backward propagator is still under definition by the W3C Trace Context working group and is, therefore, out of scope for this change.
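
To make the split concrete, here is a minimal sketch of how an SDK might keep the two kinds of propagators apart. None of these class or method names exist in the OTel API today; PropagatorRegistry, inject_request, and inject_response are hypothetical, and the payloads reuse the placeholder values from the example above rather than real trace or span IDs.

```python
from typing import Dict, List, Protocol

Carrier = Dict[str, str]
Context = Dict[str, str]


class Propagator(Protocol):
    def inject(self, ctx: Context, carrier: Carrier) -> None: ...


class TraceContextPropagator:
    """Forward propagator: writes traceparent for the next hop."""

    def inject(self, ctx: Context, carrier: Carrier) -> None:
        carrier["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-01"


class ServerTimingResponsePropagator:
    """Backward propagator (hypothetical): writes context for the caller."""

    def inject(self, ctx: Context, carrier: Carrier) -> None:
        carrier["Server-Timing"] = (
            f"traceresponse;desc=00-{ctx['trace_id']}-{ctx['span_id']}-01"
        )


class PropagatorRegistry:
    """Keeps forward and backward propagators in separate lists."""

    def __init__(self, forward: List[Propagator], backward: List[Propagator]):
        self.forward = forward
        self.backward = backward

    def inject_request(self, ctx: Context, headers: Carrier) -> None:
        # Only forward propagators touch outgoing requests.
        for propagator in self.forward:
            propagator.inject(ctx, headers)

    def inject_response(self, ctx: Context, headers: Carrier) -> None:
        # Only backward propagators touch responses to the caller.
        for propagator in self.backward:
            propagator.inject(ctx, headers)


registry = PropagatorRegistry(
    forward=[TraceContextPropagator()],
    backward=[ServerTimingResponsePropagator()],
)
ctx = {"trace_id": "123", "span_id": "456"}  # placeholder IDs, as in the example
outgoing_request_headers: Carrier = {}
response_headers: Carrier = {}
registry.inject_request(ctx, outgoing_request_headers)
registry.inject_response(ctx, response_headers)
print(outgoing_request_headers)
print(response_headers)
```

With the two lists kept separate, the outgoing request carries only traceparent and the response carries only Server-Timing, which is the ambiguity the differentiation is meant to avoid.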

Jan 15 '24 13:01 jpkrohling

Yes, we should support passing this via the Server-Timing headers for browsers that support it.

Jan 16 '24 17:01 MSNev

About

Conversely, SPAs sending XHR requests have no problem correlating their requests with the corresponding distributed traces server-side, as the JS in the browser can inject the traceparent header in the outgoing XHR request, which is picked up and used by the server-side tracer.

and

While the client (browser, mobile app, …) might generate their trace IDs and send them via regular trace propagation mechanisms for correlation at the backend (like span links, or as the parent span)

Note that the addition of custom request headers in XHR/fetch instrumentations is prone to cause same-origin policy issues. This can be worked around using CORS, but this causes significant friction and is commonly misunderstood.

Ideally, a correlation solution does not have to (solely) rely on additional request headers. Server-Timing makes this more reliable and easier to deploy for users.

Jan 18 '24 14:01 bripkens

The W3C distributed tracing working group met with the Web Performance Working Group about exactly this today. Notes are in the tracking issue created by @jpkrohling https://github.com/w3c/trace-context/issues/556

The short version is that we are encouraged by the possibility of using server timing. The group had previously decided to define a custom header purely because server-timing was nascent, but the landscape is significantly improved now. The next step is for the tracing working group to translate its existing draft response header spec into a version which uses a server-timing metric.

Jan 30 '24 21:01 dyladan

The biggest concern currently with Server-Timing is browser support. It is not available in Safari or on iOS currently, and according to https://caniuse.com/server-timing it is available for about 75% of users. After discussion with the web performance group, it seems that Safari support is held back due to privacy concerns and is likely to be restricted to a same-origin policy regardless of CORS opt-in or Timing-Allow-Origin.

Jan 30 '24 21:01 dyladan

@dyladan, what I understood from https://github.com/w3c/trace-context/issues/556 regarding this last concern is that we'd face the same challenges with Safari, so we'd be in no better position if Trace Context decided to have its own response header, right?

Jan 31 '24 12:01 jpkrohling

Now, yes. In 2018 when the question was first considered the answer was less clear. I wasn't sharing it as a reason not to use server timing, just trying to make sure everyone was aware of the limitations.

Jan 31 '24 12:01 dyladan

@jpkrohling @johnbley The current traceresponse spec only enables the client to add a span link to the server-side span; it does not enable the typical parent-child relationship of spans in a trace. Are you proposing that the spec change should enable both? If so, then it's the traceresponse spec that needs to enable both these scenarios (perhaps a trace-flag could indicate whether the span id provided is a child-id or parent-id). On the other hand, if you are suggesting that we do only span linking then that's fine too and would be simpler overall, as more options mean more complexity.

Feb 21 '24 01:02 scheler

For completeness, there is another way to propagate the context back to the client for web document load, and that is by writing a meta tag in the HTML content. This is currently implemented in the OTel document-load instrumentation. This at least has the advantage of working on Safari as well.
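
A minimal sketch of that approach, assuming the tag takes the form of a meta element with name="traceparent" and the traceparent string as its content. The helper names here are hypothetical, and the reading side is shown in Python only to keep the example self-contained; in a browser the instrumentation would read the tag from the DOM.

```python
from html import escape
from html.parser import HTMLParser
from typing import Optional


def traceparent_meta_tag(traceparent: str) -> str:
    """Server side: render the meta tag to place in the document head."""
    return f'<meta name="traceparent" content="{escape(traceparent, quote=True)}">'


class _MetaFinder(HTMLParser):
    """Records the content of the first meta tag named "traceparent"."""

    def __init__(self) -> None:
        super().__init__()
        self.traceparent: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (
            tag == "meta"
            and attributes.get("name") == "traceparent"
            and self.traceparent is None
        ):
            self.traceparent = attributes.get("content")


def read_traceparent(html_text: str) -> Optional[str]:
    """Client side (illustrative): recover the traceparent from the HTML."""
    finder = _MetaFinder()
    finder.feed(html_text)
    return finder.traceparent


tp = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
page = f"<html><head>{traceparent_meta_tag(tp)}</head><body></body></html>"
print(read_traceparent(page))
```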

Feb 23 '24 00:02 martinkuba

This seems non-trivial enough to need an OTEP with more details.

Apr 19 '24 18:04 tigrannajaryan

I think this is the status: https://github.com/open-telemetry/opentelemetry-specification/pull/3825#issuecomment-2066466968

Apr 19 '24 18:04 jmacd

Discussed in the 4/23/24 Spec SIG. Given that the problem is very related to browsers, it might be appropriate for the client SIG to work on this. I've set the status to triage:accepted:needs-sponsor, but can update if the client SIG wants to take this on.

Apr 23 '24 15:04 jack-berg

Given that I opened a PR for this already (https://github.com/open-telemetry/opentelemetry-specification/pull/3825), I'm OK being the sponsor.

Apr 23 '24 15:04 jpkrohling

Maybe out of scope for this issue (or the PR above raised by @jpkrohling ) but if response propagators were to be configured to propagate context back to callers, would this be a good use of tracestate in response headers, to propagate a low-cardinality http.route that can be used not only by browser clients, but also by HTTP proxies, or in fact any client, to apply a more informative name both on client (and server in case of proxies) spans?

Jun 17 '24 11:06 danielgblanco