Standardize Server-Timing: traceparent "propagator" across vendors
What are you trying to achieve?
Multiple otel vendors have used HTTP Server-Timing headers to propagate
server-side instrumentation context back to client instrumentation. I
would like the otel specification to canonicalize the names, formats,
and configuration options for this, and for the various otel implementations
to accept donated implementations of this concept.
Additional context.
Client-side instrumentation (in the sense of web or mobile apps) may set outbound context via http headers which may be received by server-side instrumentation. However, there are a few cases where this breaks down:
- users that want to keep a trust boundary between these domains and don't want untrusted clients to influence the way their server-side instrumentation behaves
- initial page loads and resource loads in browsers (where javascript instrumentation can't influence)
- users that don't want the added complexity/overhead of CORS preflight from browsers caused by adding headers to fetch/XHR requests
Multiple otel vendors have landed on a solution to the second point above, by using Server-Timing
response headers generated by server-side instrumentation and received by client-side instrumentation.
A few links for your reference:
- https://www.w3.org/TR/server-timing/ which is the spec for the server-timing header
- https://caniuse.com/?search=server-timing (showing that this header is well-supported by browsers)
- https://www.w3.org/TR/trace-context/#traceparent-header (the otel default for propagation)
Server-Timing response headers are keyed to a name (conceptually it could be used like
"app=400, db=300, env=prod3"). Several otel vendors/contributors have independently used this in
the fairly obvious way, where the key used is traceparent and the value is the full traceparent-format
string. A complete example would be:
Server-Timing: traceparent;desc="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
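To make the format concrete, here is a minimal sketch of how a client could pull the trace context out of such a header. The function and type names are hypothetical, and the parsing is deliberately simplified (it assumes no commas or semicolons inside quoted strings, which holds for a traceparent value); it is not taken from any of the vendor implementations.

```python
import re
from typing import NamedTuple, Optional


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    trace_flags: str


# W3C traceparent layout: version-traceid-spanid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def parse_server_timing_traceparent(header_value: str) -> Optional[TraceParent]:
    """Return the trace context carried in a Server-Timing value, or None."""
    # Simplification: metrics are split on commas and params on semicolons,
    # without handling those characters inside quoted strings.
    for metric in header_value.split(","):
        name, *params = [part.strip() for part in metric.split(";")]
        if name != "traceparent":
            continue
        for param in params:
            key, _, value = param.partition("=")
            if key.strip().lower() != "desc":
                continue
            match = _TRACEPARENT_RE.fullmatch(value.strip().strip('"'))
            if match is None:
                continue
            parsed = TraceParent(*match.groups())
            # All-zero trace or span IDs are invalid in Trace Context.
            if set(parsed.trace_id) == {"0"} or set(parsed.span_id) == {"0"}:
                return None
            return parsed
    return None
```

What the client then does with the result (use it as the parent, or attach it as a span link) is left to the caller.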
Some existing examples from around the otel universe:
- Grafana donated PHP contrib code treating the server-timing header as a "propagator": https://github.com/open-telemetry/opentelemetry-php-contrib/tree/main/src/Propagation/ServerTiming
- Splunk code from our java otel distribution: https://github.com/signalfx/splunk-otel-java/blob/c73b94575488458c1d267af3514fb0db25e48935/custom/src/main/java/com/splunk/opentelemetry/servertiming/ServerTimingHeaderCustomizer.java#L45
- Microsoft client-side code looking for the header: https://github.com/microsoft/ApplicationInsights-JS/blob/7f804d81e3036d5115c0c8e859dec5c4ce08b269/shared/AppInsightsCore/src/JavaScriptSDK/W3cTraceParent.ts#L191-L204
Each one uses the exact same "propagation" concept (traceparent value format, traceparent is the key name in
the server-timing header). They do differ in configuration/setup, and they also differ in client usage -
for example, Microsoft's product (to my knowledge) uses it on browser page load to set the actual trace context
for the page load, while Splunk clients add the server-side trace context as a trace link to the appropriate
client-side http client span. In my opinion the specification can sidestep this issue (directed usage of the propagated
context) for now, or recommend making it configurable.
Questions towards a specification
- What is this? A "propagator"? In this sense a user could set their configured propagators for server-side instrumentation to be tracecontext,baggage,servertiming and the servertiming propagator would propagate back to the client?
- If it's not a propagator, how would it fit into the spec and how would it be configured on/off (e.g., environment variables)?
FYI @cedricziel as you expressed interest in writing a proposal for this in slack before
Absolutely in favor of this. Pity ServerTiming is still not supported in Safari, though.
> Absolutely in favor of this. Pity ServerTiming is still not supported in Safari, though.
In Safari, JS has access to it for XHR and fetch (possibly with Access-Control-Expose-Headers: Server-Timing) but not for docloads and resource fetches, yes. 😞
Seconding @mmanciop here. We would love to see a sustainable and supported way to communicate server context to client side technology.
Server-Timing is widely used even beyond the implementations mentioned and I think OTel would benefit a lot from a specification of using it for the purpose of forwarding context to clients.
Talk about timing! I was just discussing this yesterday with a few other folks and even opened this here: https://github.com/w3c/trace-context/issues/556
Here are a few notes based on my investigation so far:
- we might want to use traceresponse instead of traceparent, which is defined in a draft of Trace Context. It's almost the same payload, except that it uses child-id instead of parent-id in one of the fields
- on the OTel SDK side, we might need to split the notion of propagators that are sending data forward (request propagators), and propagators that are sending data back to the callers (response propagators). Otherwise, we'll send traceparent back to clients and traceresponse to servers. I'm working on a draft proposal for this and will be opening an OTEP soon.
- a second OTEP would be a change to the SDK spec, so that all SDKs can implement a "backward propagator" (response propagator) for Server-Timing + traceresponse.
- I believe that the W3C Trace Context WG should revisit the decision about the header name for traceresponse before stabilizing the draft, favoring the use of Server-Timing with a metric name "traceresponse" instead of a new "traceresponse" header. I gathered some evidence of usages of Server-Timing for our purposes in the linked W3C Trace Context issue, and I'm happy to see that @johnbley's research corroborates it.
I like all of what @jpkrohling has to say. I like the idea of using Server-Timing: traceresponse since it is semantically clearer (again, if the client wants to use this as a parent, more power to it). I also really like the idea of a spec-level differentiation between request propagators and response propagators (though maybe not a config-level differentiation - we have enough environment variables as it is). It seems like it would enable other areas of innovation around sampling approaches or overall coordination among instrumentation code.
> In Safari, JS has access to it for XHR and fetch (possibly with Access-Control-Expose-Headers: Server-Timing) but not for docloads and resource fetches, yes.
To explain the impact to others landing here: ServerTiming is the only known way (AFAICT) in the RUM industry to reliably correlate document (page) loads or resource (pre)fetches with distributed traces, as the distributed tracer can inject the ServerTiming header in outgoing responses with values that represent the active trace context. And Safari is not playing ball as of Jan 2024 :-)
Conversely, SPAs sending XHR requests have no problem correlating their requests with the corresponding distributed traces server-side, as the JS in the browser can inject the traceparent header in the outgoing XHR request, which is picked up and used by the server-side tracer.
Great point, @mmanciop. I was having problems understanding why we couldn't do it with our current solutions until @cedricziel showed me this diagram he created:
As mentioned in a previous comment, I was getting ready to propose a spec change related to this, and here's the draft I had. Note that I was breaking down the task into smaller chunks, the first one being to expand the notion of propagators so that we define what a "client propagator" is. The next one, based on the outcome of the W3C Trace Context issue I listed earlier, would be to define the first client propagator based on traceresponse (either its own header, or as a metric of Server-Timing).
When working with client-side instrumentation, such as the ones being developed under the Client Instrumentation SIG, there’s currently no reliable way to obtain the trace context or any references to the trace generated by the backend during the initial document request on the client. While the client (browser, mobile app, …) might generate their trace IDs and send them via regular trace propagation mechanisms for correlation at the backend (like span links, or as the parent span), other scenarios might still be hard to implement. For instance, the response of a backend might cause a re-render of a UI component, and currently, it’s not possible to link the trace related to that re-render to the root span of the backend trace unless the trace ID has been created by the frontend and reused in the backend.
This spec change proposal enhances the concept of propagators to differentiate between “backward propagators” (or response propagators) and “forward propagators” (or request propagators):
- Our current propagators are what’s then going to be called “forward propagator”, given that they are intended to propagate the context to the next steps in the call chain. This is typically added to request headers, but we want to avoid the term “request”, as this is not bound to HTTP requests.
- Backward propagators are a new concept, propagating the context back to the caller, to the previous step in the call chain. This is typically added to response headers in an HTTP scenario.
Without this differentiation, implementing context propagation to clients would result in headers being sent in request and response payloads where they are not intended, which might cause ambiguity, conflicts, and increased payload size. For example, an application configured with the TraceContext propagator and a new hypothetical ClientPropagator might end up sending both of the following headers on all of its outgoing requests to downstream services and on its responses to callers:
Server-Timing: traceresponse;desc=00-123-456-01
traceparent: 00-123-789-01
This spec change is agnostic to the payload and relates only to enhancing the definition of propagators. The payload that would be used in the first recommended backward propagator is still under definition by the W3C Trace Context working group and is, therefore, out of scope for this change.
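As a rough illustration of the split described above, the sketch below keeps forward and backward propagators in separate lists so that each header only ends up where it is intended. All class and method names here are hypothetical and do not correspond to any existing OTel SDK API. The traceresponse payload is an assumption based on the draft mentioned earlier, since the real payload is still being defined.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 lowercase hex characters
    span_id: str   # 16 lowercase hex characters
    sampled: bool


class Propagator(Protocol):
    def inject(self, ctx: SpanContext, carrier: Dict[str, str]) -> None: ...


class TraceContextForward:
    """Forward (request) propagator: writes the W3C traceparent header."""

    def inject(self, ctx: SpanContext, carrier: Dict[str, str]) -> None:
        flags = "01" if ctx.sampled else "00"
        carrier["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


class ServerTimingBackward:
    """Backward (response) propagator: hypothetical Server-Timing metric."""

    def inject(self, ctx: SpanContext, carrier: Dict[str, str]) -> None:
        flags = "01" if ctx.sampled else "00"
        # Assumed payload: same layout as traceparent, metric named traceresponse.
        metric = f'traceresponse;desc="00-{ctx.trace_id}-{ctx.span_id}-{flags}"'
        # Append rather than overwrite: the header may already carry metrics.
        existing = carrier.get("Server-Timing")
        carrier["Server-Timing"] = f"{existing}, {metric}" if existing else metric


@dataclass
class Propagators:
    forward: List[Propagator]
    backward: List[Propagator]

    def inject_request(self, ctx: SpanContext, headers: Dict[str, str]) -> None:
        # Only forward propagators touch outgoing requests.
        for propagator in self.forward:
            propagator.inject(ctx, headers)

    def inject_response(self, ctx: SpanContext, headers: Dict[str, str]) -> None:
        # Only backward propagators touch responses to the caller.
        for propagator in self.backward:
            propagator.inject(ctx, headers)
```

With a single undifferentiated list, both `inject` calls would run on every carrier, which is the duplication shown in the two-header example above.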
Yes, we should support passing this via the Server-Timing headers for browsers that support it.
About

> Conversely, SPAs sending XHR requests have no problem correlating their requests with the corresponding distributed traces server-side, as the JS in the browser can inject the traceparent header in the outgoing XHR request, which is picked up and used by the server-side tracer.

and

> While the client (browser, mobile app, …) might generate their trace IDs and send them via regular trace propagation mechanisms for correlation at the backend (like span links, or as the parent span)

Note that the addition of custom request headers in XHR/fetch instrumentations is prone to cause same-origin policy issues. This can be worked around using CORS, but this causes significant friction and is commonly misunderstood.
Ideally, a correlation solution does not have to (solely) rely on additional request headers. Server-Timing makes this more reliable and easier to deploy for users.
The W3C distributed tracing working group met with the Web Performance Working Group about exactly this today. Notes are in the tracking issue created by @jpkrohling https://github.com/w3c/trace-context/issues/556
The short version is that we are encouraged by the possibility of using server timing. The group had previously decided to define a custom header purely because server-timing was nascent, but the landscape is significantly improved now. The next step is for the tracing working group to translate its existing draft response header spec into a version which uses a server-timing metric.
The biggest concern currently with Server-Timing is browser support. It is not available in Safari or on iOS currently, and according to https://caniuse.com/server-timing it is available for about 75% of users. After discussion with the Web Performance group, it seems that Safari support is held back due to privacy concerns and is likely to be restricted to a same-origin policy regardless of CORS opt-in or Timing-Allow-Origin.
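For browsers that do honor the two opt-ins mentioned here, a server would typically need to send both response headers alongside Server-Timing for cross-origin callers. A minimal sketch follows; the helper name and the plain-dict header model are assumptions for illustration, not part of any framework.

```python
from typing import Dict


def add_server_timing_exposure(
    headers: Dict[str, str], allowed_origin: str
) -> Dict[str, str]:
    """Return a copy of the response headers with Server-Timing opt-ins added."""
    out = dict(headers)

    # Lets cross-origin fetch/XHR callers read the Server-Timing response header.
    exposed = [
        name.strip()
        for name in out.get("Access-Control-Expose-Headers", "").split(",")
        if name.strip()
    ]
    if "server-timing" not in (name.lower() for name in exposed):
        exposed.append("Server-Timing")
    out["Access-Control-Expose-Headers"] = ", ".join(exposed)

    # Lets the Resource Timing API expose serverTiming entries to that origin.
    out["Timing-Allow-Origin"] = allowed_origin
    return out
```

Neither header helps where the browser does not implement Server-Timing at all, which is the limitation described above.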
@dyladan, what I understood from https://github.com/w3c/trace-context/issues/556 regarding this last concern is that we'd face the same challenges with Safari, so we'd be in no better position if Trace Context decided to have its own response header, right?
Now, yes. In 2018 when the question was first considered the answer was less clear. I wasn't sharing it as a reason not to use server timing, just trying to make sure everyone was aware of the limitations.
@jpkrohling @johnbley The current traceresponse spec only enables the client to add a span link to the server side span, it does not enable the typical parent-child relationship of spans in a trace. Are you proposing that the spec change should enable both? If so, then it's the traceresponse spec that needs to enable both these scenarios (perhaps a trace-flag could indicate whether the span id provided is a child-id or parent-id). On the other hand, if you are suggesting that we do only span linking then that's fine too and would be simpler overall, as more options is more complexity.
For completeness, there is another way to propagate the context back to the client for web document load, and that is by writing a meta tag in the HTML content. This is currently implemented in the OTel document-load instrumentation. This at least has the advantage of working on Safari as well.
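A minimal sketch of the server side of that approach, assuming the tag is named traceparent and carries the traceparent string in its content attribute (my understanding of what the document-load instrumentation reads; please verify against its documentation). The function name is hypothetical.

```python
import re
from html import escape

# W3C traceparent layout: version-traceid-spanid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(r"[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}")


def traceparent_meta_tag(traceparent: str) -> str:
    """Render a <meta> tag carrying the server's trace context."""
    # Reject anything that is not a well-formed traceparent before it reaches HTML.
    if _TRACEPARENT_RE.fullmatch(traceparent) is None:
        raise ValueError(f"not a valid traceparent: {traceparent!r}")
    return f'<meta name="traceparent" content="{escape(traceparent, quote=True)}">'
```

The server would render this into the document's head, where client-side instrumentation can read it after the page has loaded.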
This seems non-trivial enough to need an OTEP with more details.
I think this is the status: https://github.com/open-telemetry/opentelemetry-specification/pull/3825#issuecomment-2066466968
Discussed in the 4/23/24 Spec SIG. Given that the problem is very related to browsers, it might be appropriate for the client SIG to work on this. I've set the status to triage:accepted:needs-sponsor, but can update if the client SIG wants to take this on.
Given that I opened a PR for this already (https://github.com/open-telemetry/opentelemetry-specification/pull/3825), I'm OK being the sponsor.
Maybe out of scope for this issue (or the PR above raised by @jpkrohling), but if response propagators were to be configured to propagate context back to callers, would this be a good use of tracestate in response headers, to propagate a low-cardinality http.route that can be used not only by browser clients, but also by HTTP proxies, or in fact any client, to apply a more informative name to both client spans and (in the case of proxies) server spans?