opentelemetry-specification icon indicating copy to clipboard operation
opentelemetry-specification copied to clipboard

Discuss how to handle multiple agents trying to modify the tracecontext headers

Open kamtschatka opened this issue 3 years ago • 9 comments

What are you trying to achieve?

Propagation of TraceContext Headers should be working properly in case 2 wrappers are available on a webpage.

Additional context.

I have posted this already in the Trace Context issue tracker (https://github.com/w3c/trace-context/issues/451) and the OTEL JS issue tracker (https://github.com/open-telemetry/opentelemetry-js/issues/2295) but I was sent here, because apparently this is a spec issue.

Here is the description once again:

2 agents are on the page(OT and NewRelic) OT calculates the trace-parent header and a CRC value NewRelic modifies the trace-parent header, causing the CRC check to fail The problem here seems to be NewRelic, but if you see it more general, the same thing can happen to OT, because OT JS is also not considering any already existing trace-parent headers and just sets a new one, leading to the same outcome. With the current state of any implementation on the markt, I am guessing this is the same.

As we are also concerned from our side about this scenario, I would like to trigger a discussion here, because every Vendor would need to implement a solution for this to make it truly interoperable. My suggestion would be that the trace-parent header should be reused in case an XHR/fetch wrapper encounters this header instead of overriding it. The vendors themselves are then responsible to implement their own way of dealing with these cases. My idea would be to just store the trace-parent header in your internal datastructure and use it as an additional information to find the server side path for it.

I have now focused on the trace-parent header as this is not designed to have multiple values, but the tracestate header is also affected to some extent.

I'd be interested in your thoughts and ideas on how to solve this problem.

kamtschatka avatar Jul 20 '21 07:07 kamtschatka

I have now focused on the trace-parent header as this is not designed to have multiple values, but the tracestate header is also affected to some extent.

The tracestate header is designed to be unambiguously matched to a traceparent header, so I would say it is affected just as much (for example, it might not be possible to interpret some tracestate entry without knowing the trace ID from the traceparent).

Oberon00 avatar Jul 20 '21 07:07 Oberon00

The solution used previously was that each vendor uses their own header name. The problem of overriding each other came to be exactly because we standardized on a single header name. I think one solution may be to have a unique header prefix per "participant"/"tenant". E.g. you could set OTEL_CONTEXT_NAMESPACE=abc123 and it would then use traceparent-abc123 and tracestate-abc123 as header name.

I think this solution is better than trying to munch together multiple traces into a single traceparent header (for example Vendor 1 might already have a different trace ID from some incoming legacy tracing header, while OTel, having not interpreted that header, has chosen a different one; now it's impossible that they both use the same traceparent header).

Oberon00 avatar Jul 20 '21 07:07 Oberon00

I mean sure that would be an option, but then the question is what is the point of having a standard if nobody is using the standard? Or if the standard is only followed until a problem is discovered. A big advantage in the future (hopefully) is, that more and more vendors will allow those trace context related headers through firewalls, reducing the very common problem of having headers removed, which will result in broken traces.

According to the spec, the headers MUST be passed on and they MAY be modified (https://www.w3.org/TR/trace-context/#mutating-the-traceparent-field) Right now what happens is the "restart trace" action, but if I understand it correctly, this is reserved for front gates of secure networks, so that doesn't really apply to OTEL.

kamtschatka avatar Jul 20 '21 11:07 kamtschatka

I agree with @kamtschatka whatever we decide has to be standard compliant. I think the only way to solve this is to have some sort of global vendor API so that agents can detect and communicate with each other. This would likely involve collaboration and code changes in otel and in the vendor agents.

dyladan avatar Jul 20 '21 13:07 dyladan

In the end, if you can avoid it, you should not have multiple agents trying to do the same thing in the same process (same webpage).

Oberon00 avatar Jul 20 '21 15:07 Oberon00

@dyladan How did you envision that global API? Would the API just make the traceparent contexts available while inside of a wrapped XHR function? I am concerned that this would make us dependent on finding an agreement on yet another API, when the XHR object already offers the setRequestHeader function where someone can set those headers for a specific XHR.

@Oberon00 That would be the ideal scenario, but this is not going to happen ;-) We regularly see customers who have more than 1 tracing vendor on their page. It also goes against the whole idea of the TraceContext standard that is designed to make multiple vendors work together.

kamtschatka avatar Jul 26 '21 09:07 kamtschatka

It also goes against the whole idea of the TraceContext standard that is designed to make multiple vendors work together.

My impression is rather that tracecontext makes multiple tracing vendors working together harder, but it makes the job easier for cloud, proxy, network software vendors who have a standardized header name they can propagate from incoming to outgoing requests.

Oberon00 avatar Jul 26 '21 12:07 Oberon00

Implementing the W3C trace context spec is not hard if there is only 1 vendor on the page. Setting the headers on XHR and fetch requests can be done easily.

It gets complicated when multiple vendors are on the page, wrapping the same XHR/fetch functions:

  • If a traceparent header is set multiple times (once per vendor), the header gets concatenated with “,”, causing it to be invalid, breaking correlation with the server side
  • Alternatively if Vendor1 is calling a function on the wrapper of Vendor2 and Vendor2 decides to suppress/ignore that call, then this will mean correlation between the server side and Vendor 1 is broken, because the traceid/spanid is not propagated.

In theory it is possible to change the order of the scripts on the page to favor Vendor1 over Vendor2, but that just moves the problem to another vendor. It would in theory also be possible to not support this scenario, but from experience, it is very likely that 2 Vendors are used to monitor a web application and customers are not willing to remove one or the other.

So ideally there is a way to make all Vendors work together properly, to ensure proper correlation between client side and server side.

There are 2 options to make that happen:

  • All Vendors have a way to agree on the same traceid/spanid:
    • Vendor1: traceid = xxx, spanid = yyy
    • Vendor2: traceid = xxx, spanid = yyy
    • XHR: traceid = xxx, spanid = yyy
  • Vendors have a way to figure out if another Vendor overrode the traceid/spanid. This way the Vendor who had the traceid/spanid overridden can either:
    • update the traceid/spanid they report:
      • Vendor1: traceid = xxx, spanid = yyy
      • Vendor2: traceid = xxx, spanid = yyy
      • XHR: traceid = xxx, spanid = yyy
    • or store them in additional fields:
      • Vendor1: traceid = xxx, spanid = yyy
      • Vendor2: traceid = aaa, spanid = bbb, othertraceid = xxx, otherspanid = yyy
      • XHR: traceid = xxx, spanid = yyy

Updating the internal state with a new traceid/spanid might not be possible or not desirable though, as it might have been propagated/referenced by a span already, which would break the references.

XMLHttpRequest

For monitoring the XHR object, we have so far seen 3 approaches:

  • Modifying the “open”, “send”, … functions on the original XHR object
  • Overriding the XHR object with a completely new object
  • Using the Proxy API

One of the big limitations with the XMLHttpRequest objects is that it only has a setRequestHeader function, but no way to get the headers out again. The Vendor closest to the original XHR object has ultimately full control over which headers get set(or if there is no special implementation, setRequestHeader will be called multiple times with the same header name, leading to concatenation), the other Vendors have to blindly trust that the header gets actually set. This is not very reliable though (and more and more Vendors might be tempted to prevent setting the traceparent header twice to prevent completely broken correlation)

To solve the issue we came up with a few options:

  • Making sure all vendors can use the same traceid/spanid:
    • If a Vendor detects that setRequestHeader was called with a tracecontext, it updates its own internal state or stores that tracecontext on another property and passes the incoming tracecontext on.
      • Advantage:

        • Setting the tracecontext header is easy-->simply call setRequestHeader
      • Disadvantage:

        • Handling of setRequestHeader is necessary and non-trivial
        • The called Vendor needs to actually be able to update its own state with the incoming tracecontext
        • If that is not possible, the called Vendor needs to store the incoming tracecontext on another property, which will make querying data harder, because there are 2 fields with a traceid/spanid (e.g. “select spans where traceid=clientspan.traceid or traceid=clientspan.passedInTraceId)
      • Vendors can agree on the same traceid/spanid when creating a new XHR object by performing a lookup in a map and using the values returned or putting the tracecontext in the map:

        • Pseudocode:
        const tracecontext = lookupInMap(xhr);
        // continue with vendor specific code
        
        function lookupInMap(xhr) {
          let traceContext = globalMap.get(xhr);
          if (traceContext) {
            return traceContext;
          }
          traceContext= generateTraceContext();
          globalMap.put(xhr, traceContext);
          return traceContext;
        }
        
        • Advantage:
          • all Vendors use the same tracecontext → correlation with the server side is trivial (e.g. “select spans where traceid=clientspan.traceid”)
        • Disadvantage:
          • What to use as the key for the map? If the whole XHR object gets replaced by a vendor, the XHR object can’t be used as key. Could be circumvented by providing a function to get the original XHR object (has it’s own drawbacks as it would theoretically allow working around a certain vendor) or providing a function that returns some kind of symbol that can be used for a lookup.
          • Vendors need to actually be able to use this tracecontext. If the XHR is created as part of an existing trace, this is most likely not possible.
          • If that is not possible, Vendors need to store the “universal” tracecontext on another property again
      • Vendors provide a special “setTraceContext” function for other vendors to call. The called Vendor can the either reuse the passed in tracecontext or reject it and return a new tracecontext to the caller.

        • Advantage:
          • Solves the problem that Vendors need to agree on a key in the Map for looking up the tracecontext
        • Disadvantage:
          • The caller Vendors need to actually be able to use this tracecontext. If the XHR is created as part of an existing trace, this is most likely not possible.
          • If that is not possible, the caller Vendors need to store the tracecontext on another property again.

Fetch

For fetch calls it is a lot easier, because headers are passed to the fetch function in an object.

This leaves the following options:

  • The called Vendor reuses the incoming tracecontext
  • The calling Vendor checks the object for modifications and updates the internal state (not possible if that tracecontext was already propagated somewhere else)
  • The calling Vendor checks the object for modifications and stores the modified tracecontext on a new property

Potential issues:

  • This would only work if the passed in object is modified in place and not completely replaced by another object down the line though, so that would require some convention to only perform in-place modifications.
  • If customer code uses Object.freeze on the object, modifying in place is not possible
  • Customer code might run into issues if the passed in object gets modified by a Vendor

This is in no way a complete list of issues/advantages/disadvantages, but should be a good basis for further discussions between Vendors, to find a solution where everyone can benefit in multi-vendor scenarios, instead of starting a fight over injection order and incompatibilities.

kamtschatka avatar Feb 21 '24 12:02 kamtschatka

I fear that agreeing on traceId/spanId across different vendors/tools wont work in general. One tool might group two requests on one trace wheres the other prefers separate traces. As a result traceId must be the same for the one but different for the other.

What would be needed is a multi tenant propagation format but that's not the case for W3C traceparent. And the initial post indicates that w3c doesn't seem to adapt this as they see this as a problem to solve of the users.

W3C tracestate on the other hand is multi tenant capable.

But also for tracestate the current implementation of OTel propagators is problematic because they just write the value from OTel point of view. If another tool has written tracestate already it's overwritten. The propagators inject API doesn't provide read access to the carrier therefore Propagators can't merge even if they would like to do.

Flarna avatar Feb 23 '24 18:02 Flarna

After discussing this issue, we will be closing it as out of scope. Fundamentally, our preference would be for vendors to implement options that allow them to 'play nice' with OpenTelemetry.

austinlparker avatar Jun 18 '24 20:06 austinlparker

Wouldn't everybody like that? Shouldn't there at least be some guidance on how that could look like?

kamtschatka avatar Jun 18 '24 20:06 kamtschatka

@austinlparker Could you please provide some more details from that discussion? I find it a bit unfriendly to expect everyone to act as a friendly neighbor without OTel even trying to do the same.

Flarna avatar Jun 19 '24 06:06 Flarna

I can elaborate a bit more on our rationale -

  • First, the long-term goal of OpenTelemetry is to become 'built in' to underlying frameworks and libraries. Our ideal future state is one where users don't need to use 'agents' that could doubly-instrument code at all.

  • Second, we'll gladly accept PRs to specific instrumentation libraries and frameworks to aid in compatibility with existing agents (there have been several for the Java agent, for example).

It would be a pretty big lift to try and define spec-level affordances for things like this given the differences from language to language in how instrumentation hooks function.

If there's sufficient demand from users then we can of course re-evaluate this issue, but for now, we believe it's out of scope. If you do believe it's germane, I would suggest creating a new issue and bringing it to the specification SIG to present a proposal.

austinlparker avatar Jun 19 '24 15:06 austinlparker

Here's the OpenTelemetry Java agent's position on Running agent alongside other java agents - DataDog, AppDynamics, NewRelic...

this is not something we plan to officially test/support, but we will accept PRs to make it run better alongside other agents.

(https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/1534#issuecomment-812939452)

trask avatar Jun 19 '24 16:06 trask

Truly a shame, as this is a missed opportunity to bring different people from different vendors together to work on a solution together... I am pretty sure vendors will not make sure they work with every other vendor our there and since we don't agree on a way to solve this for everyone, everyone will have their own solution that is easy to implement and sure to work for them. Our only choice will be to block all the traceparent headers someone down the line tries to set, to avoid broken contexts for us (if our script is the first one on the page) and we'll defer blame to anybody up the chain for breaking the traceparent header and setting an additional value, as there is no other way now.

kamtschatka avatar Jun 19 '24 16:06 kamtschatka

@kamtschatka this issue and discussions are very JavaScript specific, so it's hard for me to understand what specifically you want to change in the specification itself.

maybe it would help to work in the JavaScript SIG to explore solutions and come back to the specification once there's a better understanding of what specification changes are needed?

trask avatar Jun 20 '24 00:06 trask