openzipkin.github.io

Converge discussion around B3 and TraceId/SpanId

Open codefromthecrypt opened this issue 9 years ago • 13 comments

http://zipkin.io/pages/instrumenting.html discusses propagation in terms of http and thrift, as well as parent id vs span id, etc.

There are several aspects around propagation that should be highlighted independently before being bound to a specific propagation carrier, such as a binary field or http headers.

For example, the following fields are used in propagation, even if not all are stored. Particularly things like 'debug' vs 'flags' are hard to understand.

Here are some useful things discovered and documented while making Brave's binary form match Finagle's.

SpanId

Key fields propagate together, even if they are sent in http as separate headers. It is useful to think of them as a unit named SpanId or TraceId, regardless of whether the propagation is in-process or not:

  • spanId - Unique 8-byte identifier of this span within a trace.
  • parentId - The parent's spanId, or null if this is the root span in a trace.
  • traceId - Unique 8-byte identifier for a trace, set on all spans within it.
  • flags - Bit flags, such as sampled or debug.

Not necessarily obvious uses of SpanId

Efficient and consistent logging key

Both Finagle and Brave have very efficient toString forms of this, which can make log searches easier. The format is $traceId.$spanId<:$parentId, and ends up looking like this 0000000000000001.0000000000000003<:0000000000000002
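As a rough illustration of that format, here is a minimal Python sketch. The function name format_span_id is hypothetical (it is not Finagle's or Brave's API), and it assumes each id is rendered as 16 lowercase hex digits, as in the example above.

```python
def format_span_id(trace_id: int, span_id: int, parent_id: int) -> str:
    """Render the compound key as traceId.spanId<:parentId in fixed-width hex."""
    mask = 0xFFFFFFFFFFFFFFFF  # view each id as an unsigned 64-bit value
    return (
        f"{trace_id & mask:016x}."
        f"{span_id & mask:016x}<:"
        f"{parent_id & mask:016x}"
    )


print(format_span_id(1, 3, 2))
# prints 0000000000000001.0000000000000003<:0000000000000002
```

Because every id has the same width, a log search for a trace can match on the first 16 characters alone.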

Alternative to "passing a span around"

The above compound key can be used as an alternative to "passing a span around". For example, in Finagle, this is used as a key in a map that has a mutable span. Instrumentation adds to this map until the span is converted into a transport object for reporting.
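The map-based approach might look roughly like the sketch below. This is not Finagle's code: SpanKey, MutableSpan, record and finish are hypothetical names, and the span here only collects annotation strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SpanKey:
    """Immutable (hashable) compound key; hypothetical stand-in for SpanId."""
    trace_id: int
    span_id: int
    parent_id: Optional[int]


@dataclass
class MutableSpan:
    """Accumulates data for one span until it is reported."""
    annotations: List[str] = field(default_factory=list)


pending: Dict[SpanKey, MutableSpan] = {}


def record(key: SpanKey, annotation: str) -> None:
    # Instrumentation only needs the key, not a reference to the span itself.
    pending.setdefault(key, MutableSpan()).annotations.append(annotation)


def finish(key: SpanKey) -> MutableSpan:
    # Remove the span from the map so it can be converted for reporting.
    return pending.pop(key)
```

Two call sites that build equal keys independently end up adding to the same mutable span, which is what makes passing the span object itself unnecessary.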

The Debug Flag

In all known propagation formats (e.g. both http and binary), flag bit 0 is the debug flag. For example, a flag value of 1 means this trace should pass any sampling, whether on the instrumentation side or the collection side.

Special Cases in Binary Encoding

Binary encoding is fixed-width 32 bytes

The binary structure of the above fields is 32 bytes, which means some encoding tricks are needed, as you need to know the difference between 0 and unset or null.
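A fixed-width layout like this could be sketched as follows. The field order (span id, parent id, trace id, flags) and the big-endian byte order are assumptions made for illustration; this issue does not spell them out, so check the Finagle source before relying on them.

```python
import struct

# Four big-endian unsigned 64-bit fields: 4 * 8 = 32 bytes.
# ASSUMPTION: the order span, parent, trace, flags is illustrative only.
_LAYOUT = struct.Struct(">QQQQ")


def encode(span_id: int, parent_id: int, trace_id: int, flags: int) -> bytes:
    return _LAYOUT.pack(span_id, parent_id, trace_id, flags)


def decode(data: bytes):
    if len(data) != _LAYOUT.size:
        raise ValueError(f"expected {_LAYOUT.size} bytes, got {len(data)}")
    return _LAYOUT.unpack(data)  # (span_id, parent_id, trace_id, flags)
```

Note that every field is always present, so a parent id of 0 and a missing parent id are indistinguishable without extra information in the flags.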

Most importantly, you can't just read the flags as 0 or 1 for a debug decision! For example, 3 is also debug, because in both cases bit 0 (FLAG_DEBUG) is set.
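In code, that means masking bit 0 instead of comparing the whole value. A minimal sketch (FLAG_DEBUG is the name used above; is_debug is a hypothetical helper):

```python
FLAG_DEBUG = 1 << 0  # bit 0


def is_debug(flags: int) -> bool:
    # Mask the bit; do not compare the whole flags value against 1.
    return (flags & FLAG_DEBUG) != 0


print([f for f in range(4) if is_debug(f)])
# prints [1, 3]
```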

Root Span

  • In systems like finagle, where the trace id is always a span id, spanId = parentId = traceId means this is the root span.
  • In systems where a trace id is not a span id, a separate flag is used: bit 3 of flags indicates that this is a root span and that the value of the parent id should be ignored.
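The two root-span conventions could be sketched like this. FLAG_IS_ROOT and the trace_id_is_span_id switch are hypothetical names introduced for the example; only the bit position comes from the text above.

```python
FLAG_IS_ROOT = 1 << 3  # bit 3; the constant name is hypothetical


def is_root(trace_id: int, span_id: int, parent_id: int, flags: int,
            trace_id_is_span_id: bool = True) -> bool:
    if trace_id_is_span_id:
        # Finagle-style: all three ids are equal on the root span.
        return span_id == parent_id == trace_id
    # Otherwise the parent id value is meaningless when the root bit is set.
    return (flags & FLAG_IS_ROOT) != 0
```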

Sampled Flag

Flags are bits that can either be zero or one. However, a sampling decision has three possible values: Sampled, Don't Sample, or Don't Know. The last is not a well-documented option, but it does exist. In order to tell the difference between yes, no and don't know, we need 2 flags.

  • Flag bit 1 (FLAG_SAMPLING_SET) indicates whether flag bit 2 (FLAG_SAMPLED) should be interpreted as a sampling decision. If bit 1 is 0, then you don't read bit 2.
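Decoding the three states from the two bits might look like the sketch below. The constant names follow the text above; the function name sampled_from_flags is hypothetical, and None stands in for "don't know".

```python
from typing import Optional

FLAG_SAMPLING_SET = 1 << 1  # bit 1: a sampling decision has been made
FLAG_SAMPLED = 1 << 2       # bit 2: the decision itself, valid only if bit 1 is set


def sampled_from_flags(flags: int) -> Optional[bool]:
    if not flags & FLAG_SAMPLING_SET:
        return None  # don't know: bit 2 must not be read
    return (flags & FLAG_SAMPLED) != 0
```

For example, a flags value of 4 has bit 2 set but bit 1 clear, so it decodes to None and not to True.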

codefromthecrypt avatar May 11 '16 01:05 codefromthecrypt

cc at least @chimericalidea @kristofa @abesto @mikewrighton @michaelsembwever

codefromthecrypt avatar May 11 '16 01:05 codefromthecrypt

Q: what is the interpretation of Don't Know value for Sampled, where it exists? Given that Sampled is usually used to decide whether to store/not store the trace, a 3rd value is odd.

yurishkuro avatar May 11 '16 02:05 yurishkuro

In finagle, they say "none means we defer decision to someone further down in the stack."

https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/tracing/Id.scala#L154

codefromthecrypt avatar May 11 '16 02:05 codefromthecrypt

Looks like this is used in case the sampler failed or something? https://github.com/twitter/finagle/blob/fc321f804a22a695ec419902505c8509ffbd594d/finagle-zipkin/src/main/scala/com/twitter/finagle/zipkin/thrift/Sampler.scala#L80

@mosesn any more context on sampled = None?

codefromthecrypt avatar May 11 '16 02:05 codefromthecrypt

Hmm, I'm not 100% sure, but I think it's because when we first create a TraceId, we haven't made a decision on whether to sample or not yet:

https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/tracing/Trace.scala#L141-L152

We might be able to make that decision sooner, and I don't think we need the third state in the wire protocol, imo.

mosesn avatar May 11 '16 03:05 mosesn

Actually, do we need to encode it at all? If we use a protocol with optional headers, we can simply not encode the header to signal "off", and encode it to signal "on".

mosesn avatar May 11 '16 03:05 mosesn

I think this is more an encoding issue in the fixed-length binary encoding of the Trace Id https://github.com/twitter/finagle/blob/fc321f804a22a695ec419902505c8509ffbd594d/finagle-core/src/main/scala/com/twitter/finagle/tracing/Id.scala#L103

Absence of header as being the same as don't sample might take some thinking through

codefromthecrypt avatar May 11 '16 03:05 codefromthecrypt

Yeah, migration would be a pain in the ass.

mosesn avatar May 11 '16 03:05 mosesn

Well like you said, if we make it false unless specified, we imply a strict coupling of id provisioning and sampling.

Right now, clients like a browser plugin can send a trace ID without indicating it is debug or otherwise needs to be sampled.

I think the important thing is to document what this is now, since it is the case that sampled is optional and doesn't imply don't sample.
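To make the current behaviour concrete, here is a small sketch of reading the decision from http headers. The header name X-B3-Sampled and its "1"/"0" values come from B3 http propagation, not from this thread; the function name is hypothetical, and the lookup assumes header names have already been normalised to this exact case.

```python
from typing import Mapping, Optional


def sampled_from_headers(headers: Mapping[str, str]) -> Optional[bool]:
    value = headers.get("X-B3-Sampled")
    if value is None:
        return None  # absent means "no decision yet", not "don't sample"
    return value == "1"
```

Under the proposed change (default to unsampled), the first branch would return False instead, which is why it would need a new version of propagation.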

Then, we could open an issue for a version of propagation that changes this.. like default to unsampled etc.

SG?

codefromthecrypt avatar May 11 '16 04:05 codefromthecrypt

Yeah, seems reasonable. So to make sure we're on the same page:

Some(true) // always sample / debug
Some(false) // never sample
None // implementation can choose whether to sample or not

That seems about right?

mosesn avatar May 11 '16 05:05 mosesn

yep, except the debug part, since that's a different flag :)

From a code POV, you are correct. The sampled() method includes the debug flag in its decision.
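Roughly, the combined decision behaves like this sketch (a paraphrase in Python with a hypothetical function name, not the actual Finagle method):

```python
from typing import Optional


def sampled_decision(debug: bool, sampled: Optional[bool]) -> Optional[bool]:
    # Debug wins over whatever the sampled state says; otherwise pass it through.
    return True if debug else sampled
```

So Some(true) can come from either flag, while None still means the implementation can choose.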

codefromthecrypt avatar May 11 '16 05:05 codefromthecrypt

Am I right in thinking that flags are only used by logic in the instrumentation code, not anywhere in the Zipkin backend?

mikewrighton avatar May 11 '16 20:05 mikewrighton

Correct except that the debug flag is stored as Span.debug

codefromthecrypt avatar May 11 '16 23:05 codefromthecrypt