openzipkin.github.io
Converge discussion around B3 and TraceId/SpanId
http://zipkin.io/pages/instrumenting.html discusses propagation in terms of http and thrift, as well as parent id vs span id, etc.
There are several aspects around propagation that should be highlighted independently before being bound to a specific propagation carrier, such as a binary field or http headers.
For example, the following fields are used in propagation, even if not all are stored. Particularly things like 'debug' vs 'flags' are hard to understand.
Here are some useful things discovered and documented while making Brave's binary form match Finagle's.
SpanId
Key fields propagate together, even if they are sent in http as separate headers. It is useful to think of them as a unit named SpanId or TraceId, regardless of whether the propagation is in-process or not:
- spanId - Unique 8-byte identifier of this span within a trace.
- parentId - The parent's spanId, or null if this is the root span in a trace.
- traceId - Unique 8-byte identifier for a trace, set on all spans within it.
- flags - Like sampled or debug
Not necessarily obvious uses of SpanId
Efficient and consistent logging key
Both Finagle and Brave have very efficient toString forms of this, which can make log searches easier. The format is $traceId.$spanId<:$parentId, and ends up looking like this: 0000000000000001.0000000000000003<:0000000000000002
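A minimal Python sketch of that logging key, assuming each id is rendered as 16 zero-padded lower-hex characters (which is what the example above shows). The names `to_lower_hex` and `log_key` are hypothetical and are not taken from Finagle or Brave.

```python
MASK_64 = (1 << 64) - 1  # ids are 8 bytes, so keep only the low 64 bits


def to_lower_hex(value: int) -> str:
    """Render a 64-bit id as 16 zero-padded lower-hex characters."""
    return format(value & MASK_64, "016x")


def log_key(trace_id: int, span_id: int, parent_id: int) -> str:
    """Build the $traceId.$spanId<:$parentId form described above."""
    return (
        f"{to_lower_hex(trace_id)}."
        f"{to_lower_hex(span_id)}<:"
        f"{to_lower_hex(parent_id)}"
    )
```

Calling `log_key(trace_id=1, span_id=3, parent_id=2)` reproduces the example string above. Masking to 64 bits means an id held as a negative signed long still prints as 16 hex characters.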
Alternative to "passing a span around"
The above compound key can be used as an alternative to "passing a span around". For example, in Finagle, this is used as a key in a map that has a mutable span. Instrumentation adds to this map until the span is converted into a transport object for reporting.
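To illustrate the idea (not Finagle's actual implementation), here is a rough Python sketch of a map keyed by the compound id. `SpanKey`, `MutableSpan` and `SpanMap` are hypothetical names, and the string annotations are a simplification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, NamedTuple, Optional


class SpanKey(NamedTuple):
    """Immutable compound key; hashable, so it can index a dict."""
    trace_id: int
    span_id: int
    parent_id: Optional[int]


@dataclass
class MutableSpan:
    """Accumulates data until the span is reported (simplified)."""
    annotations: List[str] = field(default_factory=list)


class SpanMap:
    def __init__(self) -> None:
        self._spans: Dict[SpanKey, MutableSpan] = {}

    def annotate(self, key: SpanKey, value: str) -> None:
        # Instrumentation only needs the key, not a span reference.
        self._spans.setdefault(key, MutableSpan()).annotations.append(value)

    def finish(self, key: SpanKey) -> Optional[MutableSpan]:
        # Remove the span so it can be converted and reported once.
        return self._spans.pop(key, None)
```

Instrumentation at different layers can call `annotate` with an equal `SpanKey` and end up writing to the same span, without any of them holding a reference to it.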
The Debug Flag
In all known propagation formats (e.g. both http and binary), flag bit 0 is the debug flag. For example, a flags value of 1 means this trace should pass any sampling, whether on the instrumentation side or the collection side.
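A small Python sketch of reading the debug bit. The point is to test bit 0 with a mask rather than compare the whole flags value. The function name `is_debug` is an assumption for illustration.

```python
FLAG_DEBUG = 1 << 0  # bit 0 of flags


def is_debug(flags: int) -> bool:
    """True when bit 0 is set, whatever the other bits are."""
    return (flags & FLAG_DEBUG) == FLAG_DEBUG
```

With this check, flags values of 1 and 3 are both debug, while 0 and 2 are not.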
Special Cases in Binary Encoding
Binary encoding is fixed-width 32 bytes
The binary structure of the above fields is 32 bytes, which means some encoding tricks are needed, as you need to know the difference between 0 and unset or null.
Most importantly, you can't just read the flags as 0 or 1 for a debug decision! For example, 3 is also debug, because in both cases bit 0 (FLAG_DEBUG) is set.
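As a rough sketch of the fixed-width form, the four fields can be packed as four 8-byte values. The field order (spanId, parentId, traceId, flags) and the big-endian byte order are assumptions here, not something stated above, and `encode`/`decode` are hypothetical names.

```python
import struct

# Assumption: four big-endian unsigned 64-bit fields, in the order
# spanId, parentId, traceId, flags.
_LAYOUT = struct.Struct(">QQQQ")
MASK_64 = (1 << 64) - 1


def encode(span_id: int, parent_id: int, trace_id: int, flags: int) -> bytes:
    """Pack the four fields into exactly 32 bytes."""
    return _LAYOUT.pack(
        span_id & MASK_64,
        parent_id & MASK_64,
        trace_id & MASK_64,
        flags & MASK_64,
    )


def decode(data: bytes):
    """Unpack 32 bytes into (span_id, parent_id, trace_id, flags)."""
    if len(data) != _LAYOUT.size:
        raise ValueError("expected exactly 32 bytes")
    return _LAYOUT.unpack(data)
```

Because every field is always present, a decoded parent id of 0 cannot by itself say whether the parent is absent, which is why the extra conventions in this section are needed.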
Root Span
- In systems like Finagle, where the trace id is always a span id, spanId = parentId = traceId means this is the root span.
- In systems where a trace id is not a span id, a separate flag is used to ignore the value of the parent id: bit 3 of flags indicates you should ignore the parent id, as this is a root span.
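The two root-span conventions above could be checked roughly like this in Python. `FLAG_IS_ROOT`, `is_root_shared_ids` and `parent_id_or_none` are hypothetical names chosen for this sketch.

```python
from typing import Optional

FLAG_IS_ROOT = 1 << 3  # bit 3 of flags: ignore the parent id


def is_root_shared_ids(span_id: int, parent_id: int, trace_id: int) -> bool:
    """Convention where the trace id is always a span id."""
    return span_id == parent_id == trace_id


def parent_id_or_none(parent_id: int, flags: int) -> Optional[int]:
    """Convention where a flag says the encoded parent id is meaningless."""
    if flags & FLAG_IS_ROOT:
        return None
    return parent_id
```

In the second convention the bytes for the parent id are still present in the fixed-width encoding. The flag only tells the reader to disregard them.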
Sampled Flag
Flags are bits that can either be zero or one. However, a sampling decision has three possible values: Sampled, Don't Sample, or Don't Know. The last is not a well-documented option, but it does exist. In order to tell the difference between yes, no and don't know, we need 2 flags.
- Flag bit 1 (FLAG_SAMPLING_SET) indicates whether flag bit 2 (FLAG_SAMPLED) should be interpreted as a sampled decision. If bit 1 is 0, then you don't read bit 2.
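A minimal Python sketch of reading and writing the three-valued decision with these two bits. `None` stands for "don't know"; the function names `sampled` and `with_sampled` are assumptions for illustration.

```python
from typing import Optional

FLAG_SAMPLING_SET = 1 << 1  # bit 1: a sampling decision has been made
FLAG_SAMPLED = 1 << 2       # bit 2: the decision itself, valid only if bit 1 is set


def sampled(flags: int) -> Optional[bool]:
    """Return True, False, or None when no decision is recorded."""
    if not flags & FLAG_SAMPLING_SET:
        return None  # bit 2 must not be read
    return bool(flags & FLAG_SAMPLED)


def with_sampled(flags: int, decision: Optional[bool]) -> int:
    """Return flags with the sampling bits replaced by the given decision."""
    flags &= ~(FLAG_SAMPLING_SET | FLAG_SAMPLED)  # clear both bits first
    if decision is None:
        return flags
    flags |= FLAG_SAMPLING_SET
    if decision:
        flags |= FLAG_SAMPLED
    return flags
```

Note that a flags value of 4 (bit 2 set, bit 1 clear) decodes as "don't know", not as sampled, because bit 2 is ignored unless bit 1 is set.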
cc at least @chimericalidea @kristofa @abesto @mikewrighton @michaelsembwever
Q: what is the interpretation of Don't Know value for Sampled, where it exists? Given that Sampled is usually used to decide whether to store/not store the trace, a 3rd value is odd.
In finagle, they say "none means we defer decision to someone further down in the stack."
https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/tracing/Id.scala#L154
Looks like this is used in case the sampler failed or something? https://github.com/twitter/finagle/blob/fc321f804a22a695ec419902505c8509ffbd594d/finagle-zipkin/src/main/scala/com/twitter/finagle/zipkin/thrift/Sampler.scala#L80
@mosesn any more context on sampled = None
Hmm, I'm not 100% sure, but I think it's because when we first create a TraceId, we haven't made a decision on whether to sample or not yet:
https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/tracing/Trace.scala#L141-L152
We might be able to make that decision sooner, and I don't think we need the third state in the wire protocol, imo.
Actually, do we need to encode it at all? If we use a protocol with optional headers, we can simply not encode the header to signal "off", and encode it to signal "on".
I think this is more an encoding issue in the fixed-length binary encoding of the Trace Id https://github.com/twitter/finagle/blob/fc321f804a22a695ec419902505c8509ffbd594d/finagle-core/src/main/scala/com/twitter/finagle/tracing/Id.scala#L103
Absence of header as being the same as don't sample might take some thinking through
Yeah, migration would be a pain in the ass.
Well like you said, if we make it false unless specified, we imply a strict coupling of id provisioning and sampling.
Right now, clients like a browser plugin can send a trace ID without indicating it is debug or otherwise needs to be sampled.
I think the important thing is to document what this is now, since it is the case that sampled is optional and doesn't imply don't sample.
Then, we could open an issue for a version of propagation that changes this.. like default to unsampled etc.
SG?
Yeah, seems reasonable. So to make sure we're on the same page:
Some(true) // always sample / debug
Some(false) // never sample
None // implementation can choose whether to sample or not
That seems about right?
yep, except the debug part, since that's different flag :)
From a code POV, you are correct. the sampled() method includes the debug flag in its decision
Am I right in thinking that flags are only used by logic in the instrumentation code, not anywhere in the Zipkin backend?
Correct except that the debug flag is stored as Span.debug