Consider trace context support for probability sampling

Open jmacd opened this issue 3 years ago • 2 comments

The OpenTelemetry project has been working to specify how it propagates information about probability sampling through a couple of OTEP drafts:

  • OTEP 168: Specify how to propagate consistent head sampling probability
  • OTEP 170: Probability sampling: Sampler Name and Adjusted Count attributes

The first of these discusses how to propagate head probability so that each Span recorded in a Trace Context that has the sampled flag set knows its "adjusted count", which is the inverse of its sampling probability. We have proposed to use power-of-two sampling rates, following research by Otmar Ertl, and have come to see a dedicated tracestate field as potentially too costly to have on by default.

Using tracestate means passing around 30 bytes per context. Considering this overhead, we would like to see a Version-1 W3C traceparent with the addition of a couple of bytes of information. Ideally we can do this with 6 or 7 bits of information, but it will require specifying a lot more about traceparent and about which bits of the TraceID are truly random.

This issue is a placeholder for raising this discussion in the W3C group.

jmacd, Aug 17 '21 17:08

The existing parts of a traceparent are base16 encoded. A version 0 traceparent reads like

traceparent: 00-TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT-SSSSSSSSSSSSSSSS-FF

where T, S and F are half bytes of the TraceID, SpanID, and Flag parts.
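As a rough illustration of the version 0 layout, here is a minimal Python sketch that splits a header value into its fields. The function name `parse_traceparent_v0` and the dictionary keys are made up for this example, and the sketch only checks field widths and the hex alphabet, nothing else.

```python
import re

# Version 0 layout: version "00", then 32 hex chars of TraceID,
# 16 hex chars of SpanID and 2 hex chars of flags, separated by dashes.
_V0 = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent_v0(value):
    """Split a version 0 traceparent value into its fields, or return None."""
    m = _V0.fullmatch(value.strip())
    if m is None:
        return None
    trace_id, span_id, flags_hex = m.groups()
    flags = int(flags_hex, 16)
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "flags": flags,
        "sampled": bool(flags & 0x01),  # lowest flag bit is the sampled flag
    }


fields = parse_traceparent_v0(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

Each T, S or F in the layout above corresponds to one hex character here, so the TraceID is 16 bytes, the SpanID 8 bytes and the flags 1 byte.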

The variation discussed by OpenTelemetry would be:

traceparent: 01-RRRRRRRRRRRRRRRRTTTTTTTTTTTTTTTT-SSSSSSSSSSSSSSSS-FF-PP

where each R represents a half byte of the truly random part of the TraceID (the most significant half), and each P represents new information related to the head probability. The 64 available values for PP are interpreted as the negative base-2 logarithm of the sampling probability:

  • 0: an adjusted count of 1 (i.e., probability == 2^0)
  • 1: an adjusted count of 2 (i.e., probability == 2^-1)
  • 2: an adjusted count of 4 (i.e., probability == 2^-2)
  • ...
  • 62: an adjusted count of 2^62
  • 63: an adjusted count of zero
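The mapping from PP to adjusted count can be sketched in a few lines of Python. The function name `adjusted_count` is hypothetical, and the sketch assumes PP arrives as two base16 characters, as in the proposed header above.

```python
def adjusted_count(pp_hex):
    """Map the two base16 characters of PP to an adjusted count."""
    p = int(pp_hex, 16)
    if not 0 <= p <= 63:
        raise ValueError("PP must encode a value in 0..63")
    if p == 63:
        # 63 is the special value meaning an adjusted count of zero.
        return 0
    # For 0..62 the probability is 2^-p, so the adjusted count is 2^p.
    return 2 ** p


counts = [adjusted_count(pp) for pp in ("00", "01", "02", "3e", "3f")]
```

For values 0 through 62 the sampling probability is simply the inverse of the returned count.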

In order to recognize unknown head sampling probability, we would propose a new trace context flag to indicate two things: (a) the most significant 64 bits of the TraceID are true random, (b) the head probability is known.

This proposal follows from research by Otmar Ertl, see https://arxiv.org/pdf/2107.07703.pdf.

jmacd, Aug 17 '21 18:08

Summary of discussion from working group meeting:

  • In order to satisfy the paper linked above, the actual random number itself does not need to be propagated. It is sufficient to propagate only the calculated randomness and the head probability (6 bits each, encodable as RRPP, where PP is the probability value and RR is the random value, each written as two base16 characters).
  • Randomness (required by this) and uniqueness (required by trace id) can be surprisingly complicated (see https://datatracker.ietf.org/doc/html/rfc4086 and https://datatracker.ietf.org/doc/html/rfc4122.html). We need to be very specific about our randomness/uniqueness requirements and how they may be met.
  • Enforcing randomness, or indeed any hard format restrictions, on trace ID was something that was opposed strongly in the past and is not likely to be accepted.
    • Alternative: add calculated randomness and head sampling probability RRPP to the end of the trace context header as a new field
    • Alternative: use a trace flag to denote that only some part of the trace id is format-restricted if and only if the flag is set.
      • Tracing systems that don't wish to use the restricted format simply propagate the header without special handling.
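To make the RRPP idea from the first bullet more concrete, here is a hedged Python sketch. It rests on assumptions that are not spelled out in the discussion: that the "calculated randomness" r is the number of leading zero bits in 62 random bits, and that a sampler with probability 2^-p keeps a trace when p <= r. All function names are hypothetical.

```python
import random

MAX_R = 62  # assumption: r is capped at 62 so it fits in 6 bits


def draw_r(rng=random):
    """Draw r with P(r >= k) = 2^-k by counting leading zero bits."""
    bits = rng.getrandbits(MAX_R)
    return MAX_R - bits.bit_length()


def is_sampled(r, p):
    """Assumed rule: keep the trace when p <= r.

    Because r never exceeds 62, p == 63 is never sampled, which matches
    an adjusted count of zero.
    """
    return p <= r


def encode_rrpp(r, p):
    """Encode r and p as two base16 characters each."""
    if not (0 <= r <= MAX_R and 0 <= p <= 63):
        raise ValueError("r must be in 0..62 and p in 0..63")
    return f"{r:02x}{p:02x}"


def decode_rrpp(text):
    """Split a four-character RRPP field back into (r, p)."""
    if len(text) != 4:
        raise ValueError("RRPP must be exactly four base16 characters")
    return int(text[:2], 16), int(text[2:], 16)


field = encode_rrpp(3, 2)  # "0302"
```

Under these assumptions every participant that sees the same r reaches a consistent decision for its own p, and only four extra characters travel with the context instead of the random bits themselves.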

dyladan, Aug 17 '21 20:08