finagle icon indicating copy to clipboard operation
finagle copied to clipboard

Support propagating 128-bit trace identifiers

Open codefromthecrypt opened this issue 8 years ago • 11 comments

When Finagle receives a 128-bit trace identifier, it should be able to propagate it without reducing its size to 64-bits.

Expected behavior

TraceId should have an extra field to carry the high bits of a 128-bit trace id. In other instrumentation, this is called traceIdHigh (or traceIdHi). This should be initialized to zero, and encoded if set.

For example, if serializing to the X-B3-TraceId header, when non-zero, traceIdHigh becomes the left-most 16 hex characters. If serializing to a buffer, it prefixes the existing traceId bytes at position 16 of a 40 (extended from 32) byte buffer.

Actual behavior

Currently, TraceId.traceId is a SpanId which is a 64-bit long in representation. This is serialized as a 16character hex X-B3-TraceId header or 8 bytes in the TraceId binary form (offset 16).

Until TraceId.traceIdHigh or similar exists, and the serialized forms are updated accordingly, any application propagating through finagle will downgrade from 128 to 64bit trace identifiers. While this is fine in the current state of zipkin, eventually, this will become a problem (ex when propagating out to a different tracing system that requires the original 128-bit id)

codefromthecrypt avatar Oct 27 '16 08:10 codefromthecrypt

@kevinoliver @mosesn while the timing of implementing this isn't critical, I'd like buy-in on the approach used for binary serialization of TraceId (when 128-bits are in use). This is so that other libraries can update in anticipation of similar updates here.

The two most sensible choices seem to be..

  • check length and if 40 read traceIdHigh from position 16 (pushing everything else right 8)
  • check length and if 40 read traceIdHigh from position 32 (leaving everything else where it is now)

codefromthecrypt avatar Oct 27 '16 08:10 codefromthecrypt

@adriancole What approach are other systems taking here? Its not obvious to me what the tradeoffs are between those two choices.

kevinoliver avatar Nov 01 '16 18:11 kevinoliver

Brave (who's binary form in cassandra tracing context) uses the former (keep the two halves of the ID next to eachother in byte order). I think this makes most sense, but you do need to guard on length to see if you should read a traceIdHigh or not.

The latter was mainly to show two options :) Hypothetically, some demarshaller which works today might work by just choosing not to read the last 8 bytes. That said, it really depends on how the bytes are carried.

codefromthecrypt avatar Nov 02 '16 01:11 codefromthecrypt

So I was chatting with @ellispritchard who maintains https://github.com/Financial-Times/tapper, and @fishcakez who is using it in a mixed environment with finagle. Also, I've thought about this a while now.

I think the following is the best way for binary propagation:

  • check length and if 40 read traceIdHigh from position 32 (leaving everything else where it is now)

For others, we'd follow existing approaches that are working well for nearly a year now:

  • add traceIdHigh field to trace context carriers, initialized to zero
  • encode traceIdHigh as the left-most (high bits) of the X-B3-TraceId header when not zero.
  • for binary, add traceIdHigh as the last 8 bytes (when reading, check-length)

I'm happy to help with this. Now we've a business case (existing user needing it), seems a good time to start. Sound good?

codefromthecrypt avatar Aug 02 '17 02:08 codefromthecrypt

Sounds good to me!

mosesn avatar Aug 02 '17 03:08 mosesn

My company uses 128bit trace ids within our tracing system (based on zipkin). My team's stack is exclusively based on top of Finagle (Finatra, finagle-mysql, finagle-redis) and we regularly have colliding ids because of the reduction to Long within Finagle.

I'd be interested in evaluating the above suggestion when it's ready.

jimschubert avatar Sep 16 '17 15:09 jimschubert

@jimschubert would you be interested in taking a stab at it?

mosesn avatar Sep 16 '17 17:09 mosesn

@mosesn I don't have an lot of free time outside of work right now. I can see about allocating time for this at work, but identifying alternatives is low priority on our backlog at the moment. If I can get get the time, I'd love to take a stab at it. I'll let you know here if I can, and I'll hit you up on gitter if I get stuck.

jimschubert avatar Sep 16 '17 19:09 jimschubert

@jimschubert sounds good to me!

mosesn avatar Sep 17 '17 09:09 mosesn

@adriancole @mosesn I took an initial stab at this. I started down the road of adding support for 128-bit SpanID and ParentID, but I think it's overkill. Could you guys check out the linked PR #651 and give feedback?

jimschubert avatar Sep 21 '17 01:09 jimschubert

Thx for the start. Widening X-B3-TraceId seems the most pragmatic start.

For binary, we could consider the check length approach of growing thrift later or adopt census (aka dapper2) format though it doesnt send parent ID https://github.com/census-instrumentation/opencensus-specs/blob/master/encodings/BinaryEncoding.md

codefromthecrypt avatar Sep 21 '17 13:09 codefromthecrypt