
Adjust client/server span times to account for clock skew

Open rkargMsft opened this issue 6 months ago • 4 comments

Is your feature request related to a problem? Please describe.

Clock skew across machines can cause spans to appear at the "wrong" time relative to when they actually executed. For example, when a client span calls out to a server span that handles the request, you would expect the server span to lie wholly within the calling client span.

Describe the solution you'd like

One way to address this, which Jaeger UI uses, is to adjust the server span to fit within the calling client span so that causality is preserved when viewing the trace. This is optional and can be turned off to see the trace with the exact timestamps reported (which would match how Tempo currently displays things). The choice of how to adjust server spans is somewhat arbitrary, but centering the server span within the calling client span, as Jaeger does, is a reasonable way to visually maintain the causality between spans. I believe that Jaeger UI also displays a visual indication that this skew adjustment was applied.
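To make the centering idea concrete, here is a minimal sketch of how the offset for a server span could be computed. It is not Jaeger's or Tempo's code. The function name, the microsecond units, and the two guard cases (leave a server span alone if it already fits, or if it is longer than the client span) are assumptions made for illustration.

```python
def centering_delta_us(client_start_us: int, client_duration_us: int,
                       server_start_us: int, server_duration_us: int) -> int:
    """Return the offset (in microseconds) to add to the server span's
    timestamps so that it sits centered inside the client span.

    Hypothetical helper: names and guard cases are illustrative assumptions.
    """
    client_end_us = client_start_us + client_duration_us
    server_end_us = server_start_us + server_duration_us

    # Assumption: a server span that already lies within the client span
    # is consistent with causality, so it is left untouched.
    if server_start_us >= client_start_us and server_end_us <= client_end_us:
        return 0

    # Assumption: a server span longer than its client span cannot be made
    # to fit, so this sketch does not try to adjust it.
    if server_duration_us > client_duration_us:
        return 0

    # Split the leftover client time evenly before and after the server span.
    slack_us = (client_duration_us - server_duration_us) // 2
    return client_start_us + slack_us - server_start_us


# Server span reported after the client span has already returned.
late = centering_delta_us(0, 10_000, 20_000, 4_000)
# Server span reported before the client span started.
early = centering_delta_us(100_000, 10_000, 80_000, 4_000)
print(late, early)
```

A negative result moves the server span earlier and a positive result moves it later. Returning the delta rather than mutating the span makes it easy to keep the reported timestamps and apply the adjustment only at display time, which fits the request that the adjustment be optional.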

Describe alternatives you've considered

Ideally, clocks synchronized across the participating machines would produce "good enough" timestamps to see what's happening, but this isn't always possible depending on the hosting environment: cloud provider limitations, multiple regions, multiple providers, Windows system time resolution, etc.

Additional context

Jaeger Clock Skew Adjustment:

Jaeger backend combines trace data from applications that are usually running on different hosts. The hardware clocks on the hosts often experience relative drift, known as the clock skew effect. Clock skew can make it difficult to reason about traces, for example, when a server span may appear to start earlier than the client span, which should not be possible. The query service implements a clock skew adjustment algorithm (code) to correct for clock drift, using the knowledge about causal relationships between spans. All adjusted spans have a warning displayed in the UI that provides the exact clock skew delta applied to its timestamps.

In the example below, the green span is a client span making a call to the orange server span. Because of clock skew between these machines (both run Windows, with a default system time resolution of ~15 ms), the reported times make it look as if the server side ran after the calling span had returned a result. The red line under the client span shows the server span (and its children) being shifted to align with the middle of the client span.

Note that although the server span in this example is recorded later than the calling span, a server span can also be recorded with times earlier than those of the calling span.
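The shift described above applies to the server span together with its children, and the adjusted span carries a note of the delta that was applied. The sketch below illustrates that under stated assumptions: the `Span` class, its field names, and the warning text are hypothetical, and the midpoint rule is a simplified stand-in rather than the actual Jaeger algorithm.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    # Hypothetical span model; times are integer microseconds.
    name: str
    start_us: int
    duration_us: int
    children: List["Span"] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)


def shift_subtree(span: Span, delta_us: int) -> None:
    """Move a span and all of its descendants by the same offset."""
    span.start_us += delta_us
    for child in span.children:
        shift_subtree(child, delta_us)


def align_to_client_midpoint(client: Span, server: Span) -> int:
    """Center the server span on the client span and shift its children too.

    Returns the applied delta; a warning is recorded on the server span so
    a UI could show that the timestamps were adjusted.
    """
    slack_us = (client.duration_us - server.duration_us) // 2
    delta_us = client.start_us + slack_us - server.start_us
    if delta_us != 0:
        shift_subtree(server, delta_us)
        server.warnings.append(f"clock skew adjustment applied: {delta_us} us")
    return delta_us


# Client ran from 0 ms to 10 ms; the server reports 20 ms to 24 ms.
db_call = Span("db-call", start_us=21_000, duration_us=2_000)
server = Span("server", start_us=20_000, duration_us=4_000, children=[db_call])
client = Span("client", start_us=0, duration_us=10_000, children=[server])

delta = align_to_client_midpoint(client, server)
print(delta, server.start_us, db_call.start_us, server.warnings)
```

Because every span in the subtree moves by the same delta, the relative timing between the server span and its children is preserved. The same code handles a server span that was recorded too early, since the delta is simply positive in that case.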

image

rkargMsft, Jul 31 '24 17:07