tempo
Adjust client/server span times to account for clock skew
Is your feature request related to a problem? Please describe.
Clock skew across machines can cause spans to appear at the "wrong" time in relation to the actual execution.
For example, when a client span calls out to a server span that handles the request, you would expect the server span to fall wholly within the calling client span.
Describe the solution you'd like
One way to address this, which Jaeger UI uses, is to adjust the server span to fit within the calling client span, preserving causality when viewing the trace. This is optional and can be turned off to see the trace with the exact timestamps reported (this would match how Tempo currently displays things). The choice of how to adjust server spans is a bit arbitrary, but centering the server span within the calling client span, as Jaeger does, is a reasonable way to visually maintain the causality between spans.
I believe that Jaeger UI also displays a visual indication that this skew adjustment was applied.
Describe alternatives you've considered
Ideally, clocks that are synchronized across participating machines would result in "good enough" timestamps to see what's happening, but this isn't always possible depending on hosting: cloud provider limitations, multiple regions, multiple providers, Windows system time resolution, etc.
Additional context
Jaeger Clock Skew Adjustment:
Jaeger backend combines trace data from applications that are usually running on different hosts. The hardware clocks on the hosts often experience relative drift, known as the clock skew effect. Clock skew can make it difficult to reason about traces, for example, when a server span may appear to start earlier than the client span, which should not be possible. The query service implements a clock skew adjustment algorithm (code) to correct for clock drift, using the knowledge about causal relationships between spans. All adjusted spans have a warning displayed in the UI that provides the exact clock skew delta applied to its timestamps.
In the example below, the green span is a client span making a call to the orange server span. Because of clock skew between these machines (they run Windows, with a default system time resolution of ~15 ms), the reported times make it look like the server side happened after the calling span returned a result. The red line under the client span shows the server span (and its children) being moved over to align with the middle of the client span.
Note that while this example has the server span recorded in a time period later than the calling span, it is also possible for a server span to have times that occur before those of the calling span.