opentelemetry-js
opentelemetry-js copied to clipboard
fix(tracing): make spans resilient to performance clock drift
Relates to #3279 particularly https://github.com/open-telemetry/opentelemetry-js/issues/3279#issuecomment-1302575571
Fixes #3355
Codecov Report
Merging #3384 (eb4c2ae) into main (bd56547) will decrease coverage by
0.70%. The diff coverage is91.07%.
Additional details and impacted files
@@ Coverage Diff @@
## main #3384 +/- ##
==========================================
- Coverage 93.27% 92.57% -0.71%
==========================================
Files 247 132 -115
Lines 7348 3152 -4196
Branches 1512 656 -856
==========================================
- Hits 6854 2918 -3936
+ Misses 494 234 -260
| Impacted Files | Coverage Δ | |
|---|---|---|
| packages/opentelemetry-core/src/common/time.ts | 95.45% <75.00%> (+0.06%) |
:arrow_up: |
| packages/opentelemetry-sdk-trace-base/src/Span.ts | 97.87% <93.75%> (-2.13%) |
:arrow_down: |
| api/src/platform/node/globalThis.ts | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ...s/opentelemetry-core/src/platform/node/sdk-info.ts | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ...opentelemetry-core/src/platform/node/globalThis.ts | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ...pentelemetry-core/src/platform/node/performance.ts | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| .../packages/api-logs/src/platform/node/globalThis.ts | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ...lemetry-resources/src/detectors/ProcessDetector.ts | 31.81% <0.00%> (-68.19%) |
:arrow_down: |
| ...perimental/packages/otlp-exporter-base/src/util.ts | 78.94% <0.00%> (-15.79%) |
:arrow_down: |
| ...ckages/opentelemetry-exporter-zipkin/src/zipkin.ts | 84.48% <0.00%> (-15.52%) |
:arrow_down: |
| ... and 119 more |
Made a sandbox with this patch, most frequently used otel web instrumentations (document-load, user-interaction, xhr and fetch) and an artificial time drift simulator:
https://codesandbox.io/s/modest-fermat-4eu09e?file=/src/otel-fixes.ts
It doesn't improve this test case a lot (user interaction's in right time scale, as it's auto + auto), pretty much all cases of performance timestamps are still affected by drift
- document load: performance time start + performance time end:

- fetch: auto start + performance time end (duration becomes negative -> set to 0)

It doesn't improve this test case a lot (user interaction's in right time scale, as it's auto + auto), pretty much all cases of performance timestamps are still affected by drift
These instrumentations all use auto start and manual end times.
duration becomes negative -> set to 0
If the duration is negative it should be getting the drift offset correction applied, not just set to 0. If you look at the diagram in the linked comment there is no situation where it should be set to 0, so there must be some bad code path.
Updated codesandbox with pr changes as of now
This was the root of the 0-duration spans in your example too
Yep fixed on auto start + perf end, however span events (input: perf timestamp) also need correcting (now they end up outside the span w/ drift)

These instrumentations all use auto start and manual end times.
document-load spans don't, as they use manual (performance API) times for both start and end, and these still get drifted:

document-load spans don't, as they use manual (performance API) times for both start and end, and these still get drifted
I'm not sure what we can do there. If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.
If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.
There is currently an assumption that any numeric value lower than timeOrigin is a performance timestamp:
https://github.com/open-telemetry/opentelemetry-js/blob/1b7246abb49164ec5c2cff7b66956c454f230fed/packages/opentelemetry-core/src/common/time.ts#L88
So it's actually impossible to create spans in the past (w/ accounting for drift) anyway:

I'm not sure what we can do there. If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.
We can document how we create timestamps to allow users to keep in sync. Maybe provide some helper APIs. And use this mechanism at least in the instrumentations hosted here.
If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.
There is currently an assumption that any numeric value lower than timeOrigin is a performance timestamp:
https://github.com/open-telemetry/opentelemetry-js/blob/1b7246abb49164ec5c2cff7b66956c454f230fed/packages/opentelemetry-core/src/common/time.ts#L88
So it's actually impossible to create spans in the past (w/ accounting for drift) anyway:
Hmm so there is. That is disappointing and actually strikes me as a bug. The time code in core attempts to do way too much magic IMO and really limits us. That specific one can probably be fixed by changing it so that the heuristic is < performance.now() instead of < performance.timeOrigin. It would at least limit impossible times to epoch + time the page has been open.
I'm not sure what we can do there. If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.
We can document how we create timestamps to allow users to keep in sync. Maybe provide some helper APIs. And use this mechanism at least in the instrumentations hosted here.
I agree here documentation is a bare minimum and providing a helper API of some kind to correct drift would be a good idea
This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.
Superseded by https://github.com/open-telemetry/opentelemetry-js/pull/3434.