opentelemetry-js icon indicating copy to clipboard operation
opentelemetry-js copied to clipboard

fix(tracing): make spans resilient to performance clock drift

Open dyladan opened this issue 1 year ago • 8 comments

Relates to #3279 particularly https://github.com/open-telemetry/opentelemetry-js/issues/3279#issuecomment-1302575571

Fixes #3355

dyladan avatar Nov 04 '22 20:11 dyladan

Codecov Report

Merging #3384 (eb4c2ae) into main (bd56547) will decrease coverage by 0.70%. The diff coverage is 91.07%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3384      +/-   ##
==========================================
- Coverage   93.27%   92.57%   -0.71%     
==========================================
  Files         247      132     -115     
  Lines        7348     3152    -4196     
  Branches     1512      656     -856     
==========================================
- Hits         6854     2918    -3936     
+ Misses        494      234     -260     
Impacted Files Coverage Δ
packages/opentelemetry-core/src/common/time.ts 95.45% <75.00%> (+0.06%) :arrow_up:
packages/opentelemetry-sdk-trace-base/src/Span.ts 97.87% <93.75%> (-2.13%) :arrow_down:
api/src/platform/node/globalThis.ts 0.00% <0.00%> (-100.00%) :arrow_down:
...s/opentelemetry-core/src/platform/node/sdk-info.ts 0.00% <0.00%> (-100.00%) :arrow_down:
...opentelemetry-core/src/platform/node/globalThis.ts 0.00% <0.00%> (-100.00%) :arrow_down:
...pentelemetry-core/src/platform/node/performance.ts 0.00% <0.00%> (-100.00%) :arrow_down:
.../packages/api-logs/src/platform/node/globalThis.ts 0.00% <0.00%> (-100.00%) :arrow_down:
...lemetry-resources/src/detectors/ProcessDetector.ts 31.81% <0.00%> (-68.19%) :arrow_down:
...perimental/packages/otlp-exporter-base/src/util.ts 78.94% <0.00%> (-15.79%) :arrow_down:
...ckages/opentelemetry-exporter-zipkin/src/zipkin.ts 84.48% <0.00%> (-15.52%) :arrow_down:
... and 119 more

codecov[bot] avatar Nov 10 '22 20:11 codecov[bot]

Made a sandbox with this patch, most frequently used otel web instrumentations (document-load, user-interaction, xhr and fetch) and an artificial time drift simulator:

https://codesandbox.io/s/modest-fermat-4eu09e?file=/src/otel-fixes.ts

It doesn't improve this test case a lot (user interaction's in right time scale, as it's auto + auto), pretty much all cases of performance timestamps are still affected by drift

  1. document load: performance time start + performance time end:

image

  1. fetch: auto start + performance time end (duration becomes negative -> set to 0)

image

t2t2 avatar Nov 14 '22 16:11 t2t2

It doesn't improve this test case a lot (user interaction's in right time scale, as it's auto + auto), pretty much all cases of performance timestamps are still affected by drift

These instrumentations all use auto start and manual end times.

duration becomes negative -> set to 0

If the duration is negative it should be getting the drift offset correction applied, not just set to 0. If you look at the diagram in the linked comment there is no situation where it should be set to 0, so there must be some bad code path.

dyladan avatar Nov 14 '22 19:11 dyladan

Updated codesandbox with pr changes as of now

This was the root of the 0-duration spans in your example too

Yep fixed on auto start + perf end, however span events (input: perf timestamp) also need correcting (now they end up outside the span w/ drift)

image

These instrumentations all use auto start and manual end times.

document-load spans don't, as they use manual (performance API) times for both start and end, and these still get drifted:

image

t2t2 avatar Nov 15 '22 09:11 t2t2

document-load spans don't, as they use manual (performance API) times for both start and end, and these still get drifted

I'm not sure what we can do there. If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.

dyladan avatar Nov 15 '22 12:11 dyladan

If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.

There is currently an assumption that any numeric value lower than timeOrigin is a performance timestamp:

https://github.com/open-telemetry/opentelemetry-js/blob/1b7246abb49164ec5c2cff7b66956c454f230fed/packages/opentelemetry-core/src/common/time.ts#L88

So it's actually impossible to create spans in the past (w/ accounting for drift) anyway:

image

t2t2 avatar Nov 15 '22 13:11 t2t2

I'm not sure what we can do there. If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.

We can document how we create timestamps to allow users to keep in sync. Maybe provide some helper APIs. And use this mechanism at least in the instrumentations hosted here.

Flarna avatar Nov 15 '22 14:11 Flarna

If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.

There is currently an assumption that any numeric value lower than timeOrigin is a performance timestamp:

https://github.com/open-telemetry/opentelemetry-js/blob/1b7246abb49164ec5c2cff7b66956c454f230fed/packages/opentelemetry-core/src/common/time.ts#L88

So it's actually impossible to create spans in the past (w/ accounting for drift) anyway:

Hmm so there is. That is disappointing and actually strikes me as a bug. The time code in core attempts to do way too much magic IMO and really limits us. That specific one can probably be fixed by changing it so that the heuristic is < performance.now() instead of < performance.timeOrigin. It would at least limit impossible times to epoch + time the page has been open.

I'm not sure what we can do there. If the user provides manual start and end times we can't correct drift because they may be intentionally describing a span that happened in the past.

We can document how we create timestamps to allow users to keep in sync. Maybe provide some helper APIs. And use this mechanism at least in the instrumentations hosted here.

I agree here documentation is a bare minimum and providing a helper API of some kind to correct drift would be a good idea

dyladan avatar Nov 15 '22 20:11 dyladan

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions[bot] avatar Jan 23 '23 06:01 github-actions[bot]

Superseded by https://github.com/open-telemetry/opentelemetry-js/pull/3434.

legendecas avatar Jan 24 '23 15:01 legendecas