
Provide a supported way to get anchored clock times

Open dyladan opened this issue 1 year ago • 50 comments

Ref https://github.com/open-telemetry/opentelemetry-js-contrib/issues/1193 Ref https://github.com/open-telemetry/opentelemetry-js-contrib/issues/1209

I wanted to make an issue where we can discuss this separately from the contrib bugs and solve it in a more general way. Here is the relevant conversation so far:

@dyladan Any idea how we could expose the anchored clock to allow use cases like the one in express? @Flarna

We could provide it on the span but that would be an API change. We could also provide an SDK function which takes a span and returns its clock but that would be quite confusing to end users who just want a current time.

We should see if we can solve this more generally as express may not be the only place this happens and it is quite a difficult problem to solve. @dyladan

From an outside point of view, the simplest approach would indeed be to let the SDK handle the clock internally and just use span.end().

I am planning to use OpenTelemetry on a future project and am experimenting with different aspects of observability in a toy project; glad to see the community is responsive and things are sorted out quickly. @alexd6631

We could provide it on the span but that would be an API change. We could also provide an SDK function which takes a span and returns its clock but that would be quite confusing to end users who just want a current time.

I think it should be on the API and quite independent of spans. Best would be something like a TimeProvider interface. The SDK can plug in the implementation matching the platform, or the user might even override it during SDK init. @Flarna

I think it should be on the API and quite independent of spans. Best would be something like a TimeProvider interface. The SDK can plug in the implementation matching the platform, or the user might even override it during SDK init.

How do you make sure you're getting the correct anchored clock? I guess the time provider can use whatever clock the currently active span is using? @dyladan

dyladan avatar Sep 28 '22 13:09 dyladan

Just to get more context/understanding on the background issue you are addressing: is it that the wall clock is not monotonic (for instance, an NTP update could adjust the clock backwards), so you could end up with negative span durations?

alexd6631 avatar Sep 28 '22 13:09 alexd6631

Just to get more context/understanding on the background issue you are addressing: is it that the wall clock is not monotonic (for instance, an NTP update could adjust the clock backwards), so you could end up with negative span durations? @alexd6631

There are a couple of issues at play here. Originally we used the monotonic clock in order to ensure spans always end after they start, even if the system clock is modified during an operation. This can cause issues because the monotonic clock can be paused for various reasons, including the computer sleeping, the lambda runtime pausing, and possibly others. This caused the clock to shift farther and farther into the past as the process lived longer. The monotonic clock also drifts naturally from the system clock as it is not NTP-synced.

The solution we came up with was to use the system clock to determine the "real" time at the start of a trace, and use the monotonic clock for the rest of the trace. This ensures the whole trace never drifts too far from real time, but also that times are monotonic within a trace.
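For readers following along, here is a rough TypeScript sketch of that idea (illustrative only, not the actual SDK implementation): the wall-clock time is captured once when the local root span starts, and later reads add the elapsed monotonic time to that anchor.

```ts
// Illustrative sketch of an anchored clock, not the SDK implementation.
// The anchor pins the wall-clock time once; later reads add monotonic deltas.
class AnchoredClock {
  private readonly anchor = Date.now();                 // wall-clock ms at creation
  private readonly monotonicStart = performance.now();  // monotonic ms at creation

  /** Current time in epoch ms: anchor plus elapsed monotonic time. */
  now(): number {
    return this.anchor + (performance.now() - this.monotonicStart);
  }
}

// Usage: create one clock when the process-local root span starts and reuse
// it for every span in that part of the trace, so times stay monotonic.
const clock = new AnchoredClock();
const startTime = clock.now();
// ... do work ...
const endTime = clock.now(); // always >= startTime
```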

dyladan avatar Sep 28 '22 13:09 dyladan

OK, the solution makes a lot of sense, especially if you consider the spans to be short-lived (a few seconds) compared to the process.

So I guess it would not be appropriate for the API to expose a single current time; since you have a current time per trace (absolute start time of the trace + delta in monotonic time), it would be natural to scope it by trace in the API.

alexd6631 avatar Sep 28 '22 13:09 alexd6631

Also, as a basic/first-time user of OTel I don't see any particular use case where I would want to mess with the underlying clock. Do you have any specific use cases in mind, or is it to leave flexibility in the user API (or just to be able to switch the strategy easily at a later point internally)?

The sole use case I can think of where I would manually set the start/end times of spans is if I were reading logs/events from a legacy DB and wanted to convert them to traces afterwards, but the current trace time would not help there.

alexd6631 avatar Sep 28 '22 13:09 alexd6631

How do you make sure you're getting the correct anchored clock? I guess the time provider can use whatever clock the currently active span is using?

Agreed, that is not possible, as the anchor is currently on the span and propagated to children. So there is one anchor per trace part within a process.

We could move away from using one anchor per process-local trace by always using a global anchored clock. If the anchor moves between span start/end, the SDK should ensure that end >= start. In the end it's a trade-off between adjusting per span vs. adjusting per process-local trace part.

If timestamps are also used in logs and metrics, it's currently a bit confusing to have spans adjusted but not the other parts.

Flarna avatar Sep 28 '22 14:09 Flarna

OK, the solution makes a lot of sense, especially if you consider the spans to be short-lived (a few seconds) compared to the process.

It's not that unusual to have spans with durations of hours, e.g. a CI run consisting of several builds and tests.

Flarna avatar Sep 28 '22 14:09 Flarna

We could move away from using one anchor per process-local trace by always using a global anchored clock.

In that case why not just use Date?

dyladan avatar Sep 28 '22 14:09 dyladan

As far as I remember, high-precision timing (better than ms) was one of the initial targets. But I'm not sure if this is still applicable - especially in browsers, since they reduced the precision a while ago to offer protection against timing attacks and fingerprinting. So maybe using Date is the way to go.

Should we add some hint in the API docs that specifying start/end/... times is likely a bad idea?

Should we also change span.end() to ensure that end>=start instead of only logging but keeping the invalid value?

Flarna avatar Sep 28 '22 14:09 Flarna

As far as I remember, high-precision timing (better than ms) was one of the initial targets. But I'm not sure if this is still applicable - especially in browsers, since they reduced the precision a while ago to offer protection against timing attacks and fingerprinting. So maybe using Date is the way to go.

That's what I remember too. It wouldn't be too hard to change all the existing clock methods to use Date behind the scenes. It is possible to get bad span times during NTP syncs and other things like that, but there shouldn't be that many bad spans/traces. We already have the problem where different hosts might have different times.

Should we also change span.end() to ensure that end>=start instead of only logging but keeping the invalid value?

I think we should at least do this. There is no case where an end<start is a good thing.
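For illustration, a minimal sketch of such a guard (hypothetical helper, not the actual SDK change):

```ts
// Hypothetical guard: clamp the end time so a span can never have a
// negative duration, instead of only logging and keeping the invalid value.
function resolveEndTime(startTimeMs: number, requestedEndTimeMs: number): number {
  if (requestedEndTimeMs < startTimeMs) {
    // A real implementation would likely also emit a diagnostic warning here.
    return startTimeMs; // duration becomes 0 instead of negative
  }
  return requestedEndTimeMs;
}
```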

Should we add some hint in the API docs that specifying start/end/... times is likely a bad idea?

We should at least document the possible problems if inconsistent times are somehow used.

dyladan avatar Sep 28 '22 14:09 dyladan

Just a suggestion/thinking out loud (and probably not applicable due to the existing API ^^), but why not have a "hot" span API intended for live tracing where you cannot specify the times, and a "cold" span API where you are responsible for setting both start and end dates?

alexd6631 avatar Sep 28 '22 14:09 alexd6631

Just a suggestion/thinking out loud (and probably not applicable due to the existing API ^^), but why not have a "hot" span API intended for live tracing where you cannot specify the times, and a "cold" span API where you are responsible for setting both start and end dates?

Mostly because that is not what was in the API specification

dyladan avatar Sep 28 '22 15:09 dyladan

@gregolsen provided an alternative solution in #3320 which makes the anchored clock available in the API.

dyladan avatar Oct 13 '22 13:10 dyladan

Should we also change span.end() to ensure that end>=start instead of only logging but keeping the invalid value?

I think we should at least do this. There is no case where an end<start is a good thing.

Created a PR for this #3327

dyladan avatar Oct 13 '22 14:10 dyladan

Copying some conversation with @gregolsen in #3320 as it is relevant here:

Thanks for the context @dyladan, I hadn't seen that issue when making this PR!

Question: if we start using Date.now() to set span start/end timestamps, how will that work with instrumentation that uses the browser performance API, which returns the number of milliseconds from the time origin? E.g. the document-load instrumentation relies on the browser performance API to get the entries and populate the trace with spans for the different resources that were loaded. I'm thinking about a case where document load is used alongside XHR/Fetch and some custom instrumentation. Ideally all instrumentations should set timestamps the same way for the traces to be consistent. What do you think? Am I missing something?

There's an interesting discussion in another PR I have: open-telemetry/opentelemetry-js-contrib#1092

@gregolsen

Thanks for the context @dyladan, I hadn't seen that issue when making this PR!

Yeah, I figured. It's hard to stay on top of everything ongoing if you're not in it every day, so I thought a link might be helpful. It's also not a settled question by any means, so we're definitely interested in any input you have.

Question: if we start using Date.now() to set span start/end timestamps, how will that work with instrumentation that uses the browser performance API, which returns the number of milliseconds from the time origin?

The performance.now() time still needs to be converted to a timestamp (usually by adding it to timeOrigin), so in the end you end up with two timestamps from different clocks. The performance clock is subject to pauses, but the system clock is not. This definitely is a problem if you're using the web performance API. It actually affects the anchored clock too in its current form, as the times from the web timing API are not adjusted to the anchor.
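For context, a raw performance.now() reading is only an offset from the page's time origin, so converting it and comparing it with the system clock looks roughly like this (simplified sketch):

```ts
// performance.now() is an offset in ms from performance.timeOrigin,
// so an epoch timestamp is usually derived by adding the two.
const perfTimestamp = performance.timeOrigin + performance.now();

// Date.now() reads the system clock directly.
const systemTimestamp = Date.now();

// The two can disagree: the performance clock may pause (sleep, tab freeze),
// while the system clock keeps following NTP adjustments.
const driftMs = systemTimestamp - perfTimestamp;
```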

I don't have a great solution, except that maybe we need to consider two different methods of gathering time: one for web (or short-lived environments) using the performance clock and one for Node using Date. Another option would be one @Flarna proposed at one point, which would be to have a global anchored clock whose anchor is updated periodically. That clock would also have to provide an adjust method or similar so that times from the performance timings API can be adjusted to the anchor. You still have problems when the anchor is changed by a significant amount, but hopefully that is rare enough.

@dyladan

From the above conversation, I actually think it is worth considering an @opentelemetry/time package used by the SDK and instrumentation which provides a global anchored clock which is periodically updated.
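To make that idea concrete, such a package might expose something along these lines (purely a sketch; the class and method names are made up here and no such package exists yet):

```ts
// Hypothetical shared clock that is periodically re-anchored to the system
// clock and can adjust raw performance-API timings onto the same timeline.
class GlobalAnchoredClock {
  // Offset of the system clock relative to the performance clock, in ms.
  private offset = Date.now() - performance.now();

  constructor(reanchorIntervalMs = 60_000) {
    // Re-measure the offset periodically so long pauses (sleep, tab freeze)
    // don't accumulate. A real implementation would have to decide how a
    // moving anchor interacts with in-flight traces.
    setInterval(() => {
      this.offset = Date.now() - performance.now();
    }, reanchorIntervalMs);
  }

  /** Current anchored time as epoch milliseconds. */
  now(): number {
    return this.offset + performance.now();
  }

  /** Shift a raw performance.now()-style reading onto the anchored timeline. */
  adjust(performanceTime: number): number {
    return this.offset + performanceTime;
  }
}
```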

dyladan avatar Oct 14 '22 14:10 dyladan

Just a bit of interesting context on system clock drift. A little JS snippet was provided here to measure how the system clock differs from the monotonic one. I opened it in Firefox (as it glitches in Chrome sometimes) and let it run for about an hour, occasionally running sudo sntp -sS pool.ntp.org to "fix" the system clock – the big spikes of delta returning back to 0 are that NTP adjustment.

[chart: system clock vs. monotonic clock drift, with NTP adjustment spikes]
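The snippet linked above isn't reproduced in the thread, but a measurement along these lines produces that kind of chart (a reconstruction, not the exact snippet):

```ts
// Sample the gap between the system clock and the performance clock once per
// second; NTP adjustments show up as sudden jumps in the logged drift.
setInterval(() => {
  const driftMs = Date.now() - (performance.timeOrigin + performance.now());
  console.log(`drift: ${driftMs.toFixed(1)} ms`);
}, 1000);
```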

I also ran an experiment in production – I added start/end timestamps based on the system clock to all the spans. Unfortunately most of the spans we have are produced by the XHR/Fetch instrumentations – they delay span end via setTimeout to capture useful data about the request from the performance API – hence Date.now() at span end doesn't represent the actual end time of the span. But we do have custom spans which are simply started with tracer.startSpan and ended with span.end(), without custom timestamps. Hence comparing their existing duration with the one computed from the system clock timestamps is fair.

P999 (the 99.9th percentile) is 3.9 milliseconds; however, there are lots of outliers – 2 on the logarithmic scale means a 100 millisecond difference between the original duration and the one computed from the system clock.

[chart: distribution of duration differences]

Filtering to only the spans with over 10ms difference in measured durations, we can see that the relative mismatch is not as bad: with an average duration of 900ms, the average difference is 28ms, ~3.1%.

[chart: spans with >10ms duration difference]

I also found an interesting span (the link is private, just storing it for myself) whose duration was recorded as 900ms (we are using OTel 1.6.0 at the moment) but, based on the system clock, its duration is 0. I would trust the system clock here, as those cache lookups are typically under a millisecond in duration. So, somehow, the monotonic clock measured 900ms out of nowhere:

Screenshot 2022-10-14 at 17 58 17

gregolsen avatar Oct 14 '22 17:10 gregolsen

This trace from the comment above is actually really interesting – it looks like there was a system clock adjustment of 915ms, and it happened between the long span of 914.7ms and the previous span. But the fact that the long span is exactly 914.7ms means that the performance API monotonic duration also jumped by 915ms (hence the span duration).

How is it possible that both the system clock and the values returned by performance.now() jumped forward by the same 915ms? I thought performance.now() is not susceptible to these adjustments.

We also have the performance.timeOrigin value at the time of span creation stored in current_origin, and it hasn't changed - so it must be the performance.now() value that was suddenly adjusted by 915ms:

Screenshot 2022-10-14 at 18 33 24

gregolsen avatar Oct 14 '22 17:10 gregolsen

Thanks for doing the research. Great to have hard numbers.

Assuming I'm understanding the graph correctly, here are the conclusions I think we can draw:

  • Your Date.now clock is quite slow and drifting a lot from "real" time which is interesting
  • Your performance.now clock is fast compared to "real" time
  • Your performance.now clock is more accurate than Date.now
  • Neither clock got more than about 100ms out of sync

My understanding of NTP under normal operation is that unless a manual adjustment is made like you did there, it doesn't make large jumps, but instead modifies the system clock to run slower or faster as needed. Obviously in a browser environment you have no idea what the user will do with their system clock.

I ran the script in Safari for admittedly much less time, but it seems to stay within about 1ms. I'll let it keep running for a while and do an NTP update in about 30 minutes to see if that changes things.

[chart: Safari clock drift]

I know browser vendors play with clocks in order to prevent fingerprinting, so I'm not sure what effect that may have on either or both clocks. I also wouldn't put it past browsers to implement these sorts of things themselves instead of relying on the actual system clock, so this behavior could easily vary from browser to browser.

I thought performance.now() is not susceptible to these adjustments.

I was also under the impression that it was not subject to adjustments. This is interesting information. Are you sure that span didn't actually take 914ms? That's quite a large jump. It also looks like the next few spans took less time than the spans leading up to it. Not sure what happened there.

dyladan avatar Oct 14 '22 18:10 dyladan

I, of course, can't be 100% sure, but those spans are cache lookups where the cache is simply an in-memory lru-cache – they are extremely fast and 914ms seems unlikely. But also, what are the odds of the system clock jumping by the exact same amount before the span has ended? If _internal.system_clock_start were smaller than _internal.system_clock_end by 914ms for that span, it would make sense. But in this case the system clock start and end are exactly the same for that span – in other words, the system clock had already jumped by the time the span was started. So the only logical explanation is that hrTime() returned two values with a 914.7ms difference.

gregolsen avatar Oct 14 '22 19:10 gregolsen

After running quite a bit longer, Safari seems... consistent

[chart: Safari clock drift after a longer run]

dyladan avatar Oct 14 '22 19:10 dyladan

Oh, sorry - I forgot to mention, I'm not sure Safari is a good candidate for investigation, as it doesn't seem to suffer from a difference between the system clock and performance.timeOrigin + performance.now():

[chart: Safari drift measurement]

gregolsen avatar Oct 14 '22 19:10 gregolsen

We are measuring clock drift on every span by computing Date.now() - (performance.timeOrigin + performance.now()). So as you can see, Safari pretty much doesn't have any difference at all. The maximum registered difference for Safari is 1ms (within the margin of error):

[chart: max registered clock drift]

gregolsen avatar Oct 14 '22 19:10 gregolsen

I, of course, can't be 100% sure, but those spans are cache lookups where the cache is simply an in-memory lru-cache – they are extremely fast and 914ms seems unlikely. But also, what are the odds of the system clock jumping by the exact same amount before the span has ended? If _internal.system_clock_start were smaller than _internal.system_clock_end by 914ms for that span, it would make sense. But in this case the system clock start and end are exactly the same for that span – in other words, the system clock had already jumped by the time the span was started. So the only logical explanation is that hrTime() returned two values with a 914.7ms difference.

Yeah I'm really not sure what to make of that. I'm not very familiar with the particular performance characteristics of that module and even if I was I don't know much about your use of it.

What units are you measuring drift in? 24 million anythings for Chrome seems like quite a gap from the system clock. Your graphs definitely seem to show Chrome as the "problem child".

dyladan avatar Oct 14 '22 19:10 dyladan

Also, it's not really clear which, if either, of the clocks is more correct. We only know the difference between them.

dyladan avatar Oct 14 '22 19:10 dyladan

I just found a counterexample to my idea that performance.now() had jumped – a long cache lookup was sandwiched in between two network requests. I found logs for both of them and can confirm that there was a 1.6s gap between the requests. So, at least in this case, the 1.5s duration is not a clock jump but rather a valid gap in the trace. Whether the cache lookup itself is measured correctly is still an open question, as I can't believe it took that long.

Screenshot 2022-10-14 at 20 35 58

Chrome is indeed a problem child - those units are milliseconds, so 2,709,469,466 milliseconds is 2,709,469,466 / 1000 / 3600 / 24 ≈ 31 days of difference between Date.now() and performance.timeOrigin + performance.now(). It is hard to believe; however, just today I asked my colleagues to measure the difference in their browsers – the maximum result I got was 7.5 days!

We have instrumented spans with many timestamps: the original timestamp set by OTel (which comes from hrTime()), a proxy timestamp (we proxy all spans via our backend API), and also the ingestion timestamp Honeycomb records when the span reaches their system. The proxy timestamp, system clock, and ingestion time are in alignment, while the timestamp set by OTel is far in the past thanks to computers sleeping. At the moment we are correcting span timestamps on the backend – we measure the difference Date.now() - (performance.timeOrigin + performance.now()) at the time of the trace start, then set it as a global attribute on every span of that trace - so that every span of a trace is shifted by the exact same amount. We only shift the spans if the difference is positive, i.e. when Date.now() is larger than the time via the performance API. Negative values indicate system clock drift (the kind we saw produced by that JS snippet).
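A rough sketch of that correction scheme, with a made-up attribute name (this only illustrates the approach described above, not the actual implementation):

```ts
// Measure the drift once at trace start and attach it to every span of the
// trace, so the backend can shift all spans of a trace by the same amount.
const driftAtTraceStartMs =
  Date.now() - (performance.timeOrigin + performance.now());

// Recorded as a trace-wide attribute (the attribute name is hypothetical).
const traceAttributes = { 'app.clock_drift_ms': driftAtTraceStartMs };

// Backend-side adjustment: only shift when the system clock is ahead of the
// performance-derived time; negative values are treated as ordinary drift.
function correctTimestamp(spanTimestampMs: number, driftMs: number): number {
  return driftMs > 0 ? spanTimestampMs + driftMs : spanTimestampMs;
}
```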

Here are two heatmaps - one for positive clock drift and one for negative. Negative clock drift is tiny, while positive drift has gigantic outliers.

[heatmap: positive clock drift]

[heatmap: negative clock drift]

gregolsen avatar Oct 14 '22 19:10 gregolsen

Chrome is indeed a problem child - those units are milliseconds, so 2,709,469,466 milliseconds is 2,709,469,466 / 1000 / 3600 / 24 ≈ 31 days of difference between Date.now() and performance.timeOrigin + performance.now(). It is hard to believe; however, just today I asked my colleagues to measure the difference in their browsers – the maximum result I got was 7.5 days!

This is unfortunately not that uncommon, from my (admittedly recently acquired) understanding. I think Chrome's timeOrigin might be calculated from browser startup, not from each tab's start?

We have instrumented spans with many timestamps: the original timestamp set by OTel (which comes from hrTime()), a proxy timestamp (we proxy all spans via our backend API), and also the ingestion timestamp Honeycomb records when the span reaches their system. The proxy timestamp, system clock, and ingestion time are in alignment, while the timestamp set by OTel is far in the past thanks to computers sleeping. At the moment we are correcting span timestamps on the backend – we measure the difference Date.now() - (performance.timeOrigin + performance.now()) at the time of the trace start, then set it as a global attribute on every span of that trace - so that every span of a trace is shifted by the exact same amount. We only shift the spans if the difference is positive, i.e. when Date.now() is larger than the time via the performance API. Negative values indicate system clock drift (the kind we saw produced by that JS snippet).

Here are two heatmaps - one for positive clock drift and one for negative. Negative clock drift is tiny, while positive drift has gigantic outliers.

[heatmap: positive clock drift]

[heatmap: negative clock drift]

Interesting. Even with the new anchored clock implementation you have drifts this bad? Or is this just direct hrTime measurement, not the span time?

dyladan avatar Oct 20 '22 13:10 dyladan

It seems, given the above info, that the performance timer is likely to almost always be some amount in the past. Any span ended with a time manually provided from the performance API is likely to have a negative duration. Even with the quick fix being released soon, those spans will all have a duration of 0, which is useless for monitoring. The previous situation where spans were time-shifted was not great, but span duration is one of our most important pieces of data and we can't just throw it away.

dyladan avatar Oct 20 '22 16:10 dyladan

The options I see:

  • Provide a method for shifting times from the performance api to the currently anchored clock
  • shift all times at the exporter based on an offset calculated from Date for each exported batch (see the sketch after this list)
  • shift all times when span ends instead of when span starts
  • provide a parameter for end different than endTime (performanceEndTime?) which is shifted
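To illustrate the second option, an exporter-side shift could look roughly like this (a hedged sketch with simplified span types, not a proposal for the actual exporter API):

```ts
// Sketch of option 2: compute one offset per export batch from the system
// clock and shift every span timestamp in the batch by that offset.
interface SimpleSpan {
  startTimeMs: number;
  endTimeMs: number;
}

function shiftBatch(batch: SimpleSpan[]): SimpleSpan[] {
  // Offset between the system clock and the performance-derived clock,
  // measured once per exported batch.
  const offsetMs = Date.now() - (performance.timeOrigin + performance.now());
  return batch.map((span) => ({
    startTimeMs: span.startTimeMs + offsetMs,
    endTimeMs: span.endTimeMs + offsetMs,
  }));
}
```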

dyladan avatar Oct 20 '22 17:10 dyladan

The options I see:

  • Provide a method for shifting times from the performance api to the currently anchored clock
  • shift all times at the exporter based on an offset calculated from Date for each exported batch
  • shift all times when span ends instead of when span starts
  • provide a parameter for end different than endTime (performanceEndTime?) which is shifted

I'm glad to see that you have been paying attention to this problem. If the browser-side data collection plugin is based on performance, it is definitely wrong to use timeOrigin to calculate the start and end times; it is only suitable for calculating durations. So I want to know when this problem can be solved? Our application is already used in the production environment, and there will be some errors if you want to query the data for a certain day. Eagerly waiting.

yuanman0109 avatar Oct 21 '22 07:10 yuanman0109

Focusing on just the trace telemetry for a moment, there are three things to optimize for:

  • Span timestamps are close enough to the actual time
  • Span durations are as precise as possible
  • Span timestamps within a given trace are consistent

From the customer's perspective, I believe that optimizing for the last two conditions is more important. When I look at a given trace I want to be 100% sure that the durations indeed represent the actual durations. I also want to make sure that the trace wasn't artificially stretched due to span timestamps being shifted to the past inconsistently. On the other hand, I'd be OK if some of the traces are still in the past, assuming it is a small percentage. I can even filter them out in the observability tooling if that is a problem – it is easy to detect when an ingestion time is recorded as well. The other two inconsistencies are harder to detect and mitigate.

gregolsen avatar Oct 21 '22 08:10 gregolsen

Building on top of the above, I believe it is important to have trace-awareness built into the anchored clock in some way, to make sure that all the spans of a given trace use exactly the same anchor. This ensures all the spans are shifted by the exact same amount of time. Having a global anchored clock that is unaware of the trace will make it hard to ensure consistency across the spans of a given trace if that global anchor is periodically updated. Similarly, shifting spans before export doesn't guarantee that spans of a given trace will be shifted by the same amount of time if the delta between the system clock and the performance API has changed. This would be mitigated if the global anchored clock were never updated and set once on page load. But if the computer sleeps many times while the tab is open, the anchor won't be correct anymore.

We measure time drift before every trace start, and it reduces the number of traces in the past to a tiny fraction. Here we released revision 49330, and you can see how this and other newer revisions replace each other – the chart counts the number of traces.

[chart: trace counts per revision]

But if I filter to the time interval before that revision was released, I can still see some traces in the past (a result of the sleep problem). However, it is only 68 traces overall - 0.0059% - a tiny fraction.

[chart: traces in the past before the revision]

gregolsen avatar Oct 21 '22 10:10 gregolsen