apm-agent-dotnet
apm-agent-dotnet copied to clipboard
Move to Stopwatch.GetTimestamp for measuring duration
From https://discuss.elastic.co/t/incorrect-span-duration/299442
Currently we use DateTime.Now
, moving to Stopwatch.GetTimestamp
would give us higher precision.
Steps:
- [x] Decide if we want to make this change
- [ ] If yes, implement it
I could reproduce the behavior on a Windows 11 machine with .NET Framework 4.8.
In this constellation, the infamous "Windows 16 ms resolution inaccuracy" becomes evident immediately as the screenshot below shows.
Note: The code sample I used to reproduce the issue creates spans of the following durations:
- 20 ms
- 10 ms
- 5 ms
- 1 ms
- 1 tick
data:image/s3,"s3://crabby-images/5720c/5720c37444859065eb51398a7ca497840c2f72e0" alt="spans_dotnet48_default"
Running the same sample application with NET 5 (also on Windows) works, however.
data:image/s3,"s3://crabby-images/c4fa1/c4fa17fba03684fd61d0839600363c472894bd3f" alt="spans_dotnet50_default"
A quick prototype using a high-resolution timer (as suggested here or here) also gives the desired exact timings on Windows 11 / .NET Framework 4.8.
data:image/s3,"s3://crabby-images/eb9e7/eb9e7395cd68487e3501251f4cc0abdd3aa5ee5f" alt="spans_dotnet48_hires"
Suggested next steps
Update the implementation of TimeUtils.TimestampNow
so that it uses a custom high-resolution timer only on .NET full framework.
To my surprise, the basic unit test in my draft PR did not reproduce the problem. After further research, it turned out that the behavior is even different among .NET 4 Framework version on Windows.
-
net461
: This is the version used for the unit test projectElastic.Apm.Tests
. It usesC:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6.1\mscorlib.dll
and givesDateTime.UtcNow
with the expected precision. -
net480
: This is the version is used for my initial tests with a sample app. It usesC:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\mscorlib.dll
whereDateTime.UtcNow
does not provide the required precision.
Interestingly, the decompiled code for DateTime.Utc
looks the same in both versions:
//
// Summary:
// Gets a System.DateTime object that is set to the current date and time on this
// computer, expressed as the Coordinated Universal Time (UTC).
//
// Returns:
// An object whose value is the current UTC date and time.
[__DynamicallyInvokable]
public static DateTime UtcNow
{
[SecuritySafeCritical]
[__DynamicallyInvokable]
get
{
long num = 0L;
if (s_isLeapSecondsSupportedSystem)
{
FullSystemTime time = default(FullSystemTime);
GetSystemTimeWithLeapSecondsHandling(ref time);
return CreateDateTimeFromSystemTime(ref time);
}
num = GetSystemTimeAsFileTime();
return new DateTime((ulong)(num + 504911232000000000L) | 0x4000000000000000uL);
}
}
Apparently, the native calls are implemented differently in mscorlib
.
Using a custom high-resolution clock for .NET 4 frameworks (Windows-only) would still work thought. However, let's discuss this again shortly after your PTO @gregkalapos.
After some research, I have reached the conclusion that in order to fix this problem, we should take the monotonic clock approach (https://github.com/elastic/apm-agent-dotnet/issues/139) after all.
The original idea was to fix the observed problem for Windows/.NET Framework 4.8 specifically and follow up with the general switch to a monotonic clock implementation at a later stage.
However, after trying several approaches, I could not come up with a reliable isolated solution that would solve this problem in .NET Framework 4.8.
The main problem with this isolated fix is that it always requires a combination of a:
-
DateTime.Utc
timestamp - High-precision offset in ticks.
The viable candidate for such an offset would be StopWatch.Elapsed.Ticks
. However, this has the significant downside of a noticeable clock drift that occurs after a relatively short amount of time (up to a second within an hour).
For that reason, the StopWatch
needs to get "synchronized" (i.e. a new instance created) regularly with the wall clock time, which introduces two problems:
- Additional GC pressure due to repeated instance creation.
- More significantly, this causes time jumps (even in a negative direction) on each "synchronization".
As an alternative approach, would it be a big lift to update the code paths to accept a TimeProvider (via net8 or bcl nuget package) to provide indirection to grabbing the timestamp? This would allow netfx users to supply an alternative implementation that hits the precise windows time api (which backs the netcore+ DateTime.UtcNow
api https://github.com/dotnet/coreclr/pull/9736)
@stevejgordon Any thoughts on the above approach? (sorry for the @, just wanted to make sure someone saw this given the thread necromancy -- happy to also open a new thread if that's better)