apm-agent-dotnet icon indicating copy to clipboard operation
apm-agent-dotnet copied to clipboard

Move to Stopwatch.GetTimestamp for measuring duration

Open gregkalapos opened this issue 2 years ago • 2 comments

From https://discuss.elastic.co/t/incorrect-span-duration/299442

Currently we use DateTime.Now, moving to Stopwatch.GetTimestamp would give us higher precision.

Steps:

  • [x] Decide if we want to make this change
  • [ ] If yes, implement it

gregkalapos avatar Mar 14 '22 20:03 gregkalapos

I could reproduce the behavior on a Windows 11 machine with .NET Framework 4.8.

In this constellation, the infamous "Windows 16 ms resolution inaccuracy" becomes evident immediately as the screenshot below shows.

Note: The code sample I used to reproduce the issue creates spans of the following durations:

  • 20 ms
  • 10 ms
  • 5 ms
  • 1 ms
  • 1 tick
spans_dotnet48_default

Running the same sample application with NET 5 (also on Windows) works, however.

spans_dotnet50_default

A quick prototype using a high-resolution timer (as suggested here or here) also gives the desired exact timings on Windows 11 / .NET Framework 4.8.

spans_dotnet48_hires

Suggested next steps

Update the implementation of TimeUtils.TimestampNow so that it uses a custom high-resolution timer only on .NET full framework.

z1c0 avatar Aug 24 '22 08:08 z1c0

To my surprise, the basic unit test in my draft PR did not reproduce the problem. After further research, it turned out that the behavior is even different among .NET 4 Framework version on Windows.

  • net461: This is the version used for the unit test project Elastic.Apm.Tests. It uses C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6.1\mscorlib.dll and gives DateTime.UtcNow with the expected precision.
  • net480: This is the version is used for my initial tests with a sample app. It usesC:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.8\mscorlib.dll where DateTime.UtcNow does not provide the required precision.

Interestingly, the decompiled code for DateTime.Utc looks the same in both versions:

//
// Summary:
//     Gets a System.DateTime object that is set to the current date and time on this
//     computer, expressed as the Coordinated Universal Time (UTC).
//
// Returns:
//     An object whose value is the current UTC date and time.
[__DynamicallyInvokable]
public static DateTime UtcNow
{
    [SecuritySafeCritical]
    [__DynamicallyInvokable]
    get
    {
        long num = 0L;
        if (s_isLeapSecondsSupportedSystem)
        {
            FullSystemTime time = default(FullSystemTime);
            GetSystemTimeWithLeapSecondsHandling(ref time);
            return CreateDateTimeFromSystemTime(ref time);
        }

        num = GetSystemTimeAsFileTime();
        return new DateTime((ulong)(num + 504911232000000000L) | 0x4000000000000000uL);
    }
}

Apparently, the native calls are implemented differently in mscorlib.

Using a custom high-resolution clock for .NET 4 frameworks (Windows-only) would still work thought. However, let's discuss this again shortly after your PTO @gregkalapos.

z1c0 avatar Sep 06 '22 09:09 z1c0

After some research, I have reached the conclusion that in order to fix this problem, we should take the monotonic clock approach (https://github.com/elastic/apm-agent-dotnet/issues/139) after all.

The original idea was to fix the observed problem for Windows/.NET Framework 4.8 specifically and follow up with the general switch to a monotonic clock implementation at a later stage.

However, after trying several approaches, I could not come up with a reliable isolated solution that would solve this problem in .NET Framework 4.8.

The main problem with this isolated fix is that it always requires a combination of a:

  • DateTime.Utc timestamp
  • High-precision offset in ticks.

The viable candidate for such an offset would be StopWatch.Elapsed.Ticks. However, this has the significant downside of a noticeable clock drift that occurs after a relatively short amount of time (up to a second within an hour). For that reason, the StopWatch needs to get "synchronized" (i.e. a new instance created) regularly with the wall clock time, which introduces two problems:

  • Additional GC pressure due to repeated instance creation.
  • More significantly, this causes time jumps (even in a negative direction) on each "synchronization".

z1c0 avatar Sep 29 '22 09:09 z1c0

As an alternative approach, would it be a big lift to update the code paths to accept a TimeProvider (via net8 or bcl nuget package) to provide indirection to grabbing the timestamp? This would allow netfx users to supply an alternative implementation that hits the precise windows time api (which backs the netcore+ DateTime.UtcNow api https://github.com/dotnet/coreclr/pull/9736)

aputlock avatar Mar 05 '24 21:03 aputlock

@stevejgordon Any thoughts on the above approach? (sorry for the @, just wanted to make sure someone saw this given the thread necromancy -- happy to also open a new thread if that's better)

aputlock avatar Apr 10 '24 14:04 aputlock