Use Environment.TickCount64 to calculate how much time has passed for ReliableSessions session timeout

Open DareDevilDenis opened this issue 3 years ago • 8 comments

Original bug report below. This issue now tracks the feature enhancement of switching to Environment.TickCount64 instead of doing math with DateTime.UtcNow, because Environment.TickCount64 doesn't have time travel problems when the system clock is changed.

Describe the bug I have a WCF NetTcpBinding server (.NET Framework 4.8) whose binding uses ReliableSession. I have a WCF client running on the same Windows PC that establishes a connection to the server and keeps the connection open for a long time. Both sides have the ReliableSession InactivityTimeout set to 120 seconds.

The issue is: If the PC's system clock is manually advanced by 3 minutes then the WCF connection drops.

You might say "don't change the system clock!", but in our corporate environment we have PCs whose time is synchronised to a time server, and on one of the PCs the time was automatically advanced by a few minutes, causing the WCF connection to drop. I think that WCF should use its own internal timers for the reliable session, not the PC system time.

To Reproduce Steps to reproduce the behavior:

  • Establish a WCF NetTcpBinding ReliableSession with InactivityTimeout set to 120 seconds
  • Advance the PC system clock by 3 minutes

Expected behavior The WCF connection should not drop (especially since it's all happening on the same PC).

DareDevilDenis avatar Feb 24 '22 18:02 DareDevilDenis

We have experienced a similar issue, running .NET 5 on mcr.microsoft.com/dotnet/aspnet:5.0-focal.

This also occurred when the time server was out of sync. It appears to be resolved now that the time server is back in sync.
We did not find the cause of the issue, but your situation seems similar to ours.

jongpieter avatar Feb 28 '22 10:02 jongpieter

Reliable sessions keep track of the last time they had communication with the server. When you move the system clock forward 3 minutes, the session now believes it has been 3 minutes since it last had any communication. This is over the threshold where WCF decides that the connection is unrecoverable, as it presumes it has tried reconnecting during that time.

If you expect the clock jumping forwards to be a regular occurrence, you can modify the binding to increase the tolerance of how long it's willing to go without having communicated with the server. You can modify the inactivity timeout through the property NetTcpBinding.ReliableSession.InactivityTimeout. We have a retry mechanism where we attempt to retry at half this time, so increasing the InactivityTimeout will result in it taking longer to discover an issue and reconnect. For example, if you set the timeout to 5 minutes, make a call, and the socket dies shortly after, it will take 2 1/2 minutes before we attempt to reconnect. You have a choice between tolerating longer times to detect a problem and tolerating the clock jumping forward. There really isn't another way to know how long it's been since we've last had communication than using the system clock.
Closing the issue as it's working exactly as intended and we can't do anything about time travelling system clocks other than to believe the time reported to us.
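The trade-off described above can be sketched in code. This is a minimal illustration, assuming a reference to System.ServiceModel; the class name `ReliableBindingFactory` and the helper `RetryInterval` are hypothetical, and the "retry at half the timeout" arithmetic simply restates the comment above rather than reading any value out of WCF.

```csharp
using System;
using System.ServiceModel;

// Hypothetical helper, not part of WCF.
public static class ReliableBindingFactory
{
    // Creates a NetTcpBinding with reliable sessions enabled and the
    // given inactivity timeout. A larger timeout tolerates bigger forward
    // clock jumps but delays detection of a dead connection.
    public static NetTcpBinding Create(TimeSpan inactivityTimeout)
    {
        var binding = new NetTcpBinding(SecurityMode.Transport, reliableSessionEnabled: true);
        binding.ReliableSession.InactivityTimeout = inactivityTimeout;
        return binding;
    }

    // Per the explanation above, a retry is attempted at half the
    // inactivity timeout (assumption restated from the thread).
    public static TimeSpan RetryInterval(TimeSpan inactivityTimeout)
    {
        return TimeSpan.FromTicks(inactivityTimeout.Ticks / 2);
    }
}
```

With `Create(TimeSpan.FromMinutes(5))`, a forward clock jump of 3 minutes stays under the threshold, while `RetryInterval` gives the 2 1/2 minutes mentioned above. The same timeout would need to be applied on both the client and the server binding.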

mconnew avatar Mar 03 '22 22:03 mconnew

Thanks for the explanation @mconnew.

Why does the system clock have to be used to determine how long since the last communication? .NET internal timers e.g. System.Threading.Timer are not affected by the system clock and so are more robust.

I don't think the end user should be able to cause an existing WCF connection to drop simply because they decide to change the system clock (e.g. changing the time for daylight saving time).

DareDevilDenis avatar Mar 07 '22 10:03 DareDevilDenis

There's a difference between saying "Do X in 180 seconds" and "I've just received something on the network, how long has it been since the last time I received something?". You can't answer that second question using System.Threading.Timer.

System.Threading.Timer and ManualResetEvent (that you mentioned before your edit) use system clock ticks to achieve the wait. Basically, there's a system clock tick which is generally configured to fire at a set period, usually 15ms, but which can be modified by applications. This clock tick drives a lot of concurrency in Windows. For example, thread context switches happen on this clock tick. As it's based on ticks, it's immune to clock skew. But it also has the side effect of reduced resolution. You can't, for example, set a timer for 20ms. It will fire at either 15ms or 30ms.

I had a look at the source behind Stopwatch, as that can provide elapsed-time information which isn't vulnerable to clock skew, but it has 2 problems. First, it's not consistent. On some systems it will just use the clock date and time, so it is vulnerable to clock skew anyway. It's also quite an expensive operation compared to DateTime.Now. I did find the API Environment.TickCount64, which looks like it might be a usable replacement. It has the same resolution problems as it's still based on ticks, but it doesn't translate it into a ms time, so we don't need to do any math other than subtraction.
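To make the idea concrete, here is a minimal sketch of answering "how long has it been since the last communication?" with Environment.TickCount64 instead of DateTime.UtcNow. The class `InactivityTracker` and its members are hypothetical and are not the WCF implementation. Note that Environment.TickCount64 reports milliseconds since system start and is available on .NET Core 3.0 and later, not on .NET Framework 4.8.

```csharp
using System;
using System.Threading;

// Hypothetical illustration; not the ReliableSessions code.
public sealed class InactivityTracker
{
    private readonly long _timeoutMs;
    private long _lastActivityMs;

    public InactivityTracker(TimeSpan timeout)
    {
        _timeoutMs = (long)timeout.TotalMilliseconds;
        _lastActivityMs = Environment.TickCount64;
    }

    // Call whenever something is received on the network.
    public void RecordActivity()
    {
        Interlocked.Exchange(ref _lastActivityMs, Environment.TickCount64);
    }

    // Elapsed time is a plain subtraction of two tick counts, so changing
    // the system clock does not affect it.
    public TimeSpan Elapsed
    {
        get
        {
            long last = Interlocked.Read(ref _lastActivityMs);
            return TimeSpan.FromMilliseconds(Environment.TickCount64 - last);
        }
    }

    public bool HasTimedOut
    {
        get { return Elapsed.TotalMilliseconds >= _timeoutMs; }
    }
}
```

A caller would create the tracker with the configured inactivity timeout, call `RecordActivity()` on each received message, and check `HasTimedOut` when deciding whether the session has expired. The resolution is still limited by the system tick period, as noted above.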

So it looks like I may have found a viable .NET API to use, but the problem is the amount of time to implement and the priority. The only way a user can change the system clock is if they have Admin rights. If a user has Admin rights and is messing with things which require admin privileges, they are going to break things. There are many other things which get broken by users messing with the clock. For example, Windows auth is only tolerant to a maximum of 15 minutes of clock skew. Web page caches are timestamped with expiry times. If users are going to have Admin privileges and then go and mess around with system settings, the problem is the user who is messing with system settings. They could also go and mess around with firewall rules and break things too.

You can monitor for system clock changes, so maybe a warning would be appropriate to tell the end user that what they are doing might cause problems. Here's the docs.
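The link in the comment above did not survive, so the following is only one possible way to monitor clock changes on Windows: the `Microsoft.Win32.SystemEvents.TimeChanged` event. This is an assumption about what was meant, not a quote from the linked docs. The class `ClockChangeMonitor` and its message text are hypothetical. On .NET Core and later, this API comes from the Microsoft.Win32.SystemEvents package and works on Windows only.

```csharp
using System;
using Microsoft.Win32;

// Hypothetical helper that warns when the system time is changed.
public static class ClockChangeMonitor
{
    public static void Start()
    {
        SystemEvents.TimeChanged += OnTimeChanged;
    }

    // SystemEvents exposes static events, so detach the handler when it
    // is no longer needed to avoid keeping it alive.
    public static void Stop()
    {
        SystemEvents.TimeChanged -= OnTimeChanged;
    }

    public static string FormatWarning(DateTime utcNow)
    {
        return "System clock changed (now " + utcNow.ToString("o") +
               "); reliable sessions may drop.";
    }

    private static void OnTimeChanged(object sender, EventArgs e)
    {
        Console.Error.WriteLine(FormatWarning(DateTime.UtcNow));
    }
}
```

An application would call `ClockChangeMonitor.Start()` at startup and `ClockChangeMonitor.Stop()` at shutdown.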

There are other issues with a much higher return on investment than rewriting the session expiry code in Reliable Sessions when the goal is to accommodate problematic users who mess around with the system clock. If you want to submit a PR to modify the behavior to use Environment.TickCount64, I would accept it as long as the behavior is equivalent. We have reasonably good tests around making sure timeouts happen when they should for ReliableSessions. Bringing named pipes support will unblock a lot more developers than making Reliable Sessions tolerant to a curious end user messing with system settings. I say "curious" and "messing with" because Windows has a built-in ability to sync the system clock to a time server, and that's on by default, so there's really zero reason to be touching the system clock.

mconnew avatar Mar 08 '22 01:03 mconnew

Thanks for the detailed response @mconnew

This weekend in the US the clocks go forward by 1 hour for daylight saving time (Sunday March 13 at 2am). This means that a PC with "Adjust for daylight saving time automatically" set will suffer from this issue: Reliable session WCF connections will drop. I have customers running overnight test campaigns so I am expecting to get tickets raised by them because of this.

So I don't think the problem is restricted to users with admin rights messing with the system settings. I appreciate that there are higher return issues than this; however, I think this would be a very nice robustness improvement.

By the way (some background info), the way that I first bumped into this issue was from a customer reporting dropped connections where both the client and server were running on a single virtual machine. Upon inspection of the machine it was found that every 17 minutes the system clock jumped back by 3 minutes for about 30 seconds and then jumped forwards again. The reason was that the VM was updating the time from the VM host through VMware, independently of the OS. It turns out that the host and the OS were using different time servers!

DareDevilDenis avatar Mar 08 '22 10:03 DareDevilDenis

I wouldn't expect that to break anything. The time isn't changing, just the time zone offset. I believe we use DateTime.UtcNow everywhere important, which should remain constant through a daylight saving change.

mconnew avatar Mar 08 '22 15:03 mconnew

Thanks @mconnew. That's good news - so the problem is not as bad as I feared 😊. I just tried this (changing the time zone and turning daylight saving on and off) and, as you suggested, it works. Sorry about that.

Is there a way to track the "Environment.TickCount64" enhancement idea not as a bug but as a small feature?

DareDevilDenis avatar Mar 11 '22 14:03 DareDevilDenis

I'll retitle this issue and can use this to track it.

mconnew avatar Mar 11 '22 18:03 mconnew