otp icon indicating copy to clipboard operation
otp copied to clipboard

Improve handling of lagging timers

Open juhlig opened this issue 2 years ago • 4 comments

This PR was inspired by this thread on Erlang Forums. Assuming I understood correctly what @garazdawi pointed out here, timer precision on Windows is 16ms, which could cause timers with timeouts of less than 16ms to be up to 16ms late. This can't be helped.

In the case of interval timers, in the current implementation this would also cause them to increasingly lag behind (or miss events, from another perspective) if their interval was less than 16ms. For example, given a 1ms interval timer, a timeout message would only be sent by the timer at a 16ms tick, and start a new timer with a time in the past, which would also be delivered only at the next 16ms tick. The 1ms interval would thus be stretched to 16ms, or from another perspective, 15 of the 16 timeout events would be lost.

This PR changes the rescheduling behavior of interval timers such that, if the next timeout of an interval timer is already in the past, instead of starting a new timer the timeout message is sent directly. Why send a message instead of just repeating the action the expected number of times? First, that wouldn't work with zero-interval timers (unlikely to be used (in the old pre-OTP25 timer implementation it would have stalled the timer server), but still allowed); second, it allows other messages that arrived after the timeout message to be processed.

This does not remedy the up-to-16ms lateness issue. It just issues all the missed events quickly, until the timer has catched up. Put differently, what will happen now given a 1ms interval timer is that while an event may be 16ms late, the 15 other events that should have been there during that delay are happening in a burst.

Now, I may have missed or got something wrong in the linked discussion, and this change may have no effect at all. I have no way of testing it on Windows. Also, I'm not sure if this burst-like processing is wanted, though it was done in a similar way in the old timer. So, take this as just a suggestion.

juhlig avatar Jul 11 '22 11:07 juhlig

CT Test Results

       2 files       86 suites   31m 48s :stopwatch: 1 783 tests 1 735 :heavy_check_mark: 48 :zzz: 0 :x: 2 065 runs  2 015 :heavy_check_mark: 50 :zzz: 0 :x:

Results for commit 1f58d338.

:recycle: This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

github-actions[bot] avatar Jul 11 '22 11:07 github-actions[bot]

After a quick look, it looks nice. Since we are in the middle of the vacation period now it will, however, take a while before we'll look further into it.

rickard-green avatar Jul 15 '22 21:07 rickard-green

Since we are in the middle of the vacation period now it will, however, take a while before we'll look further into it.

Sure, no problem, I wish you all a nice vacation then :)

juhlig avatar Jul 18 '22 08:07 juhlig

@juhlig wrote

In the case of interval timers, in the current implementation this would also cause them to increasingly lag behind (or miss events, from another perspective) if their interval was less than 16ms. For example, given a 1ms interval timer, a timeout message would only be sent by the timer at a 16ms tick, and start a new timer with a time in the past, which would also be delivered only at the next 16ms tick. The 1ms interval would thus be stretched to 16ms, or from another perspective, 15 of the 16 timeout events would be lost.

No, no timeouts will be lost even without this change. They will however be triggered late. A timer set to an absolute time that has already passed or is at current time will be triggered as soon as possible, i.e, it wont wait until the next 16ms has passed. Since the code add the interval to the absolute time you will eventually reach a situation where you got a whole bunch of timers which will be triggered at once. When stabilized you would typically trigger 16 1ms interval timers every 16 millisecond if the time is updated every 16 millisecond.

I won't reject this PR though. It is still nice to send the timeout message explicitly instead of creating the timer when time already has passed. I'd like the special handling of interval zero removed though, it should be nicely take care of by the other code as well (or do I miss something?).

rickard-green avatar Aug 15 '22 16:08 rickard-green

I'll get back to this next week, right now I'm still on vacation 😉

juhlig avatar Aug 20 '22 11:08 juhlig

No, no timeouts will be lost even without this change. They will however be triggered late. A timer set to an absolute time that has already passed or is at current time will be triggered as soon as possible, i.e, it wont wait until the next 16ms has passed. Since the code add the interval to the absolute time you will eventually reach a situation where you got a whole bunch of timers which will be triggered at once. When stabilized you would typically trigger 16 1ms interval timers every 16 millisecond if the time is updated every 16 millisecond.

Ok, then it looks like I got this wrong, or rather I asked the wrong question :sweat_smile:

I'd like the special handling of interval zero removed though, it should be nicely take care of by the other code as well (or do I miss something?).

Yes and no, depends... :wink: On the one hand, for a zero-timeout interval timer it is clear that the next timeout message is to be sent immediately, so the calculation whether the next time has passed or not is not necessary. On the other hand, this requires another clause to be checked for every arriving timeout message. Now, given that zero-timeout interval timers didn't even work before #4811 (the timer server would stall) but nobody ever complained, it looks like this is never used in practice, so most of the time the zero-timeout clause would be checked in vain. So, tl;dr, I guess it is better to remove the special handling as you suggest. Agree? :wink:

juhlig avatar Aug 22 '22 08:08 juhlig

@rickard-green I removed the special handling for interval zero in the last commit.

There is a strange failure in the doc build that seems unrelated to this PR.

juhlig avatar Aug 24 '22 12:08 juhlig

@juhlig great! I was on vacation last week. I'll put it back into our daily testing (it gets automatically removed when someone pushes to a PR) and let it live there for a couple of days before merging it.

rickard-green avatar Aug 29 '22 10:08 rickard-green

You know what, I think I have an idea that may significantly reduce the congestion on the timer server that may be created by many short-interval timers, and thereby make them lag less. But I'll have to turn it over in my head a bit more, I haven't worked it out fully yet. Anyway, it won't block or be blocked by this PR, so go ahead and merge whenever you're ready.

Maria-12648430 avatar Aug 29 '22 11:08 Maria-12648430

You know what, I think I have an idea

... and now I'm ~worried~ thrilled... :grin:

that may significantly reduce the congestion on the timer server that may be created by many short-interval timers, and thereby make them lag less.

Ok, let's hear it :smiley:

juhlig avatar Aug 29 '22 13:08 juhlig

Ok, let's hear it smiley

@juhlig ---> IRC?

Maria-12648430 avatar Aug 29 '22 13:08 Maria-12648430

@rickard-green we were much faster than I had expected, and now have opened a competing PR #6256. What do you think?

Maria-12648430 avatar Aug 29 '22 14:08 Maria-12648430

@rickard-green we were much faster than I had expected, and now have opened a competing PR #6256. What do you think?

I've taken this PR out of the daily tests and put #6256 in testing.

rickard-green avatar Aug 29 '22 18:08 rickard-green

Superseded by #6256.

juhlig avatar Sep 09 '22 22:09 juhlig