rcl rcl_wait() returns early when a timer awakes

Bug report

Required Info:

Operating System:
- Windows 10
Installation type:
- From source
Version or commit hash:
- master
DDS implementation:
- RTI Connext

Steps to reproduce issue

Run timer tests using rmw_connext_cpp.

Expected behavior

Tests pass.

Actual behavior

Often, tests fail when rcl_wait() returns early when a timer awakes.

Additional information

For one, even though a clock is specified for timers, how rmw_wait() abides to the given timeout is not specified. This issue persists for both steady time and system time timers.

This issue does not occur when using rmw_fastrtps*_cpp or rmw_cyclonedds_cpp, as these return at the right time or (most often) past it.

Suspecting a bug in rmw_connext_cpp, I could confirm that the expected timeout is passed to RTI Connext's WaitSet::wait() API which nonetheless returns early for both steady time and system time clocks. However, I believe this is less of a bug in RTI Connext than a misuse of rmw_wait() for rcl timers' implementation, considering time resolution. The closest achievable timeout on a given OS is not necessarily greater than or equal than any one specified.

Jun 16 '20 15:06 hidmic

One potential workaround would be to relax the definition of a ready rcl timer using a platform specific time tolerance, and even make that configurable for a user to tune it.

Jun 16 '20 15:06 hidmic

Alternatively, since it's quite likely not every possible timer period can be achieved due to time resolution constrains, the expectation on rcl timers' period could be relaxed to be just an average quantity over some time window. For instance, if some OS time resolution is 200us and a timer is configured with 500us period, by sleeping 400us and 600us in an alternate fashion it can average the intended 2kHz.

Jun 16 '20 15:06 hidmic

CC @ivanpauno @wjwwood @peterpena

Jun 16 '20 15:06 hidmic

I guess what I don't understand about this is that if you ask for 1 nanosecond sleep, then yeah, it might take the OS considerably longer than that to return (maybe on the order of 10ms for time slicing). But if you ask for 10ms sleep, the OS should never return in 9 ms; that's just not doing what the user application asked for.

Extending this to rmw_wait(), I would expect the same thing. It seems like the contract for rmw_wait() should be "wait for at least wait_timeout, unless there was a ready entity, then return early". Do you agree, or am I misunderstanding the problem entirely?

Jun 16 '20 16:06 clalancette

That is true for a plain sleep, but arguable for a time deadline (and even more so if you want to sustain some periodicity, but that's unrelated to rmw_wait()). And I have the impression that may be the case in Windows under specific circumstances (see second paragraph in this MSVC documentation article). We do not know how RTI Connext implements rmw_wait() timeouts though.

Jun 16 '20 17:06 hidmic

I agree with @clalancette about the expectation that it should never return before the time period the user specified. This is the case on Linux and macOS, and I would have extended that to rmw_wait or dds_wait, but maybe that's not guaranteed.

If a timer interval, or relative expiration time, is instead specified, the expiration can occur up to a period earlier than or a period later than the specified time, depending on where exactly the start and end times of this interval fall between system clock ticks.

That block from the Microsoft docs makes me sad, but basically I agree it could end early based on that, assuming that's the same kind of timer that is being used by RTI.

This is also a condition we could check for and at least log something about it. Or even we could go back to sleep if needed.

On the other hand, we could adjust the tests (and our documentation) to reflect reality on Windows (if I understand the core issue correctly).

Jun 16 '20 17:06 wjwwood

This is also a condition we could check for and at least log something about it. Or even we could go back to sleep if needed.

That's fair for wait() APIs, but I'm not entirely convinced we should ignore how this affects rcl timers performance if we consistently overshoot timer timeouts. Or the documentation should be explicit about timer periods being a rough lower bound.

On the other hand, we could adjust the tests (and our documentation) to reflect reality on Windows (if I understand the core issue correctly).

I agree with updating wait() APIs documentation accordingly. Specifying what clock that timeout is defined against would be good too.

Jun 16 '20 17:06 hidmic

I guess what I don't understand about this is that if you ask for 1 nanosecond sleep, then yeah, it might take the OS considerably longer than that to return (maybe on the order of 10ms for time slicing)

:+1:

But if you ask for 10ms sleep, the OS should never return in 9 ms; that's just not doing what the user application asked for.

:+1: I think this one is the problem we have to solve.

And I have the impression that may be the case in Windows under specific circumstances (see second paragraph in this MSVC documentation article).

It's not only that. rmw_wait uses always the system clock (at least for connext), and timers can use any clock. Thus, I think the logic in rcl to decide the rmw_wait duration isn't valid, and can be problematic in all platforms. We might add a tolerance or not, but we do need some sort of "retry wait" logic in case rmw_wait waked up early.

Jun 16 '20 17:06 ivanpauno

Thus, I think the logic in rcl to decide the rmw_wait duration isn't valid, and can be problematic in all platforms.

There's that too, yeah. And waking up earlier won't do if the timer clock goes faster than the system's (typical in CI settings using simulation with an RTF > 1). This is a separate issue though.

But if you ask for 10ms sleep, the OS should never return in 9 ms; that's just not doing what the user application asked for.

I will say that a delay (or sleep) and a timeout are two conceptually different things. If you ask for a 10ms delay, then yes, you'd expect it to last 10ms at least. But if you set a 10ms timeout, I'd expect it to last 10ms at most. And thus RTI Connext's WaitSet::wait() behavior makes more sense to me than Fast-RTPS' or CycloneDDS' if and only if it's not possible to get any closer to the requested time due to clock resolution.

Jun 16 '20 18:06 hidmic

I will point out that that MSVC article says that starting on Windows 8, KeDelayExecutionThread is a lot more precise. It doesn't say anything about KeWaitForSingleObject or KeSetTimer, so I assume they still have the old behavior and can wake up early. Unfortunately, we don't really know which of the APIs Connext is using.

That's fair for wait() APIs, but I'm not entirely convinced we should ignore how this affects rcl timers performance if we consistently overshoot timer timeouts. Or the documentation should be explicit about timer periods being a rough lower bound.

To be fair, on a non-realtime OS, all timer periods are a rough lower bound. So I think that expectation is pretty well set, and I think we could go back to sleep in that case (note that in the above discussion, I've been ignoring signals on Unix anyway, which is another case that might cause a sleep to wake up early and have to go back to sleep). But I definitely agree with you that the documentation should be clear about this.

I will say that a delay (or sleep) and a timeout are two conceptually different things. If you ask for a 10ms delay, then yes, you'd expect it to last 10ms at least. But if you set a 10ms timeout, I'd expect it to last 10ms at most.

I don't think so. At least, that's not how I understand e.g. the timeout on select, the man page for which says:

       The timeout argument specifies the interval that select() should block waiting for a file descriptor to become ready.  The call will block  until
       either:

       *  a file descriptor becomes ready;

       *  the call is interrupted by a signal handler; or

       *  the timeout expires.

       Note  that  the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval
       may overrun by a small amount.  If both fields of the timeval structure are zero,  then  select()  returns  immediately.   (This  is  useful  for
       polling.)  If timeout is NULL (no timeout), select() can block indefinitely.

Jun 16 '20 18:06 clalancette

To be fair, on a non-realtime OS, all timer periods are a rough lower bound.

That is true. Not sure about embedded ROS though.

At least, that's not how I understand e.g. the timeout on select

IMHO the choices that select() (or any other POSIX API) makes about how to deal with clock resolution is of no consequence. It does make it way easier to implement if the specified time is greater than 0 but less than one clock resolution unit. We can make the same choice, I just think we can do better.

Jun 16 '20 18:06 hidmic

I think resolving this issue could also resolve some rclpy related failures: https://github.com/ros2/rclpy/issues/267 https://github.com/ros2/rclpy/issues/513

Or at least they are related, the rclpy issues happen across multiple rmw implementations.

Jun 16 '20 20:06 brawner

But if you set a 10ms timeout, I'd expect it to last 10ms at most.

Yeah, I can't agree with this either. For me a timeout is always "we'll start trying to exit after this period of time". In fact I'd never assume it comes back before or "on time".

Jun 16 '20 20:06 wjwwood

But if you set a 10ms timeout, I'd expect it to last 10ms at most.

:+1: too

Jun 16 '20 21:06 ivanpauno

Oh well, I guess we agree to disagree.

In any case, we have an rmw_wait() that may return too early and an rcl_timer_t that depends on it not doing that. I'll try to come up with a patch to fix the latter. For the former, besides documenting expectations, maybe we can ask RTI for clarification on timeout behavior.

Jun 16 '20 21:06 hidmic

I think resolving this issue could also resolve some rclpy related failures: ros2/rclpy#267 ros2/rclpy#513

Or at least they are related, the rclpy issues happen across multiple rmw implementations.

AFAIR, those failures are due to the timer waking up too late (windows server scheduler time slice is much bigger than macOS and linux). They might also be related with this though.

Jun 16 '20 21:06 ivanpauno

@neil-rti maybe you can weigh in here.

Jun 17 '20 20:06 hidmic

With https://github.com/ros2/rmw/pull/275, rmw_wait() API documentation now mentions this little timing detail. I still have to benchmark modified timers in #707 though.

Oct 21 '20 15:10 hidmic

This one's still causing regressions (I think): https://ci.ros2.org/view/nightly/job/nightly_win_rel/lastCompletedBuild/testReport/projectroot/test/test_timer__rmw_connextdds/

Apr 06 '21 16:04 hidmic

@asorbini had some thoughts about this. I don't remember the exact details, but there was a mismatch on Windows between the clock source that Connext is using internally and the one that we are using in rcl. Maybe he can comment.

Apr 06 '21 17:04 clalancette

My guess was that these errors might have had something to do with the fact that rcutils uses GetSystemTimePreciseAsFileTime() while Connext uses _ftime(). I posited that the problem might be stemming from the different resolutions supported by these functions (1ns and 1ms, respectively). I'll paste here my comments from an email thread. I doubt there is actually any "rounding up" happening, but it might instead have something to do with a "timers might wake up up to one period early" issue on Windows, that I could swear I saw mentioned in some ros2 issue, but I can't find at the moment...

Anyway, I'm afraid I don't yet have a good answer to why this is actually happening.

my investigation on the failure in TestTimerFixture.test_two_timers_ready_before_timeout on Windows leads me to confirm that the issue stems from the different clocks used by Connext and by rcl.

The rcl_wait() call in that test seems to sometimes wakeup at most 1ms earlier than expected, and this would be consistent (I think) with the fact that Connext uses a 1ms-resolution clock (provided by _ftime()), while rcl relies on a finer-grain clock (provided by GetSystemTimePreciseAsFileTime()). I haven't checked this assumption yet, but I think that rcl implements timers by making the timeout passed to rmw_wait() smaller than the next timer to trigger, which would be consistent with my theory: Connext is relying on its internal clock to detect the timeout, and ends up waking up too early, e.g. the clock tick used by Connext is "rounded up" to 10ms while the clock used by rcl still reports, say, 9500000ns.

This would explain why I've seen the test fail in two different points:

Sometimes here: https://github.com/ros2/rcl/blob/master/rcl/test/rcl/test_timer.cpp#L318 when checking if the timer expected to have triggered is actually ready.

And sometimes here https://github.com/ros2/rcl/blob/master/rcl/test/rcl/test_timer.cpp#L322, which means that the timer had actually become ready by the time rcl_timer_is_ready() was called, but it wasn't when rmw_wait() returned and it found that the timer hadn't actually expired yet and thus set its pointer to NULL in the results (again, making some assumption here, because I haven't reviewed this code in rcl yet).

While this investigation leads me to exclude problems in the RMW, it still brings up a concerning situation for Windows users of rmw_connextdds, which, as of now, risk seeing a wait on a timer return at most 1ms too early (this upper bound is still another guess of mine, and should be confirmed...). I think the bug could be avoided by coding the application "defensively" upon return from rcl_wait(), to check that the timer is actually ready, and then calling rcl_wait() again if it isn't. This would of course be an unfortunate burden on user, and I'm not even sure if it's possible from other language bindings (e.g. rclcpp, and rclpy).

Note that once ros2/rmw_connextdds#22 is merged, the default implementation of WaitSets for rmw_connextdds will not rely on DDS objects anymore, so we might see changes (hopefully positive) with these test failures.

EDIT: maybe I was thinking of this Microsoft page:

If a timer interval, or relative expiration time, is instead specified, the expiration can occur up to a period earlier than or a period later than the specified time, depending on where exactly the start and end times of this interval fall between system clock ticks.

Apr 06 '21 18:04 asorbini

Thanks for the follow-up!

Apr 06 '21 18:04 clalancette