rcl_wait() returns early when a timer awakes
Bug report
Required Info:
- Operating System:
- Windows 10
- Installation type:
- From source
- Version or commit hash:
- DDS implementation:
- RTI Connext
Steps to reproduce issue
Run timer tests using rmw_connext_cpp.
Expected behavior
Tests pass.
Actual behavior
Often, tests fail when rcl_wait() returns early when a timer awakes.
Additional information
For one, even though a clock is specified for timers, how rmw_wait() abides to the given timeout is not specified. This issue persists for both steady time and system time timers.
This issue does not occur when using rmw_fastrtps*_cpp or rmw_cyclonedds_cpp, as these return at the right time or (most often) past it.
Suspecting a bug in rmw_connext_cpp, I could confirm that the expected timeout is passed to RTI Connext's WaitSet::wait() API which nonetheless returns early for both steady time and system time clocks. However, I believe this is less of a bug in RTI Connext than a misuse of rmw_wait() for rcl timers' implementation, considering time resolution. The closest achievable timeout on a given OS is not necessarily greater than or equal than any one specified.
One potential workaround would be to relax the definition of a ready rcl timer using a platform specific time tolerance, and even make that configurable for a user to tune it.
Alternatively, since it's quite likely not every possible timer period can be achieved due to time resolution constrains, the expectation on rcl timers' period could be relaxed to be just an average quantity over some time window. For instance, if some OS time resolution is 200us and a timer is configured with 500us period, by sleeping 400us and 600us in an alternate fashion it can average the intended 2kHz.
CC @ivanpauno @wjwwood @peterpena
I guess what I don't understand about this is that if you ask for 1 nanosecond sleep, then yeah, it might take the OS considerably longer than that to return (maybe on the order of 10ms for time slicing). But if you ask for 10ms sleep, the OS should never return in 9 ms; that's just not doing what the user application asked for.
Extending this to rmw_wait(), I would expect the same thing. It seems like the contract for rmw_wait() should be "wait for at least wait_timeout, unless there was a ready entity, then return early". Do you agree, or am I misunderstanding the problem entirely?
That is true for a plain sleep, but arguable for a time deadline (and even more so if you want to sustain some periodicity, but that's unrelated to rmw_wait()). And I have the impression that may be the case in Windows under specific circumstances (see second paragraph in this MSVC documentation article). We do not know how RTI Connext implements rmw_wait() timeouts though.
I agree with @clalancette about the expectation that it should never return before the time period the user specified. This is the case on Linux and macOS, and I would have extended that to rmw_wait or dds_wait, but maybe that's not guaranteed.
If a timer interval, or relative expiration time, is instead specified, the expiration can occur up to a period earlier than or a period later than the specified time, depending on where exactly the start and end times of this interval fall between system clock ticks.
That block from the Microsoft docs makes me sad, but basically I agree it could end early based on that, assuming that's the same kind of timer that is being used by RTI.
This is also a condition we could check for and at least log something about it. Or even we could go back to sleep if needed.
On the other hand, we could adjust the tests (and our documentation) to reflect reality on Windows (if I understand the core issue correctly).
This is also a condition we could check for and at least log something about it. Or even we could go back to sleep if needed.
That's fair for wait() APIs, but I'm not entirely convinced we should ignore how this affects rcl timers performance if we consistently overshoot timer timeouts. Or the documentation should be explicit about timer periods being a rough lower bound.
On the other hand, we could adjust the tests (and our documentation) to reflect reality on Windows (if I understand the core issue correctly).
I agree with updating wait() APIs documentation accordingly. Specifying what clock that timeout is defined against would be good too.
I guess what I don't understand about this is that if you ask for 1 nanosecond sleep, then yeah, it might take the OS considerably longer than that to return (maybe on the order of 10ms for time slicing)
:+1:
But if you ask for 10ms sleep, the OS should never return in 9 ms; that's just not doing what the user application asked for.
:+1: I think this one is the problem we have to solve.
And I have the impression that may be the case in Windows under specific circumstances (see second paragraph in this MSVC documentation article).
It's not only that.
rmw_wait uses always the system clock (at least for connext), and timers can use any clock.
Thus, I think the logic in rcl to decide the rmw_wait duration isn't valid, and can be problematic in all platforms.
We might add a tolerance or not, but we do need some sort of "retry wait" logic in case rmw_wait waked up early.
Thus, I think the logic in rcl to decide the rmw_wait duration isn't valid, and can be problematic in all platforms.
There's that too, yeah. And waking up earlier won't do if the timer clock goes faster than the system's (typical in CI settings using simulation with an RTF > 1). This is a separate issue though.
But if you ask for 10ms sleep, the OS should never return in 9 ms; that's just not doing what the user application asked for.
I will say that a delay (or sleep) and a timeout are two conceptually different things. If you ask for a 10ms delay, then yes, you'd expect it to last 10ms at least. But if you set a 10ms timeout, I'd expect it to last 10ms at most. And thus RTI Connext's WaitSet::wait() behavior makes more sense to me than Fast-RTPS' or CycloneDDS' if and only if it's not possible to get any closer to the requested time due to clock resolution.
I will point out that that MSVC article says that starting on Windows 8, KeDelayExecutionThread is a lot more precise. It doesn't say anything about KeWaitForSingleObject or KeSetTimer, so I assume they still have the old behavior and can wake up early. Unfortunately, we don't really know which of the APIs Connext is using.
That's fair for
wait()APIs, but I'm not entirely convinced we should ignore how this affectsrcltimers performance if we consistently overshoot timer timeouts. Or the documentation should be explicit about timer periods being a rough lower bound.
To be fair, on a non-realtime OS, all timer periods are a rough lower bound. So I think that expectation is pretty well set, and I think we could go back to sleep in that case (note that in the above discussion, I've been ignoring signals on Unix anyway, which is another case that might cause a sleep to wake up early and have to go back to sleep). But I definitely agree with you that the documentation should be clear about this.
I will say that a delay (or sleep) and a timeout are two conceptually different things. If you ask for a 10ms delay, then yes, you'd expect it to last 10ms at least. But if you set a 10ms timeout, I'd expect it to last 10ms at most.
I don't think so. At least, that's not how I understand e.g. the timeout on select, the man page for which says:
The timeout argument specifies the interval that select() should block waiting for a file descriptor to become ready. The call will block until
either:
* a file descriptor becomes ready;
* the call is interrupted by a signal handler; or
* the timeout expires.
Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval
may overrun by a small amount. If both fields of the timeval structure are zero, then select() returns immediately. (This is useful for
polling.) If timeout is NULL (no timeout), select() can block indefinitely.
To be fair, on a non-realtime OS, all timer periods are a rough lower bound.
That is true. Not sure about embedded ROS though.
At least, that's not how I understand e.g. the timeout on
select
IMHO the choices that select() (or any other POSIX API) makes about how to deal with clock resolution is of no consequence. It does make it way easier to implement if the specified time is greater than 0 but less than one clock resolution unit. We can make the same choice, I just think we can do better.
I think resolving this issue could also resolve some rclpy related failures: https://github.com/ros2/rclpy/issues/267 https://github.com/ros2/rclpy/issues/513
Or at least they are related, the rclpy issues happen across multiple rmw implementations.
But if you set a 10ms timeout, I'd expect it to last 10ms at most.
Yeah, I can't agree with this either. For me a timeout is always "we'll start trying to exit after this period of time". In fact I'd never assume it comes back before or "on time".
But if you set a 10ms timeout, I'd expect it to last 10ms at most.
:+1: too
Oh well, I guess we agree to disagree.
In any case, we have an rmw_wait() that may return too early and an rcl_timer_t that depends on it not doing that. I'll try to come up with a patch to fix the latter. For the former, besides documenting expectations, maybe we can ask RTI for clarification on timeout behavior.
I think resolving this issue could also resolve some rclpy related failures: ros2/rclpy#267 ros2/rclpy#513
Or at least they are related, the rclpy issues happen across multiple rmw implementations.
AFAIR, those failures are due to the timer waking up too late (windows server scheduler time slice is much bigger than macOS and linux). They might also be related with this though.
@neil-rti maybe you can weigh in here.
With https://github.com/ros2/rmw/pull/275, rmw_wait() API documentation now mentions this little timing detail. I still have to benchmark modified timers in #707 though.
This one's still causing regressions (I think): https://ci.ros2.org/view/nightly/job/nightly_win_rel/lastCompletedBuild/testReport/projectroot/test/test_timer__rmw_connextdds/
@asorbini had some thoughts about this. I don't remember the exact details, but there was a mismatch on Windows between the clock source that Connext is using internally and the one that we are using in rcl. Maybe he can comment.
My guess was that these errors might have had something to do with the fact that rcutils uses GetSystemTimePreciseAsFileTime() while Connext uses _ftime(). I posited that the problem might be stemming from the different resolutions supported by these functions (1ns and 1ms, respectively). I'll paste here my comments from an email thread. I doubt there is actually any "rounding up" happening, but it might instead have something to do with a "timers might wake up up to one period early" issue on Windows, that I could swear I saw mentioned in some ros2 issue, but I can't find at the moment...
Anyway, I'm afraid I don't yet have a good answer to why this is actually happening.
my investigation on the failure in
TestTimerFixture.test_two_timers_ready_before_timeouton Windows leads me to confirm that the issue stems from the different clocks used by Connext and by rcl.The
rcl_wait()call in that test seems to sometimes wakeup at most 1ms earlier than expected, and this would be consistent (I think) with the fact that Connext uses a 1ms-resolution clock (provided by_ftime()), while rcl relies on a finer-grain clock (provided byGetSystemTimePreciseAsFileTime()). I haven't checked this assumption yet, but I think thatrclimplements timers by making the timeout passed tormw_wait()smaller than the next timer to trigger, which would be consistent with my theory: Connext is relying on its internal clock to detect the timeout, and ends up waking up too early, e.g. the clock tick used by Connext is "rounded up" to 10ms while the clock used by rcl still reports, say, 9500000ns.This would explain why I've seen the test fail in two different points:
- Sometimes here: https://github.com/ros2/rcl/blob/master/rcl/test/rcl/test_timer.cpp#L318 when checking if the timer expected to have triggered is actually ready.
- And sometimes here https://github.com/ros2/rcl/blob/master/rcl/test/rcl/test_timer.cpp#L322, which means that the timer had actually become ready by the time
rcl_timer_is_ready()was called, but it wasn't whenrmw_wait()returned and it found that the timer hadn't actually expired yet and thus set its pointer to NULL in the results (again, making some assumption here, because I haven't reviewed this code in rcl yet).While this investigation leads me to exclude problems in the RMW, it still brings up a concerning situation for Windows users of rmw_connextdds, which, as of now, risk seeing a wait on a timer return at most 1ms too early (this upper bound is still another guess of mine, and should be confirmed...). I think the bug could be avoided by coding the application "defensively" upon return from
rcl_wait(), to check that the timer is actually ready, and then callingrcl_wait()again if it isn't. This would of course be an unfortunate burden on user, and I'm not even sure if it's possible from other language bindings (e.g.rclcpp, andrclpy).
Note that once ros2/rmw_connextdds#22 is merged, the default implementation of WaitSets for rmw_connextdds will not rely on DDS objects anymore, so we might see changes (hopefully positive) with these test failures.
EDIT: maybe I was thinking of this Microsoft page:
If a timer interval, or relative expiration time, is instead specified, the expiration can occur up to a period earlier than or a period later than the specified time, depending on where exactly the start and end times of this interval fall between system clock ticks.
Thanks for the follow-up!