message_filters
message_filters copied to clipboard
TimeSequencer.simple test hanging in CI
One of our CI jobs just failed because the TimeSequencer.simple unit test hung:
Start 4: message_filters-test_time_sequencer
4: Test command: /usr/bin/python3 "-u" "opt_path/share/ament_cmake_test/cmake/run_test.py" "ws_path/build/message_filters/test_results/message_filters/message_filters-test_time_sequencer.gtest.xml" "--package-name" "message_filters" "--output-file" "ws_path/build/message_filters/ament_cmake_gtest/message_filters-test_time_sequencer.txt" "--command" "ws_path/build/message_filters/message_filters-test_time_sequencer" "--gtest_output=xml:ws_path/build/message_filters/test_results/message_filters/message_filters-test_time_sequencer.gtest.xml"
4: Test timeout computed to be: 60
4: -- run_test.py: invoking following command in 'ws_path/src/path/message_filters':
4: - ws_path/build/message_filters/message_filters-test_time_sequencer --gtest_output=xml:ws_path/build/message_filters/test_results/message_filters/message_filters-test_time_sequencer.gtest.xml
4: [==========] Running 3 tests from 1 test case.
4: [----------] Global test environment set-up.
4: [----------] 3 tests from TimeSequencer
4: [ RUN ] TimeSequencer.simple
4/13 Test #4: message_filters-test_time_sequencer ............***Timeout 60.01 sec
A passing run in our CI takes about half a second to run, so this isn't a case of the test just taking a little longer than normal.
It's almost certainly getting stuck in rclcpp::spin_some(node)
on line 97 or a few lines later. The queue of things to handle is either building up faster than spin_some can process, or some work item in the queue is adding a work item to the queue so spin_some never finishes.
I need to dive into this a little more to see what's being tested, but maybe we could spin until something specific happens, or spin until future complete? If someone reads this and has a suggestion, please chime in
It looks like we've got message_filters.message_filters-test_fuzz
marked as problematic too, possibly for the same reason
happens in rolling, output in in below.
zs@zs-vm-2204:~/ga_ws/ga_ros/colcon/build/message_filters$ ./message_filters-test_fuzz
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from TimeSequencer
[ RUN ] TimeSequencer.fuzz_sequencer
unknown file: Failure
C++ exception with description "can't compare times with different time sources" thrown in the test body.
[ FAILED ] TimeSequencer.fuzz_sequencer (26 ms)
[----------] 1 test from TimeSequencer (26 ms total)
[----------] 1 test from TimeSynchronizer
[ RUN ] TimeSynchronizer.fuzz_synchronizer
[ OK ] TimeSynchronizer.fuzz_synchronizer (5013 ms)
[----------] 1 test from TimeSynchronizer (5013 ms total)
[----------] 1 test from Subscriber
[ RUN ] Subscriber.fuzz_subscriber
[ OK ] Subscriber.fuzz_subscriber (5022 ms)
[----------] 1 test from Subscriber (5022 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 3 test suites ran. (10061 ms total)
[ PASSED ] 2 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] TimeSequencer.fuzz_sequencer
1 FAILED TEST
After migrating from galactic to humble I encountered the same exception when using TimeSequencer
. "can't compare times with different time sources"
There are two problematic comparisons, a first here when add
ing any message
if (mt::TimeStamp<M>::value(*evt.getMessage()) < last_time_)
and a second here during dispatch
.
if ((stamp + delay_) <= rclcpp::Clock().now())
It seems that the problem was introduced by fixing issue #72. This changes the rcl_clock_type_t
returned by all message_filters::message_traits::TimeStamp<M>::value()
for M
essages with a Header to RCL_ROS_TIME
.
template<typename M>
struct TimeStamp<M, typename std::enable_if<HasHeader<M>::value>::type >
{
static rclcpp::Time value(const M& m) {
return rclcpp::Time(m.header.stamp, RCL_ROS_TIME);
}
};
TimeSequencer
internally uses a member last_time_
which is never explicitly initialized. Based on time.hpp (checked for humble), this should be the relevant constructor in that case, which sets rcl_clock_type_t
to RCL_SYSTEM_TIME
.
explicit Time(int64_t nanoseconds = 0, rcl_clock_type_t clock_type = RCL_SYSTEM_TIME);
The comparison between last_time_
and the result of message_filters::message_traits::TimeStamp<M>::value()
leads to the first exception.
In dispatch
the result of message_filters::message_traits::TimeStamp<M>::value()
is compared with rclcpp::Clock().now()
. Based on clock.hpp (checked for humble) this should be the relevant constructor that also sets rcl_clock_type_t
to RCL_SYSTEM_TIME
.
explicit Clock(rcl_clock_type_t clock_type = RCL_SYSTEM_TIME);
Sorry for the long comment, but here are some more thoughts:
In message_traits.h there is the template definition for getting the value of the messages header stamp (changed in #72), but the SFINAE that's there as fallback for messages without header would return time with a different rcl_clock_type_t
. IMHO both these functions should return the same rcl_clock_type_t
.
TimeSequencer
already holds a rclcpp::Node
, which could be used to get the clock and clock type. (I.e. node_->get_clock()->now()
over rclcpp::Clock().now()
)
The unit tests in "time_sequencer_unittest.cpp" aren't affected by this problem as the test defines a custom struct Msg
and provides a new template definition for getting the stamp, which uses RCL_SYSTEM_TIME
. Thus within these unit tests everything currently uses RCL_SYSTEM_TIME
.
I'm happy to follow up on this if my descriptions aren't clear enough. Is it fine to handle these problems here or should I open a new issue for that?