message_filters icon indicating copy to clipboard operation
message_filters copied to clipboard

TimeSequencer.simple test hanging in CI

Open pbaughman opened this issue 5 years ago • 3 comments

One of our CI jobs just failed because the TimeSequencer.simple unit test hung:

      Start  4: message_filters-test_time_sequencer

4: Test command: /usr/bin/python3 "-u" "opt_path/share/ament_cmake_test/cmake/run_test.py" "ws_path/build/message_filters/test_results/message_filters/message_filters-test_time_sequencer.gtest.xml" "--package-name" "message_filters" "--output-file" "ws_path/build/message_filters/ament_cmake_gtest/message_filters-test_time_sequencer.txt" "--command" "ws_path/build/message_filters/message_filters-test_time_sequencer" "--gtest_output=xml:ws_path/build/message_filters/test_results/message_filters/message_filters-test_time_sequencer.gtest.xml"
4: Test timeout computed to be: 60
4: -- run_test.py: invoking following command in 'ws_path/src/path/message_filters':
4:  - ws_path/build/message_filters/message_filters-test_time_sequencer --gtest_output=xml:ws_path/build/message_filters/test_results/message_filters/message_filters-test_time_sequencer.gtest.xml
4: [==========] Running 3 tests from 1 test case.
4: [----------] Global test environment set-up.
4: [----------] 3 tests from TimeSequencer
4: [ RUN      ] TimeSequencer.simple
 4/13 Test  #4: message_filters-test_time_sequencer ............***Timeout  60.01 sec

A passing run in our CI takes about half a second to run, so this isn't a case of the test just taking a little longer than normal.

It's almost certainly getting stuck in rclcpp::spin_some(node) on line 97 or a few lines later. The queue of things to handle is either building up faster than spin_some can process, or some work item in the queue is adding a work item to the queue so spin_some never finishes.

I need to dive into this a little more to see what's being tested, but maybe we could spin until something specific happens, or spin until future complete? If someone reads this and has a suggestion, please chime in

pbaughman avatar Apr 11 '19 15:04 pbaughman

It looks like we've got message_filters.message_filters-test_fuzz marked as problematic too, possibly for the same reason

pbaughman avatar Apr 11 '19 18:04 pbaughman

happens in rolling, output in in below.

zs@zs-vm-2204:~/ga_ws/ga_ros/colcon/build/message_filters$ ./message_filters-test_fuzz 
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from TimeSequencer
[ RUN      ] TimeSequencer.fuzz_sequencer
unknown file: Failure
C++ exception with description "can't compare times with different time sources" thrown in the test body.
[  FAILED  ] TimeSequencer.fuzz_sequencer (26 ms)
[----------] 1 test from TimeSequencer (26 ms total)

[----------] 1 test from TimeSynchronizer
[ RUN      ] TimeSynchronizer.fuzz_synchronizer
[       OK ] TimeSynchronizer.fuzz_synchronizer (5013 ms)
[----------] 1 test from TimeSynchronizer (5013 ms total)

[----------] 1 test from Subscriber
[ RUN      ] Subscriber.fuzz_subscriber
[       OK ] Subscriber.fuzz_subscriber (5022 ms)
[----------] 1 test from Subscriber (5022 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 3 test suites ran. (10061 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TimeSequencer.fuzz_sequencer

 1 FAILED TEST

ZhenshengLee avatar Jan 12 '24 07:01 ZhenshengLee

After migrating from galactic to humble I encountered the same exception when using TimeSequencer. "can't compare times with different time sources"

There are two problematic comparisons, a first here when adding any message

if (mt::TimeStamp<M>::value(*evt.getMessage()) < last_time_)

and a second here during dispatch.

if ((stamp + delay_) <= rclcpp::Clock().now())

It seems that the problem was introduced by fixing issue #72. This changes the rcl_clock_type_t returned by all message_filters::message_traits::TimeStamp<M>::value() for Messages with a Header to RCL_ROS_TIME.

template<typename M>
struct TimeStamp<M, typename std::enable_if<HasHeader<M>::value>::type >
{
  static rclcpp::Time value(const M& m) {
    return rclcpp::Time(m.header.stamp, RCL_ROS_TIME);
  }
};

TimeSequencer internally uses a member last_time_ which is never explicitly initialized. Based on time.hpp (checked for humble), this should be the relevant constructor in that case, which sets rcl_clock_type_t to RCL_SYSTEM_TIME.

explicit Time(int64_t nanoseconds = 0, rcl_clock_type_t clock_type = RCL_SYSTEM_TIME);

The comparison between last_time_ and the result of message_filters::message_traits::TimeStamp<M>::value() leads to the first exception.

In dispatch the result of message_filters::message_traits::TimeStamp<M>::value() is compared with rclcpp::Clock().now(). Based on clock.hpp (checked for humble) this should be the relevant constructor that also sets rcl_clock_type_t to RCL_SYSTEM_TIME.

explicit Clock(rcl_clock_type_t clock_type = RCL_SYSTEM_TIME);

Sorry for the long comment, but here are some more thoughts:

In message_traits.h there is the template definition for getting the value of the messages header stamp (changed in #72), but the SFINAE that's there as fallback for messages without header would return time with a different rcl_clock_type_t. IMHO both these functions should return the same rcl_clock_type_t.

TimeSequencer already holds a rclcpp::Node, which could be used to get the clock and clock type. (I.e. node_->get_clock()->now() over rclcpp::Clock().now())

The unit tests in "time_sequencer_unittest.cpp" aren't affected by this problem as the test defines a custom struct Msg and provides a new template definition for getting the stamp, which uses RCL_SYSTEM_TIME. Thus within these unit tests everything currently uses RCL_SYSTEM_TIME.

I'm happy to follow up on this if my descriptions aren't clear enough. Is it fine to handle these problems here or should I open a new issue for that?

wedenikf avatar Feb 21 '24 10:02 wedenikf