rosbag2 icon indicating copy to clipboard operation
rosbag2 copied to clipboard

👩‍🌾 PlaySrvsTest.set_rate_good_values flaky timeout

Open chapulina opened this issue 3 years ago • 2 comments

Description

The set_rate_good_values test has been flaky on the nightly jobs, see history for test_play_services__rmw_fastrtps_cpp :

https://ci.ros2.org/job/nightly_linux_repeated/lastSuccessfulBuild/testReport/(root)/projectroot/test_play_services__rmw_fastrtps_cpp/history/

Expected Behavior

The test passes consistently.

Actual Behavior

The test is flaky, failing with a timeout after 60 s:

08:29:58 31: Test timeout computed to be: 60
08:29:58 31: -- run_test.py: extra environment variables:
08:29:58 31:  - RMW_IMPLEMENTATION=rmw_fastrtps_cpp
08:29:58 31: -- run_test.py: invoking following command in '/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport':
08:29:58 31:  - /home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/test_play_services__rmw_fastrtps_cpp --gtest_output=xml:/home/jenkins-agent/workspace/nightly_linux_repeated/ws/build/rosbag2_transport/test_results/rosbag2_transport/test_play_services__rmw_fastrtps_cpp.gtest.xml
08:29:58 31: Running main() from gmock_main.cc
08:29:58 31: [==========] Running 5 tests from 1 test suite.
08:29:58 31: [----------] Global test environment set-up.
08:29:58 31: [----------] 5 tests from PlaySrvsTest
08:29:58 31: [ RUN      ] PlaySrvsTest.pause_resume
08:29:59 31: [       OK ] PlaySrvsTest.pause_resume (566 ms)
08:29:59 31: [ RUN      ] PlaySrvsTest.toggle_paused
08:29:59 31: [       OK ] PlaySrvsTest.toggle_paused (499 ms)
08:29:59 31: [ RUN      ] PlaySrvsTest.set_rate_good_values
08:30:58     Test #31: test_play_services__rmw_fastrtps_cpp ...............***Timeout  60.01 sec

To Reproduce

Run nightly CI.

System (please complete the following information)

  • OS: Ubuntu Focal
  • ROS 2 Distro: Rolling
  • Version: master

Additional context

That test makes 4 successful_call calls:

https://github.com/ros2/rosbag2/blob/44e6422f40d424ebf3e741af058192e0b93b925a/rosbag2_transport/test/rosbag2_transport/test_play_services.cpp#L279-L297

Each should have a 1s timeout:

https://github.com/ros2/rosbag2/blob/44e6422f40d424ebf3e741af058192e0b93b925a/rosbag2_transport/test/rosbag2_transport/test_play_services.cpp#L205


https://github.com/osrf/buildfarmer/issues/224

chapulina avatar Aug 23 '21 21:08 chapulina

I'm trying reproduct the bug in local, but I failed, is there any way to produce it in local?

llapx avatar Sep 01 '21 02:09 llapx

Assigning this to myself and picking up to investigate. Note @llapx these can be hard to reproduce - you can try to use the settings from the repeated build above

test_args: --event-handlers console_direct+ --executor sequential --retest-until-fail 2 --ctest-args -LE "(linter|xfail)" --pytest-args -m "not linter and not xfail"

Additionally I often try to run the gtests directly with repeat, e.g.

source install/setup.bash
cd build/rosbag2_transport
RMW_IMPLEMENTATION=rmw_fastrtps_cpp ./test_play_services__rmw_fastrtps_cpp --gtest_repeat=-1 --gtest_break_on_failure

That will run the test infinitely until it fails. While this is happening you can try to, perhaps, stress the CPU by using stress utility to see if it only happens in low-resource scenarios, for example.

Hopefully that helps, I am hoping that will reproduce the problem for me locally - otherwise I'll have to think of a next step.

emersonknapp avatar Sep 02 '21 18:09 emersonknapp

  • Closing as duplicate for the #862 since there more context and preliminary analysis of the issue.

MichaelOrlov avatar Feb 01 '23 05:02 MichaelOrlov