
CI test regression: test_play_services__rmw_fastrtps_cpp

Open MichaelOrlov opened this issue 1 year ago • 13 comments

Description

The test_play_services__rmw_fastrtps_cpp test has started to fail periodically with a new error message and failure scenario. The failure was last seen in Rdev__rosbag2__ubuntu_jammy_amd64 job#116. The same failure was first observed in Rdev__rosbag2__ubuntu_jammy_amd64 job#110, dated March 10, 2023.

There are new error messages in PlaySrvsTest.play_next that were not observed before:

>>> [rcutils|error_handling.c:108] rcutils_set_error_state()
This error state is being overwritten:

  'Handle's typesupport identifier (rosidl_typesupport_cpp) is not supported by this library, at ./src/type_support_dispatch.hpp:111'

with this new error message:

  'publisher's context is invalid, at ./src/rcl/publisher.c:402'

rcutils_reset_error() should be called after error handling to avoid this.
<<<
[ERROR] [1681669769.002323231] [rosbag2_player_for_test_srvs]: Failed to publish message on '/player_srvs_test_topic' topic. 
Error: failed to publish loaned message: publisher's context is invalid, at ./src/rcl/publisher.c:402
[ERROR] [1681669769.008321010] [rosbag2_player_for_test_srvs.rclcpp]: Error in destruction of rcl publisher handle: Failed to delete datawriter, at ./src/publisher.cpp:45, at ./src/rcl/publisher.c:187
Failed to delete dds publisher from participantFailed to delete participant[       OK ] PlaySrvsTest.play_next (477 ms)

The tests in the test suite itself do not fail, and two more tests pass successfully after PlaySrvsTest.play_next. However, there is a segmentation fault when gtest finishes executing the test_play_services__rmw_fastrtps_cpp job, with the following output:

[----------] Global test environment tear-down
[==========] 8 tests from 1 test suite ran. (2653 ms total)
[  PASSED  ] 8 tests.
-- run_test.py: return code -11

Expected Behavior

The test_play_services__rmw_fastrtps_cpp test shall finish cleanly, without a segmentation fault during destruction.

Actual Behavior

It looks like we are getting a segmentation fault when the process is finishing, most likely during destruction of some static object.

To Reproduce

I was able to reproduce this failure locally by heavily loading my machine during the test and repeating the test 50 times. I used the stress -m 16 -c 16 -i 16 command.

  1. checkout Rolling branch
  2. colcon build --packages-up-to rosbag2_transport
  3. colcon test --packages-select rosbag2_transport --ctest-args -R PlaySrvsTest.play_next --retest-until-fail 50 (It still needs to be clarified whether it is possible to run only the PlaySrvsTest.play_next test repeatedly this way; I was running the test from CLion, which has such an option. Running the test binary directly with gtest's --gtest_filter=PlaySrvsTest.play_next and --gtest_repeat=50 options may also work.)
  4. In another terminal window, run stress -m 16 -c 16 -i 16, where 16 is the number of CPU cores on the machine.

System (please complete the following information)

  • OS: Ubuntu Jammy
  • ROS 2 Distro: Rolling
  • Version: https://build.ros2.org/job/Rdev__rosbag2__ubuntu_jammy_amd64/116/testReport/junit/(root)/projectroot/test_play_services__rmw_fastrtps_cpp/


MichaelOrlov avatar Apr 29 '23 19:04 MichaelOrlov

Preliminary analysis

We are getting a segmentation fault when invoking the destructor of eprosima::fastdds::dds::DomainParticipantFactory, which is designed as a singleton held in a static shared pointer. This explains why the segmentation fault occurs only when all gtest tests are exiting from the test suite's process.

According to the call stack, the segmentation fault happens in the eprosima::fastrtps::rtps::TopicPayloadPool::shrink() method at line 273: https://github.com/eProsima/Fast-DDS/blob/7606b96087a8bcfa7d8fe4434cd5d2d253607dfc/src/cpp/rtps/history/TopicPayloadPool.cpp#L273
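
To illustrate why this only shows up at process exit, here is a minimal, purely illustrative C++ sketch (not Fast-DDS code, just the same ownership pattern): a singleton kept in a static shared_ptr is destroyed only during static/atexit teardown, i.e. after main() and therefore after every gtest case has already finished.

#include <cstdio>
#include <memory>

struct Factory  // stand-in for DomainParticipantFactory, purely illustrative
{
  ~Factory() { std::puts("factory destroyed during process teardown"); }

  static std::shared_ptr<Factory> get_instance()
  {
    // The static shared_ptr keeps the singleton alive until static/atexit
    // destruction, long after all tests have returned.
    static std::shared_ptr<Factory> instance(new Factory());
    return instance;
  }
};

int main()
{
  auto factory = Factory::get_instance();
  std::puts("main (and every test case) is done");
  return 0;  // ~Factory runs only after this point
}

If anything the singleton's destructor touches has already been invalidated by then, the crash is reported against process teardown rather than against a particular test.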

The following error, reported earlier in the PlaySrvsTest.play_next test, is directly related to this segmentation fault:

[ERROR] [1681669769.002323231] [rosbag2_player_for_test_srvs]: Failed to publish message on '/player_srvs_test_topic' topic. 
Error: failed to publish loaned message: publisher's context is invalid, at ./src/rcl/publisher.c:402
[ERROR] [1681669769.008321010] [rosbag2_player_for_test_srvs.rclcpp]: Error in destruction of rcl publisher handle: Failed to delete datawriter, at ./src/publisher.cpp:45, at ./src/rcl/publisher.c:187
Failed to delete dds publisher from participantFailed to delete participant[       OK ] PlaySrvsTest.play_next (477 ms)

It seems some unexpected failure happens on the Fast-DDS side when we try to publish a message after the context has been deleted. Other DDS implementations handle this situation gracefully, and Fast-DDS used to handle it properly as well, but currently it causes a segfault on process exit.

Some details and specifics about the PlaySrvsTest.play_next test:

https://github.com/ros2/rosbag2/blob/6f373bbaf80ef9eeb3698caabab373627f2f2bb8/rosbag2_transport/test/rosbag2_transport/test_play_services.cpp#L385-L414 In this test we start the rosbag2 player in pause mode, then call the play_next service and wait for the response to validate reception. We repeat this service call and response 200 times. This part of the test runs correctly, without any errors. After that we switch the rosbag2 player into continuous playback mode with repeat and check that a call to service_call_play_next(); returns false, since the player is no longer paused. That is all for this test, and this part also runs without issues.

The interesting part starts when we destruct the ~PlaySrvsTest() test fixture https://github.com/ros2/rosbag2/blob/6f373bbaf80ef9eeb3698caabab373627f2f2bb8/rosbag2_transport/test/rosbag2_transport/test_play_services.cpp#L66-L72 As you can see, we basically call rclcpp::shutdown(); to shut down any publishers and subscriptions and let other active threads exit. At the same time, our player is still in playback mode, publishing messages every 20 ms. This is why we get the error message that publish(message) was called with an invalid context.

In this particular scenario we could work around the problem in the test by stopping the player before calling rclcpp::shutdown(); (see the sketch below). However, this is an integration test, and calling rclcpp::shutdown(); somewhere in the middle of running nodes is common practice in the ROS community: it is how we stop or interrupt all nodes on a SIGINT signal. It would be more appropriate for such a scenario to be handled gracefully on the Fast-DDS side rather than causing a segmentation fault on exit.
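
For reference, here is a self-contained sketch of the ordering that would avoid the race, assuming the fixture can stop and join the player's publishing thread before tearing down the context. It is not the actual fixture code; the node name, topic name, and variable names are made up for the illustration.

#include <atomic>
#include <chrono>
#include <thread>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("ordering_demo");
  auto pub = node->create_publisher<std_msgs::msg::String>("demo_topic", 10);

  std::atomic<bool> keep_publishing{true};
  std::thread publisher_thread([&]() {
      std_msgs::msg::String msg;
      msg.data = "hello";
      // Mimics the player's periodic publish loop (every 20 ms).
      while (keep_publishing && rclcpp::ok()) {
        pub->publish(msg);
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
      }
    });

  std::this_thread::sleep_for(std::chrono::milliseconds(200));

  keep_publishing = false;   // 1. ask the publishing loop to stop
  publisher_thread.join();   // 2. make sure no publish() call is still in flight
  rclcpp::shutdown();        // 3. only now invalidate the context

  return 0;
}

The point of the ordering is simply that no publish() can execute concurrently with, or after, the context being invalidated by rclcpp::shutdown().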

Some details about the segmentation fault from the debugger

Call stack for Thread-1
Thread 1 "test_play_servi" received signal SIGSEGV, Segmentation fault.
eprosima::fastrtps::rtps::TopicPayloadPool::shrink (this=this@entry=0x5603b9fd5940, max_num_payloads=0) at /home/morlov/ros2_rolling_new/src/eProsima/Fast-DDS/src/cpp/rtps/history/TopicPayloadPool.cpp:273
273	        all_payloads_.at(payload->data_index()) = all_payloads_.back();
[libfastrtps.so.2.10] eprosima::fastrtps::rtps::TopicPayloadPool::shrink TopicPayloadPool.cpp:273
[libfastrtps.so.2.10] eprosima::fastrtps::rtps::TopicPayloadPool::release_history TopicPayloadPool.cpp:174
[libfastrtps.so.2.10] eprosima::fastdds::dds::DataWriterImpl::release_payload_pool DataWriterImpl.cpp:1950
[libfastrtps.so.2.10] eprosima::fastdds::dds::DataWriterImpl::~DataWriterImpl DataWriterImpl.cpp:440
[libfastrtps.so.2.10] eprosima::fastdds::statistics::dds::DataWriterImpl::~DataWriterImpl DataWriterImpl.hpp:43
[libfastrtps.so.2.10] eprosima::fastdds::statistics::dds::DataWriterImpl::~DataWriterImpl DataWriterImpl.hpp:43
[libfastrtps.so.2.10] eprosima::fastdds::dds::PublisherImpl::~PublisherImpl PublisherImpl.cpp:111
[libfastrtps.so.2.10] eprosima::fastdds::statistics::dds::PublisherImpl::~PublisherImpl PublisherImpl.hpp:40
[libfastrtps.so.2.10] eprosima::fastdds::statistics::dds::PublisherImpl::~PublisherImpl PublisherImpl.hpp:40
[libfastrtps.so.2.10] eprosima::fastdds::dds::DomainParticipantImpl::~DomainParticipantImpl DomainParticipantImpl.cpp:201
[libfastrtps.so.2.10] eprosima::fastdds::statistics::dds::DomainParticipantImpl::~DomainParticipantImpl DomainParticipantImpl.hpp:57
[libfastrtps.so.2.10] eprosima::fastdds::statistics::dds::DomainParticipantImpl::~DomainParticipantImpl DomainParticipantImpl.hpp:57
[libfastrtps.so.2.10] eprosima::fastdds::dds::DomainParticipantFactory::~DomainParticipantFactory DomainParticipantFactory.cpp:69
[libfastrtps.so.2.10] eprosima::fastdds::dds::DomainParticipantFactory::~DomainParticipantFactory DomainParticipantFactory.cpp:82
[libfastrtps.so.2.10] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release shared_ptr_base.h:155
[libfastrtps.so.2.10] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release shared_ptr_base.h:148
[libfastrtps.so.2.10] std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count shared_ptr_base.h:730
[libfastrtps.so.2.10] std::__shared_ptr<eprosima::fastdds::dds::DomainParticipantFactory, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr shared_ptr_base.h:1169
[libfastrtps.so.2.10] std::shared_ptr<eprosima::fastdds::dds::DomainParticipantFactory>::~shared_ptr shared_ptr.h:103
[libc.so.6] <unknown> 0x00007ff43eeab8a7
[libc.so.6] exit 0x00007ff43eeaba60
[libc.so.6] __libc_start_main 0x00007ff43ee8908a
[test_play_services__rmw_fastrtps_cpp] _start 0x00005603b8808c2e
Variable values when the segmentation fault happened:
Signal = SIGSEGV (Segmentation fault)
payload = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x810 
buffer = {eprosima::fastrtps::rtps::octet *} 
*buffer = {eprosima::fastrtps::rtps::octet} 
data_offset = {const size_t} 12
this = {eprosima::fastrtps::rtps::PreallocatedTopicPayloadPool *const} 0x5603b9fd5940 
eprosima::fastrtps::rtps::TopicPayloadPool = {eprosima::fastrtps::rtps::TopicPayloadPool} 
eprosima::fastrtps::rtps::ITopicPayloadPool = {eprosima::fastrtps::rtps::ITopicPayloadPool} 
max_pool_size_ = {uint32_t} 0
infinite_histories_count_ = {uint32_t} 0
finite_max_pool_size_ = {uint32_t} 0
free_payloads_ = {std::vector<eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *>} 
all_payloads_ = {std::vector<eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *>} 
 [0] = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x5603baacec70 
 [1] = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x5603bac9a460 
 [2] = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x5603ba6df240 
 [3] = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x5603baee7b10 
 [4] = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x5603b9f3cf10 
 [5] = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x5603ba667be0 
mutex_ = {std::mutex} 
 std::__mutex_base = {std::__mutex_base} 
  _M_mutex = {std::__mutex_base::__native_type} 
   __data = {__pthread_mutex_s} 
    __lock = {int} 1
    __count = {unsigned int} 0
    __owner = {int} 18782
    __nusers = {unsigned int} 1
    __kind = {int} 0
    __spins = {short} 0
    __elision = {short} 0
    __list = {__pthread_list_t} 
     __prev = {__pthread_internal_list *} NULL
     __next = {__pthread_internal_list *} NULL
   __size = {char [40]} 
   __align = {long} 1
payload_size_ = {uint32_t} 52
minimum_pool_size_ = {uint32_t} 0
max_num_payloads = {uint32_t} 0

Stack tree and line of code where the segmentation fault happened (screenshot attached in the original issue).

@clalancette Could you please help forward this issue to engineers from eProsima who can elaborate further on it?

MichaelOrlov avatar Apr 30 '23 06:04 MichaelOrlov

@EduPonz @MiguelCompany can you take a look?

clalancette avatar Apr 30 '23 16:04 clalancette

@Mario-DL will take a look at this. Thanks for the very thorough report @MichaelOrlov!

EduPonz avatar May 03 '23 06:05 EduPonz

Hello everyone !

Thanks @MichaelOrlov for the detailed report. I could reproduce the exact rcl error message and the subsequent TopicPayloadPool::shrink segfault you mentioned. I will investigate further and come back to you ASAP.

Mario-DL avatar May 05 '23 13:05 Mario-DL

Hi @MichaelOrlov

I found that the segfault (failing assertion) in TopicPayloadPool::shrink was introduced in ros2/rmw_fastrtps/568.

We will elaborate further on that, because the problem can stem from the way loans are managed either in the rcl or internally in Fast DDS.

Mario-DL avatar May 12 '23 13:05 Mario-DL

@Mario-DL Thanks for finding the breaking PR. Looking forward to seeing the fixes. It would be nice to have a fix or some workaround before the end of this Friday, May 19, to jump "on the train" for the Iron final release. cc: @clalancette

MichaelOrlov avatar May 16 '23 18:05 MichaelOrlov

Just to set expectations: unless this turns out to be a major bug, this is probably not going to make the Iron release. We only have about 48 hours in which to get anything new in, and changes to Fast-DDS or rmw_fastrtps (the default RMW vendor) would be pretty dangerous at this point.

clalancette avatar May 16 '23 18:05 clalancette

Thanks @clalancette for the clarification.

I have been investigating further; the findings and error characterization follow. The checks were basically based on repeating the PlaySrvsTest.play_next test multiple times (~50 per run).

ERR:01: failed to borrow loaned msg: publisher's context is invalid

This error is occasionally thrown by the rcl. The cause is calling rclcpp::shutdown() before stopping the player's thread. The instance_id_storage of the context is set to 0 by the rclcpp::shutdown() thread, so in the player's thread rcl_publisher_is_valid() returns false, leading to the error.

This error log is inevitable as long as ~PlaySrvsTest() is not corrected, i.e. as long as rclcpp::shutdown() is not called after stopping and joining the player's thread.

ERR:02: Error in destruction of rcl publisher handle: Failed to delete datawriter

This error is sometimes thrown alongside the previous one. If rclcpp::shutdown() is called after the player's thread has retrieved the loan, destroy_publisher() tries to delete_datawriter(), but, as there are loans in use, it returns ReturnCode_t::RETCODE_PRECONDITION_NOT_MET, triggering the error.

ERR:03: TopicPayloadPool::shrink() assertion failing

This assertion is triggered only after running all the repetitions, when the test is being destroyed and, consequently, the DomainParticipantFactory singleton destructor is called. This destructor removes all the created entities. When trying to delete the player's datawriter, the TopicPayloadPool (the buffer for all the serialized messages) cannot be released, since it detects that there is a loan that was not returned to the pool, and it hits the assertion.

The working principle of loans is that, once a sample loan is borrowed by an external application or actor (in this case, the rcl), it is that application's responsibility to return the loan for that message, so that the resource can be freed. In this case, the correct way to proceed would be for the rcl to store the loans it has borrowed (maybe inside each rcl_publisher_impl_s struct?) in order to return them correctly when required, and to check that all loans of a loan-capable entity have been returned before trying to delete it. That would deal with errors ERR:01 and ERR:02.
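
For context, this is roughly what the loan lifecycle looks like from the application side in rclcpp, assuming a plain-old-data message type and an RMW that supports message loaning. It is only a sketch of the borrow/return contract described above; the node and topic names are made up.

#include <utility>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/float64.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("loan_demo");
  auto pub = node->create_publisher<std_msgs::msg::Float64>("loan_demo_topic", 10);

  if (pub->can_loan_messages()) {
    // borrow_loaned_message() asks the middleware for a sample from its pool.
    auto loaned = pub->borrow_loaned_message();
    loaned.get().data = 42.0;
    // Publishing hands the sample back to the middleware, which is what
    // returns the loan; if the LoanedMessage were dropped unpublished, its
    // destructor would be responsible for returning the loan instead.
    pub->publish(std::move(loaned));
  }

  rclcpp::shutdown();
  return 0;
}

The failure mode discussed in this issue corresponds to a borrowed sample that never makes it back to the pool before the entities holding the pool are destroyed.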

How do you see it ?

Mario-DL avatar May 17 '23 15:05 Mario-DL

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2023-05-18/31587/1

ros-discourse avatar May 24 '23 15:05 ros-discourse

@clalancette @wjwwood I need your help with a decision on this issue regarding how to proceed with the fix. In my opinion, the fix should be done on the Fast DDS side, and here is the reasoning:

  1. Fast DDS shouldn't cause a segfault on domain participant destruction under any circumstances.
  2. Other DDS implementations, like Cyclone DDS and Connext DDS, handle this situation gracefully.
  3. If we are deleting a domain participant, we imply that memory loaned from it is invalidated and will not be used anymore, which seems a reasonable assumption in the general case.

MichaelOrlov avatar Jun 01 '23 15:06 MichaelOrlov

@clalancette @wjwwood Friendly ping here. ^^^^ Need your decision/opinion on this task to be able to move forward with it.

MichaelOrlov avatar Jun 12 '23 23:06 MichaelOrlov

@clalancette @wjwwood Friendly ping here. ^^^^

MichaelOrlov avatar Jul 11 '23 01:07 MichaelOrlov

@clalancette Friendly ping here. ^^^

MichaelOrlov avatar Jul 19 '23 20:07 MichaelOrlov

  • Closing this issue since a workaround was implemented in https://github.com/ros2/rosbag2/pull/1556.

MichaelOrlov avatar May 05 '24 03:05 MichaelOrlov