rosbag2
CI test regression: test_play_services__rmw_fastrtps_cpp
Description
The test_play_services__rmw_fastrtps_cpp test has started to fail periodically with a new error message and failure scenario. The failure was last seen in Rdev__rosbag2__ubuntu_jammy_amd64 job #116. It was first observed in Rdev__rosbag2__ubuntu_jammy_amd64 job #110, dated March 10, 2023.
There are new error messages in PlaySrvsTest.play_next that were not observed before:
>>> [rcutils|error_handling.c:108] rcutils_set_error_state()
This error state is being overwritten:
'Handle's typesupport identifier (rosidl_typesupport_cpp) is not supported by this library, at ./src/type_support_dispatch.hpp:111'
with this new error message:
'publisher's context is invalid, at ./src/rcl/publisher.c:402'
rcutils_reset_error() should be called after error handling to avoid this.
<<<
[ERROR] [1681669769.002323231] [rosbag2_player_for_test_srvs]: Failed to publish message on '/player_srvs_test_topic' topic.
Error: failed to publish loaned message: publisher's context is invalid, at ./src/rcl/publisher.c:402
[ERROR] [1681669769.008321010] [rosbag2_player_for_test_srvs.rclcpp]: Error in destruction of rcl publisher handle: Failed to delete datawriter, at ./src/publisher.cpp:45, at ./src/rcl/publisher.c:187
Failed to delete dds publisher from participantFailed to delete participant[ OK ] PlaySrvsTest.play_next (477 ms)
The tests in the test suite itself do not fail, and two more tests pass successfully after PlaySrvsTest.play_next. However, there is a segmentation fault when gtest finishes executing the test_play_services__rmw_fastrtps_cpp job, with the error message:
[----------] Global test environment tear-down
[==========] 8 tests from 1 test suite ran. (2653 ms total)
[ PASSED ] 8 tests.
-- run_test.py: return code -11
Expected Behavior
The test_play_services__rmw_fastrtps_cpp test shall finish cleanly, without a segmentation fault on destruction.
Actual Behavior
It looks like we are getting a segmentation fault when the process finishes, likely during the destruction of some static object.
To Reproduce
I was able to reproduce this failure locally by heavily loading my machine during the test and repeating the test 50 times.
I used the stress -m 16 -c 16 -i 16 command.
- Check out the Rolling branch
- colcon build --packages-up-to rosbag2_transport
- colcon test --packages-select rosbag2_transport --ctest-args -R PlaySrvsTest.play_next --retest-until-fail 50
  (Need to clarify whether it is possible to run only the PlaySrvsTest.play_next test repeatedly; I was running the tests from CLion, which has such an option.)
- In another terminal window run stress -m 16 -c 16 -i 16, where 16 is the number of CPU cores on the running machine.
System (please complete the following information)
- OS: Ubuntu Jammy
- ROS 2 Distro: Rolling
- Version: https://build.ros2.org/job/Rdev__rosbag2__ubuntu_jammy_amd64/116/testReport/junit/(root)/projectroot/test_play_services__rmw_fastrtps_cpp/
Additional context
Preliminary analysis
We are getting a segmentation fault when invoking the destructor of eprosima::fastdds::dds::DomainParticipantFactory, which is designed as a singleton with a static shared pointer. This explains why the segmentation fault occurs only when gtest exits from the test suite's process.
According to the call stack, the segmentation fault happens in the eprosima::fastrtps::rtps::TopicPayloadPool::shrink() method at line 273: https://github.com/eProsima/Fast-DDS/blob/7606b96087a8bcfa7d8fe4434cd5d2d253607dfc/src/cpp/rtps/history/TopicPayloadPool.cpp#L273
Recall the error reported earlier in the PlaySrvsTest.play_next test:
[ERROR] [1681669769.002323231] [rosbag2_player_for_test_srvs]: Failed to publish message on '/player_srvs_test_topic' topic.
Error: failed to publish loaned message: publisher's context is invalid, at ./src/rcl/publisher.c:402
[ERROR] [1681669769.008321010] [rosbag2_player_for_test_srvs.rclcpp]: Error in destruction of rcl publisher handle: Failed to delete datawriter, at ./src/publisher.cpp:45, at ./src/rcl/publisher.c:187
Failed to delete dds publisher from participantFailed to delete participant[ OK ] PlaySrvsTest.play_next (477 ms)
This error is directly related to the segmentation fault.
It seems some unexpected failure happens on the Fast-DDS side when we try to publish a message after the context has been deleted. Other DDS implementations gracefully handle such a situation, and Fast-DDS also used to handle it properly, but currently it causes a segfault on process exit.
Some details and specifics about the PlaySrvsTest.play_next test:
https://github.com/ros2/rosbag2/blob/6f373bbaf80ef9eeb3698caabab373627f2f2bb8/rosbag2_transport/test/rosbag2_transport/test_play_services.cpp#L385-L414
In this test we start the rosbag2 player in pause mode, then make a play_next service call and wait for the response to validate reception. We repeat this service call, together with the response check, 200 times. This part of the test runs correctly without any errors. After that we switch the rosbag2 player to continuous playback mode with repeat and check that a call to service_call_play_next(); returns false, since the player is no longer in pause mode. That's all for this test, and this part also runs without issues.
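For orientation, a minimal sketch of this flow (not the actual test source; the fixture and helpers are assumed from the linked test_play_services.cpp, and service_call_play() is a hypothetical stand-in for whatever switches the player to continuous playback):

```cpp
// Hedged sketch of the play_next flow described above; not the actual test code.
#include <gtest/gtest.h>

TEST_F(PlaySrvsTest, play_next_sketch)  // fixture assumed from test_play_services.cpp
{
  // The fixture starts the rosbag2 player in pause mode.
  for (int i = 0; i < 200; ++i) {
    // Each PlayNext service call shall publish exactly one message and succeed.
    ASSERT_TRUE(service_call_play_next());
  }
  // Switch the player to continuous playback with repeat (hypothetical helper).
  service_call_play();
  // PlayNext is now expected to be rejected because the player is not paused.
  EXPECT_FALSE(service_call_play_next());
}
```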
The interesting part starts when the ~PlaySrvsTest() test fixture is destructed:
https://github.com/ros2/rosbag2/blob/6f373bbaf80ef9eeb3698caabab373627f2f2bb8/rosbag2_transport/test/rosbag2_transport/test_play_services.cpp#L66-L72
As you can see, we basically call rclcpp::shutdown(); to shut down any publishers and subscriptions and let other active threads exit. At the same time our player is still in playback mode and publishing messages every 20 ms. This is why we get the error message that publish(message) was called with an invalid context.
In this particular scenario we could work around the problem in the test and stop the player before calling rclcpp::shutdown();. However, this is an integration test, and calling rclcpp::shutdown(); somewhere in the middle of running nodes is a common practice in the ROS community; it is how we stop or interrupt all nodes on a SIGINT signal. It would be more appropriate to handle such a scenario gracefully on the Fast-DDS side rather than causing a segmentation fault on exit.
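A minimal sketch of the two teardown orderings, assuming a hypothetical fixture whose play_thread_ member drives the player (the real fixture is in the linked test_play_services.cpp):

```cpp
// Hedged sketch, not the actual fixture code: illustrates the teardown ordering
// discussed above. play_thread_ is a hypothetical name for the thread that
// keeps the player publishing a message every 20 ms.
#include <thread>
#include <rclcpp/rclcpp.hpp>

class PlaySrvsFixtureSketch
{
public:
  ~PlaySrvsFixtureSketch()
  {
    // Ordering as in the current fixture: the rclcpp context is invalidated
    // while the player thread may still be calling publish(), which produces
    // "failed to publish loaned message: publisher's context is invalid".
    rclcpp::shutdown();
    if (play_thread_.joinable()) {
      play_thread_.join();
    }
    // The test-side workaround would be: stop the player so it no longer
    // publishes, join play_thread_, and only then call rclcpp::shutdown().
  }

private:
  std::thread play_thread_;
};
```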
Some details about the segmentation fault from the debugger
Call stack for Thread-1
Thread 1 "test_play_servi" received signal SIGSEGV, Segmentation fault.
eprosima::fastrtps::rtps::TopicPayloadPool::shrink (this=this@entry=0x5603b9fd5940, max_num_payloads=0) at /home/morlov/ros2_rolling_new/src/eProsima/Fast-DDS/src/cpp/rtps/history/TopicPayloadPool.cpp:273
273 all_payloads_.at(payload->data_index()) = all_payloads_.back();
[libfastrtps.so.2.10] eprosima::fastrtps::rtps::TopicPayloadPool::shrink TopicPayloadPool.cpp:273
[libfastrtps.so.2.10] eprosima::fastrtps::rtps::TopicPayloadPool::release_history TopicPayloadPool.cpp:174
[libfastrtps.so.2.10] eprosima::fastdds::dds::DataWriterImpl::release_payload_pool DataWriterImpl.cpp:1950
[libfastrtps.so.2.10] eprosima::fastdds::dds::DataWriterImpl::~DataWriterImpl DataWriterImpl.cpp:440
[libfastrtps.so.2.10] eprosima::fastdds::statistics::dds::DataWriterImpl::~DataWriterImpl DataWriterImpl.hpp:43
[libfastrtps.so.2.10] eprosima::fastdds::statistics::dds::DataWriterImpl::~DataWriterImpl DataWriterImpl.hpp:43
[libfastrtps.so.2.10] eprosima::fastdds::dds::PublisherImpl::~PublisherImpl PublisherImpl.cpp:111
[libfastrtps.so.2.10] eprosima::fastdds::statistics::dds::PublisherImpl::~PublisherImpl PublisherImpl.hpp:40
[libfastrtps.so.2.10] eprosima::fastdds::statistics::dds::PublisherImpl::~PublisherImpl PublisherImpl.hpp:40
[libfastrtps.so.2.10] eprosima::fastdds::dds::DomainParticipantImpl::~DomainParticipantImpl DomainParticipantImpl.cpp:201
[libfastrtps.so.2.10] eprosima::fastdds::statistics::dds::DomainParticipantImpl::~DomainParticipantImpl DomainParticipantImpl.hpp:57
[libfastrtps.so.2.10] eprosima::fastdds::statistics::dds::DomainParticipantImpl::~DomainParticipantImpl DomainParticipantImpl.hpp:57
[libfastrtps.so.2.10] eprosima::fastdds::dds::DomainParticipantFactory::~DomainParticipantFactory DomainParticipantFactory.cpp:69
[libfastrtps.so.2.10] eprosima::fastdds::dds::DomainParticipantFactory::~DomainParticipantFactory DomainParticipantFactory.cpp:82
[libfastrtps.so.2.10] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release shared_ptr_base.h:155
[libfastrtps.so.2.10] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release shared_ptr_base.h:148
[libfastrtps.so.2.10] std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count shared_ptr_base.h:730
[libfastrtps.so.2.10] std::__shared_ptr<eprosima::fastdds::dds::DomainParticipantFactory, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr shared_ptr_base.h:1169
[libfastrtps.so.2.10] std::shared_ptr<eprosima::fastdds::dds::DomainParticipantFactory>::~shared_ptr shared_ptr.h:103
[libc.so.6] <unknown> 0x00007ff43eeab8a7
[libc.so.6] exit 0x00007ff43eeaba60
[libc.so.6] __libc_start_main 0x00007ff43ee8908a
[test_play_services__rmw_fastrtps_cpp] _start 0x00005603b8808c2e
Variable values when the segmentation fault happened
Signal = SIGSEGV (Segmentation fault)
payload = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x810
buffer = {eprosima::fastrtps::rtps::octet *}
*buffer = {eprosima::fastrtps::rtps::octet}
data_offset = {const size_t} 12
this = {eprosima::fastrtps::rtps::PreallocatedTopicPayloadPool *const} 0x5603b9fd5940
eprosima::fastrtps::rtps::TopicPayloadPool = {eprosima::fastrtps::rtps::TopicPayloadPool}
eprosima::fastrtps::rtps::ITopicPayloadPool = {eprosima::fastrtps::rtps::ITopicPayloadPool}
max_pool_size_ = {uint32_t} 0
infinite_histories_count_ = {uint32_t} 0
finite_max_pool_size_ = {uint32_t} 0
free_payloads_ = {std::vector<eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *>}
all_payloads_ = {std::vector<eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *>}
[0] = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x5603baacec70
[1] = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x5603bac9a460
[2] = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x5603ba6df240
[3] = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x5603baee7b10
[4] = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x5603b9f3cf10
[5] = {eprosima::fastrtps::rtps::TopicPayloadPool::PayloadNode *} 0x5603ba667be0
mutex_ = {std::mutex}
std::__mutex_base = {std::__mutex_base}
_M_mutex = {std::__mutex_base::__native_type}
__data = {__pthread_mutex_s}
__lock = {int} 1
__count = {unsigned int} 0
__owner = {int} 18782
__nusers = {unsigned int} 1
__kind = {int} 0
__spins = {short} 0
__elision = {short} 0
__list = {__pthread_list_t}
__prev = {__pthread_internal_list *} NULL
__next = {__pthread_internal_list *} NULL
__size = {char [40]}
__align = {long} 1
payload_size_ = {uint32_t} 52
minimum_pool_size_ = {uint32_t} 0
max_num_payloads = {uint32_t} 0
Stack tree and line of code where the segmentation fault happened
@clalancette Could you please help forward this issue to engineers from eProsima who will be able to elaborate on it further?
@EduPonz @MiguelCompany can you take a look?
@Mario-DL will take a look at this. Thanks for the very thorough report @MichaelOrlov!
Hello everyone!
Thanks @MichaelOrlov for the detailed report. I could reproduce the exact rcl error message and the subsequent TopicPayloadPool::shrink segfault you mentioned. I will investigate further and come back to you asap.
Hi @MichaelOrlov,
I found that the segfault (assertion failing) in TopicPayloadPool::shrink was introduced in ros2/rmw_fastrtps/568.
We will elaborate further on that, because the problem can stem from the way in which loans are managed, either in rcl or internally in Fast DDS.
@Mario-DL Thanks for finding the breaking PR. Looking forward to seeing the fixes. It would be nice to have a fix or some workaround before the end of this Friday, May 19, to jump "on the train" for the Iron final release. cc: @clalancette
Just to set expectations: unless this turns out to be a major bug, this is probably not going to make the Iron release. We only have about 48 hours in which to get anything new in, and changes to Fast-DDS or rmw_fastrtps (the default RMW vendor) would be pretty dangerous at this point.
Thanks @clalancette for the clarification.
I have been investigating further; the following are the findings and a characterization of the errors.
The checks were basically based on repeating the PlaySrvsTest.play_next test multiple times (~50 per run).
ERR:01: failed to borrow loaned msg: publisher's context is invalid
This error is occasionally thrown by rcl. The cause is calling rclcpp::shutdown() before stopping the player's thread. The instance_id_storage of the context is set to 0 by the rclcpp::shutdown() thread, so in the player's thread rcl_publisher_is_valid() returns false, leading to the error.
This error log will be inevitable as long as ~PlaySrvsTest() is not corrected, i.e. until it calls rclcpp::shutdown() after stopping and joining the player's thread.
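To illustrate the race, here is a minimal sketch with hypothetical names (not the rosbag2 player code; it publishes a plain message for simplicity, whereas the player uses loaned messages when available):

```cpp
// Hedged sketch of the race described above: a thread publishing every 20 ms
// hits "publisher's context is invalid" if rclcpp::shutdown() runs concurrently.
#include <chrono>
#include <thread>
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/empty.hpp>

void playback_loop(rclcpp::Publisher<std_msgs::msg::Empty>::SharedPtr publisher)
{
  using namespace std::chrono_literals;
  // Checking rclcpp::ok() narrows the window but cannot close the race:
  // shutdown() may still zero the context's instance id between the check
  // and the publish() call.
  while (rclcpp::ok()) {
    publisher->publish(std_msgs::msg::Empty());
    std::this_thread::sleep_for(20ms);
  }
}
```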
ERR:02: Error in destruction of rcl publisher handle: Failed to delete datawriter
This error is sometimes thrown alongside the previous one. If rclcpp::shutdown() is called after the player's thread has retrieved the loan, destroy_publisher() tries to delete_datawriter(), but, as there are loans in use, it returns ReturnCode_t::RETCODE_PRECONDITION_NOT_MET, triggering the error.
ERR:03: TopicPayloadPool::shrink() assertion failing
This assertion is thrown only after running all the repetitions, when the test is being destroyed and, consequently, the DomainParticipantFactory singleton destructor is called.
This destructor removes all the created entities. When trying to delete the player's datawriter, the TopicPayloadPool (the buffer for all the serialized messages) cannot be released, since it detects that there is a loan that was not returned to the pool, falling into the assertion.
The working principle of loans is that, once a sample loan is borrowed by an external application or actor (in this case, rcl), it is the responsibility of that application to return the loan for that message, so that the resource can be freed.
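For reference, a minimal example of that lifecycle at the rclcpp level (std_msgs/msg/Int32 is chosen arbitrarily; the rcl and Fast DDS internals discussed here sit below this API):

```cpp
// Hedged example of the loan lifecycle described above: borrow a sample from
// the middleware's payload pool and give it back by publishing it.
#include <utility>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/int32.hpp>

void publish_with_loan(rclcpp::Publisher<std_msgs::msg::Int32> & publisher)
{
  if (publisher.can_loan_messages()) {
    auto loaned = publisher.borrow_loaned_message();  // sample comes from the pool
    loaned.get().data = 42;
    // Publishing returns ownership to the middleware; the loan must not be
    // leaked, otherwise the pool cannot be released on writer destruction.
    publisher.publish(std::move(loaned));
  } else {
    publisher.publish(std_msgs::msg::Int32());  // fallback without loans
  }
}
```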
In this case, the correct way of proceeding would be for rcl to store the loans it has borrowed (maybe inside each rcl_publisher_impl_s struct?) in order to correctly return them when required, and to check that all loans of a loan-capable entity were returned before trying to delete it. That would deal with the errors ERR:01 and ERR:02.
How do you see it?
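A rough sketch of that bookkeeping idea, with hypothetical names (this is not actual rcl or Fast DDS code):

```cpp
// Hedged sketch: track borrowed loans per publisher and return any outstanding
// ones before the underlying datawriter is deleted, so the TopicPayloadPool can
// be released without tripping the shrink() assertion.
#include <algorithm>
#include <functional>
#include <mutex>
#include <vector>

struct LoanTrackingPublisher
{
  std::mutex mutex;
  std::vector<void *> outstanding_loans;  // samples currently borrowed

  void on_loan_borrowed(void * sample)
  {
    std::lock_guard<std::mutex> lock(mutex);
    outstanding_loans.push_back(sample);
  }

  void on_loan_returned(void * sample)
  {
    std::lock_guard<std::mutex> lock(mutex);
    outstanding_loans.erase(
      std::remove(outstanding_loans.begin(), outstanding_loans.end(), sample),
      outstanding_loans.end());
  }

  // Called before deleting the datawriter: return anything still outstanding.
  void return_all(const std::function<void(void *)> & return_loan)
  {
    std::lock_guard<std::mutex> lock(mutex);
    for (void * sample : outstanding_loans) {
      return_loan(sample);
    }
    outstanding_loans.clear();
  }
};
```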
This issue has been mentioned on ROS Discourse. There might be relevant details there:
https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2023-05-18/31587/1
@clalancette @wjwwood I need your help with a decision on how to proceed with the fix for this issue.
In my opinion, the fix shall be done on the Fast DDS side, and here is the reasoning for that:
- Fast DDS shouldn't cause a segfault on domain participant destruction in any way.
- Other DDS implementations like Cyclone DDS and Connext DDS gracefully handle such a situation.
- If we are deleting a domain participant, we imply that memory loaned from it is invalidated and will not be used anymore. That seems a reasonable assumption in the general case.
@clalancette @wjwwood Friendly ping here. ^^^^ Need your decision/opinion on this task to be able to move forward with it.
@clalancette @wjwwood Friendly ping here. ^^^^
@clalancette Friendly ping here. ^^^
Closing this issue since a workaround was implemented in https://github.com/ros2/rosbag2/pull/1556.