rmw_fastrtps
Reliable publisher/subscriber connection drops messages in high-frequency scenarios with build type Release
Bug report
Required Info:
- Operating System: Ubuntu 18.04
- Installation type: deb packages or source
- Version or commit hash: Eloquent or ros2 master branch (22 October 2019)
- DDS implementation: FastRTPS, OpenSplice, RTI Connext
- Client library (if applicable): rclcpp
Steps to reproduce issue
Create a simple publisher/subscriber pair, both with reliable QoS settings. Publish a fixed number of messages on the publisher side at a high frequency (>1 kHz, i.e. intervals <1 ms, e.g. 1 µs).
Example: Talker:
```c++
// Excerpt; assumes the usual includes (rclcpp/rclcpp.hpp, std_msgs/msg/int32.hpp)
// and `using namespace std::chrono_literals;` for the 1us literal.
class Talker : public rclcpp::Node
{
public:
  DEMO_NODES_CPP_PUBLIC
  explicit Talker(const rclcpp::NodeOptions & options)
  : Node("talker", options)
  {
    setvbuf(stdout, NULL, _IONBF, BUFSIZ);
    auto publish_message =
      [this]() -> void
      {
        auto msg = std::make_unique<std_msgs::msg::Int32>();
        msg->data = count_++;
        pub_->publish(std::move(msg));
        if (count_ > 100000) {
          timer_->cancel();
          RCLCPP_INFO(this->get_logger(), "Sent 100000 messages");
        }
      };
    rclcpp::QoS qos = rclcpp::QoS(rclcpp::KeepLast(1000)).reliable();
    pub_ = this->create_publisher<std_msgs::msg::Int32>("chatter", qos);
    RCLCPP_INFO(this->get_logger(), "Sending messages...");
    timer_ = this->create_wall_timer(1us, publish_message);
  }

private:
  int32_t count_ = 1;
  rclcpp::Publisher<std_msgs::msg::Int32>::SharedPtr pub_;
  rclcpp::TimerBase::SharedPtr timer_;
};
```
Listener:
```c++
class Listener : public rclcpp::Node
{
public:
  DEMO_NODES_CPP_PUBLIC
  explicit Listener(const rclcpp::NodeOptions & options)
  : Node("listener", options),
    m_messageID(0),
    m_numReceivedMsgs(0)
  {
    setvbuf(stdout, NULL, _IONBF, BUFSIZ);
    auto callback =
      [this](const std_msgs::msg::Int32::SharedPtr msg) -> void
      {
        ++m_numReceivedMsgs;
        // msgDiff > 1 indicates dropped messages (a loss report can be added here, see below)
        int32_t msgDiff = msg->data - m_messageID;
        m_messageID = msg->data;
        (void)msgDiff;
      };
    rclcpp::QoS qos = rclcpp::QoS(rclcpp::KeepLast(1000)).reliable();
    sub_ = create_subscription<std_msgs::msg::Int32>("chatter", qos, callback);
    RCLCPP_INFO(this->get_logger(), "Ready to receive messages");
  }

  ~Listener()
  {
    RCLCPP_INFO(this->get_logger(), "Received %d messages", m_numReceivedMsgs);
    RCLCPP_INFO(this->get_logger(), "Lost %d messages", 100000 - m_numReceivedMsgs);
  }

private:
  rclcpp::Subscription<std_msgs::msg::Int32>::SharedPtr sub_;
  int32_t m_messageID;
  int32_t m_numReceivedMsgs;
};
```
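In the linked repository these classes are registered as components; for a standalone build, a minimal entry point would look roughly like this (a sketch, not code from the original example):
```c++
// Hypothetical standalone main for the Listener (same pattern for the Talker).
int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<Listener>(rclcpp::NodeOptions()));
  rclcpp::shutdown();
  return 0;
}
```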
Expected behavior
The subscriber receives all messages in order, or an exception is thrown (or the error is propagated in some other way).
Actual behavior
Some messages are published but not received on the subscriber side, without any notification to the user or thrown errors. The connection basically behaves like the best_effort strategy.
Additional information
A full implementation example which reproduces the error can be found here: https://github.com/DensoADAS/high_freq_pub_example.git This is basically the Talker/Listener demo from the demo_nodes_cpp package with a different message type for easier counting and a higher publishing rate.
This problem occurs with various middleware implementations (both with the eloquent binaries and the ros2 master) and usually gets worse the higher the publisher frequency is. Different QoS settings were also tested (different depths, keepAll), but the problem still occurs (less frequently with keepAll, but some messages are still dropped). The problem can already be observed with normal publishing rates, e.g. intervals between published messages of about 1 ms.
With the simple example above the following dropped message counts were observed (five runs each):

| Sent messages | Publisher interval | Implementation | Dropped messages (5 runs) |
|---|---|---|---|
| 10000 | 1 µs | FastRTPS | 7541, 7926, 7192, 8623, 8255 |
| 10000 | 1 µs | OpenSplice | 4946, 426, 5181, 3934, 5253 |
| 10000 | 1 µs | RTI Connext | 8603, 2755, 1417, 5355, 5467 |
| 10000 | 1 ms | FastRTPS | 53, 0, 33, 1474, 61 |
| 100000 | 1 µs | FastRTPS | 96052, 94911, 96321, 97322, 95244 |
| 100000 | 1 µs | OpenSplice | 9369, 7710, 15950, 51295, 12835 |
| 100000 | 1 µs | RTI Connext | 45810, 22624, 41902, 30460, 21139 |
@DensoADAS
Can you describe your platform in detail? What hardware are you using?
Multiple machines were tested, all show the issue:
- OS: Ubuntu 18.04, CPU: Intel Core i5-6500 @ 3.20GHz, RAM: 24GB
- OS: Ubuntu 18.04, CPU: Intel Xeon E5-2680 v4 @ 2.40GHz, RAM: 16GB
- OS: Ubuntu 18.04 (virtual machine on a Windows 10 host), CPU: Intel Core i7-6820HQ @ 2.70GHz, RAM: 8GB
The software on all machines is up-to-date. The test was executed on all machines using the eloquent binaries and in the VM also with the latest ros2 master.
Tell me if you need more information.
Thanks for reporting this issue! I was able to reproduce this behaviour by sending 10 msgs at frequencies ranging from 1us to 1s. The higher the frequency at which messages were sent, the more loss I observed. OpenSplice seems to have the most reliable delivery among the middleware implementations, with Fastrtps being the worst, which matches the dropped message counts above. I'm currently not sure what caused this problem but will look into it.
I tried to reproduce the problem and the first thing I noticed is that the talker starts publishing messages right away, without waiting for the subscriber to be discovered and matched.
You can easily verify that by printing msg->data for the first received message in the listener: it is greater than 1, and the offset matches the missing message count exactly.
Adding a sleep in the talker after creating the publisher but before creating the timer (e.g. std::this_thread::sleep_for(std::chrono::seconds(1));) lets the discovery happen and match the two endpoints before the first message is published. With that change the listener receives all the messages for me.
Diff to sleep before publishing and print the first received msg->data:
```diff
diff --git a/src/listener.cpp b/src/listener.cpp
index 52b2bca..7f29d73 100644
--- a/src/listener.cpp
+++ b/src/listener.cpp
@@ -26,6 +26,10 @@ public:
     auto callback =
       [this](const std_msgs::msg::Int32::SharedPtr msg) -> void
       {
+        if (first) {
+          first = false;
+          fprintf(stderr, "%d\n", msg->data);
+        }
         ++m_numReceivedMsgs;
         int32_t msgDiff = msg->data - m_messageID;
         m_messageID = msg->data;
@@ -51,6 +55,7 @@ public:
   }

 private:
+  bool first = true;
   rclcpp::Subscription<std_msgs::msg::Int32>::SharedPtr sub_;
   int32_t m_messageID;
   int32_t m_numReceivedMsgs;
diff --git a/src/talker.cpp b/src/talker.cpp
index f57cd3c..62c067e 100644
--- a/src/talker.cpp
+++ b/src/talker.cpp
@@ -41,6 +41,7 @@ public:
     rclcpp::QoS qos = rclcpp::QoS(rclcpp::KeepLast(1000)).reliable();
     pub_ = this->create_publisher<std_msgs::msg::Int32>("chatter", qos);
+    std::this_thread::sleep_for(std::chrono::seconds(1));
     RCLCPP_INFO(this->get_logger(), "Sending messages...");
     timer_ = this->create_wall_timer(1us, publish_message);
   }
```
Please comment if you can reproduce this and if there is any problem remaining.
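Instead of a fixed sleep, the wait can also be made deterministic by polling the publisher's match count. A minimal sketch, assuming rclcpp's get_subscription_count() is available in the release under test (this is not part of the diff above):
```c++
// Sketch: block in the talker until at least one subscription has matched,
// before creating the publish timer.
while (pub_->get_subscription_count() == 0) {
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
```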
@dirk-thomas The sleep fixes the missing messages at the beginning, but the general problem still exists. Messages are dropped (quite randomly, as it seems) throughout the publishing process; you can check this by adding the commented out code in line 34 of the listener:
```c++
RCLCPP_ERROR(this->get_logger(), "Lost %d msgs, curr: %d", msgDiff - 1, msg->data);
```
With the sleep and 100000 sent messages I still get around 80000 dropped ones with FastRTPS and the eloquent binaries.
@claireyywang @dirk-thomas @Tobi2001
Confirmed this problem with Fast-RTPS; the higher the frequency, the more losses occur.
I looked into fastrtps, and it is mentioned that in high-rate or large-data scenarios there are a couple of parameters that need to be tuned to avoid network packet drops. Based on the documentation and presentation slides, it is possible to tune those parameters using an XML file.
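For illustration, such a profile might enlarge the participant's socket buffers. A sketch using element names from the Fast-RTPS XML profiles schema; the profile name and sizes are untested placeholders, not recommendations from this thread:
```xml
<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
  <profiles>
    <participant profile_name="participant_tuned" is_default_profile="true">
      <rtps>
        <!-- placeholder sizes for the UDP send/receive socket buffers -->
        <sendSocketBufferSize>1048576</sendSocketBufferSize>
        <listenSocketBufferSize>4194304</listenSocketBufferSize>
      </rtps>
    </participant>
  </profiles>
</dds>
```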
The sleep fixes the missing messages at the beginning but the general problem still exists.
I can't reproduce the problem. With the sleep in place (and the error message commented in) the listener still reports the following for me, with no error messages:
```
[INFO] [listener]: Received 100000 messages
[INFO] [listener]: Lost 0 messages
```
@DensoADAS @Tobi2001 @fujitatomoya
@dirk-thomas and I did more testing and found out that by default (without any args) fastrtps is built with build type Release, which caused the ~70% message drop. But if we build Debug instead, which is what @dirk-thomas used, the performance improves significantly: message drop around 0-10%.
Here are the steps we took:
```
cd ~/ros2_ws/build/
rm -rf fastrtps  # delete fastrtps build folder from workspace
cd ~/ros2_ws     # go to the root of workspace
colcon build --packages-select fastrtps --cmake-args -DCMAKE_BUILD_TYPE=Debug  # rebuild in debug type
```
Please let me know if you are able to get better performance after trying it.
@richiprosima The behavior we are seeing here is pretty weird. When building FastRTPS with the build type Release it drops a very significant number of messages (~60%-70%). When building Debug it surprisingly behaves much better: it drops none (or sometimes up to 10%).
DensoADAS/high_freq_pub_example#1 contains the above example with the additional sleep to ensure the endpoints have matched.
I already tried to get rid of all conditional logic in FastRTPS (depending on __DEBUG and NDEBUG) but that didn't make any difference. Any insight would be more than welcome.
@DensoADAS
I confirmed that cyclonedds does not have any message drop. I have tested more than 10 times with cyclonedds; can you double-check to confirm?
[Subscriber]
```
root@6cc2c62d0dd0:~/ros2_colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub listener
[INFO] [listener_high_freq]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener_high_freq]: Received 100000 messages
[INFO] [listener_high_freq]: Lost 0 messages
```
[Publisher]
```
root@6cc2c62d0dd0:~/ros2_colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub talker
[INFO] [talker_high_freq]: Sending messages...
[INFO] [talker_high_freq]: Sent 100000 messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
```
@claireyywang @dirk-thomas It surely depends on the DDS implementation; we need some help from each DDS vendor. Do you want me to create an issue for each rmw implementation, such as rmw_fastrtps?
@fujitatomoya please try to reproduce the behavior with the different build types of FastRTPS mentioned above before adding another dimension to the problem.
@dirk-thomas
When building FastRTPS with the build type Release it drops a very significant number of messages (~60%-70%). When building Debug it surprisingly behaves much better - it drops none (or sometimes up to 10%).
Confirmed almost the same behavior.
Confirmed almost the same behavior.
Thanks for checking.
Since the current ticket indicates a problem with FastRTPS I moved the ticket to the rmw_fastrtps repository.
It surely depends on the DDS implementation; we need some help from each DDS vendor. Do you want me to create an issue for each rmw implementation, such as rmw_fastrtps?
If there are similar problems with other RMW implementations, please open a separate ticket in their RMW repo, or, if it applies to all RMW implementations, in one of the general ROS 2 repos.
The behavior we are seeing here is pretty weird. When building FastRTPS with the build type Release it drops a very significant number of messages (~60%-70%). When building Debug it surprisingly behaves much better: it drops none (or sometimes up to 10%).
@richiprosima Can you please comment here to confirm that you have seen this ticket and that you will look into the problem? The ticket should have enough information to reproduce it. This seems to be a pretty severe performance problem with the default build type (Release) of FastRTPS.
I have tried cyclonedds as well and can't get any drops on a real machine (release mode). In a VM there are some, though definitely far fewer than with the other DDS implementations (in about 30% of the runs I get drops of 0-2% of the messages).
Debug mode also works better in comparison for all DDS implementations, but with FastRTPS about 50% of the sent messages still can't be received (the runtime increases severely though, which might shadow the actual problem).
I was able to do some quick digging. Our first thought was that it is a problem with the QoS configuration, in particular with the history depth and the asynchronous thread. By default in ROS 2, FastRTPS and Connext use the asynchronous thread. This means the user's sample is stored in the history and another thread is in charge of sending it, not the user's thread directly. Because the example is writing so fast, the history may be getting full and removing samples before they are sent by the async thread. I ran the test on my machine using localhost:
- Default ROS2 configuration:
```
[INFO] [listener]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener]: Received 89098 messages
[INFO] [listener]: Lost 10902 messages
```
- Changing the QoS to use the sync thread:
```
[INFO] [listener]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener]: Received 100000 messages
[INFO] [listener]: Lost 0 messages
```
FYI, we are working on supporting sending large data using the synchronous thread; then we can use it by default in ROS 2.
@richiware
The QoS for the sync thread you mentioned in the previous comment is a fastrtps native setting, right (not a ROS2 QoS)?
Could you tell us how/what to set in the fastrtps QoS to enable the sync thread? I will confirm it on my side as well.
Write a DEFAULT_FASTRTPS_PROFILES.xml file in the directory where you are launching the executables. The XML content:
```xml
<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
  <profiles>
    <publisher profile_name="publisher_default" is_default_profile="true">
      <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
    </publisher>
  </profiles>
</dds>
```
When launching the talker, use this command:
```
RMW_FASTRTPS_USE_QOS_FROM_XML=1 ./build/high_freq_pub/talker
```
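For the synchronous publish mode asked about above, the Fast-RTPS XML profiles schema also has an explicit publishMode element. A sketch of a publisher profile selecting it (the profile name is illustrative, not taken from this thread):
```xml
<!-- goes inside the profiles element of the file above -->
<publisher profile_name="publisher_sync" is_default_profile="true">
  <qos>
    <publishMode>
      <kind>SYNCHRONOUS</kind>
    </publishMode>
  </qos>
</publisher>
```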
@richiware While the custom config might work around the problem, I would like to find out why a release build has so much worse performance than a debug build.
@dirk-thomas
My guess: in the release build you are filling the buffers faster than in the debug build.
My guess: in the release build you are filling the buffers faster than in the debug build.
@JaimeMartin I don't understand how this statement makes any sense. Both builds use the exact same application with a fixed frequency. How can the debug build keep up with the rate and not lose any packages while the release build drops 60-70%?
@dirk-thomas Because perhaps in the debug build you are not really sending at that frequency, as it takes more time to call write. I have read they are using sleep periods of 1 µs.
We will have more details soon.
perhaps in the debug build you are not really sending at that frequency, as it takes more time to call write.
@JaimeMartin I added measuring the time it takes to do the publishing in DensoADAS/high_freq_pub_example#3. For me the output is the following, which doesn't indicate that the release build is publishing data faster at all:
- Debug build:
```
[INFO] [talker]: Sent 100000 messages in 13,772,113,709 ns
```
- Release build:
```
[INFO] [talker]: Sent 100000 messages in 15,871,665,372 ns
```
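The measurement in the linked PR is presumably along these lines; a sketch, not the exact code from the PR:
```c++
// Sketch: measure the total wall time from the first to the last publish call.
auto start = std::chrono::steady_clock::now();
// ... publish all 100000 messages ...
auto end = std::chrono::steady_clock::now();
auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
RCLCPP_INFO(
  this->get_logger(), "Sent 100000 messages in %lld ns",
  static_cast<long long>(ns));
```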
@dirk-thomas thanks for the experiment. We are working on this now, we will update you as soon we have more information.
@dirk-thomas @JaimeMartin @richiware
Confirmed no messages dropped with the following QoS profile:
```xml
<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
  <profiles>
    <publisher profile_name="publisher_default" is_default_profile="true">
      <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
    </publisher>
  </profiles>
</dds>
```
```
root@6cc2c62d0dd0:~/ros2_colcon_ws# RMW_FASTRTPS_USE_QOS_FROM_XML=1 ros2 run high_freq_pub talker
[INFO] [talker_high_freq]: Sending messages...
[INFO] [talker_high_freq]: Sent 100000 messages
root@6cc2c62d0dd0:~/ros2_colcon_ws# RMW_FASTRTPS_USE_QOS_FROM_XML=1 ros2 run high_freq_pub listener
[INFO] [listener_high_freq]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener_high_freq]: Received 100000 messages
[INFO] [listener_high_freq]: Lost 0 messages
```
Dear all,
About this problem I have a different opinion: I don't think the 'drop' behavior is a problem.
'Reliable' is a QoS for the DDS layer. The specification (Data Distribution Service, v1.4, page 98) defines 'Reliable' as: "Specifies the Service will attempt to deliver all samples in its history. Missed samples may be retried." Based on my understanding, it ensures that samples in the history are delivered successfully; that is, once a sample is sent, its delivery is guaranteed (the sample is confirmed to be received). It does not guarantee that every sample ever written remains in the history to be delivered (samples can be dropped from the history first).
As long as samples are written faster than DDS can send them, the 'drop' behavior must exist, except when HISTORY is set to KEEP_ALL. For example, with a depth of 1000, once the history holds 1000 not-yet-sent samples, writing the next sample evicts the oldest one, which is then lost. The test talker and listener use a history depth of 1000 but send 10000 samples; this must cause the problem.
On the fastrtps publisher side, the publisher posts data into the publisher history and uses another thread to send it.
If fastrtps sends data over the transport more slowly than ROS 2 publishes it, data may be lost (removed) due to the history depth limitation:
```c++
bool PublisherHistory::add_pub_change(
        CacheChange_t* change,
        WriteParams &wparams,
        std::unique_lock<RecursiveTimedMutex>& lock,
        std::chrono::time_point<std::chrono::steady_clock> max_blocking_time)
{
    if(m_isHistoryFull)
    {
        bool ret = false;
        if(m_historyQos.kind == KEEP_ALL_HISTORY_QOS)
        {
            ret = this->mp_writer->try_remove_change(max_blocking_time, lock);
        }
        else if(m_historyQos.kind == KEEP_LAST_HISTORY_QOS)
        {
            ret = this->remove_min_change();
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ <== deletes published data that may not have been sent yet
        }
    // ...
```
On the fastrtps subscriber side, the subscriber puts received data into the subscriber history, and ROS 2 uses another thread to call the subscriber callback.
If the ROS 2 subscriber callback (ROS 2 taking data from fastrtps) is slower than RTPS receiving data, data may be lost:
```c++
bool SubscriberHistory::received_change_keep_last_no_key(
        CacheChange_t* a_change,
        size_t /* unknown_missing_changes_up_to */ )
{
    bool add = false;
    if (m_changes.size() < static_cast<size_t>(m_historyQos.depth) )
    {
        add = true;
    }
    else
    {
        // Try to substitute the oldest sample.
        // As the history should be ordered following the presentation QoS, we can always remove the first one.
        add = remove_change_sub(m_changes.at(0));
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ <== deletes received data that ROS 2 may not have taken yet
    }

    if (add)
    {
        return add_received_change(a_change);
    }

    return false;
}
```
NOTE: Is this behavior a problem?
- Using rclcpp::KeepAll() / rclcpp::KeepLast(100000) instead of rclcpp::KeepLast(1000) works well (logs below; see the QoS sketch after them):
```
root@bdk:~/workspace# ros2 run high_freq_pub listener
[INFO] [listener]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener]: Received 100000 messages
[INFO] [listener]: Lost 0 messages
```
```
root@bdk:~/workspace# ros2 run high_freq_pub talker
[INFO] [talker]: Sending messages...
[INFO] [talker]: Sent 100000 messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
```
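The QoS change referred to above amounts to a one-line edit in both the talker and the listener, roughly:
```c++
// Keep every sample until it is delivered, instead of only the last 1000.
rclcpp::QoS qos = rclcpp::QoS(rclcpp::KeepAll()).reliable();
```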
Thank you very much @Barry-Xu-2018 for your thorough explanation.
I would like to add what happens when not using the asynchronous writing thread. In that case, the thread adding the sample is blocked until the RTPS datagram is sent to the network, so each sample is added after the previous one has been physically put into the socket output buffer, and then the publisher doesn't drop any messages.
While all of this is correct, it doesn't explain why it works in debug mode but fails in release mode. And that is the pending question to answer.
@Barry-Xu-2018 @MiguelCompany
Using rclcpp::KeepAll() / rclcpp::KeepLast(100000) instead of rclcpp::KeepLast(1000) works well.
Are you sure about this? This is actually what I have done before; I receive the following error from rmw_fastrtps right after starting to publish data:
```
root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub talker
[INFO] [talker_high_freq]: Sending messages...
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what():  failed to publish message: cannot publish data, at /root/ros2_colcon_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_publish.cpp:52, at /root/ros2_colcon_ws/src/ros2/rcl/rcl/src/rcl/publisher.c:290
```
I did not have time to debug the Fast-RTPS implementation, but it seems that the new change does not get applied via create_new_change().
@DensoADAS
Could you please confirm whether you get the same runtime error with the following? https://github.com/DensoADAS/high_freq_pub_example/pull/4
@fujitatomoya I can confirm the exception, which probably happens because the resource limit is exceeded. When catching the exception and resending the message (see this branch for an example), all messages are received by the subscriber; a sketch of that approach follows.
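This is only a sketch of the catch-and-retry idea; the linked branch may differ in detail, and the back-off duration and the count_ member (from the talker example above) are placeholders:
```c++
// Retry publishing until the middleware accepts the sample.
std_msgs::msg::Int32 msg;
msg.data = count_++;
while (true) {
  try {
    pub_->publish(msg);
    break;
  } catch (const rclcpp::exceptions::RCLError &) {
    // Resource/history limits hit; back off briefly and try again.
    std::this_thread::sleep_for(std::chrono::microseconds(100));
  }
}
```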
The runtime of FastRTPS is quite high though, in comparison to the other vendors.
Time until all messages (100000) are published:
- FastRTPS: ~20min
- OpenSplice: ~14s
- CycloneDDS: ~13s
Also, the workaround using the XML file for FastRTPS does not really change anything for me. The only difference is that I get outputs like the following:
```
2019-11-25 09:47:58.770 [RTPS_HISTORY Error] Change payload size of '96' bytes is larger than the history payload size of '35' bytes and cannot be resized. -> Function add_change
```