
Reliable publisher/subscriber connection drops messages in high-frequency scenarios with build type Release

DensoADAS opened this issue 5 years ago • 41 comments

Bug report

Required Info:

  • Operating System:
    • Ubuntu 18.04
  • Installation type:
    • deb packages or source
  • Version or commit hash:
    • Eloquent or ros2 master branch (22 October 2019)
  • DDS implementation:
    • FastRTPS, OpenSplice, RTI Connext
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

Create a simple publisher/subscriber pair, both with reliable QoS settings. Publish a fixed number of messages on the publisher side at a high frequency (>1 kHz, i.e. intervals below 1 ms, e.g. 1 us).

Example: Talker:

#include <chrono>
#include <cstdio>
#include <memory>
#include <utility>

#include "demo_nodes_cpp/visibility_control.h"
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/int32.hpp"

using namespace std::chrono_literals;  // needed for the 1us timer period below

class Talker : public rclcpp::Node
{
public:
    DEMO_NODES_CPP_PUBLIC
    explicit Talker(const rclcpp::NodeOptions& options)
        : Node("talker", options)
    {
        setvbuf(stdout, NULL, _IONBF, BUFSIZ);
        auto publish_message =
            [this]() -> void
            {
                auto msg = std::make_unique<std_msgs::msg::Int32>();
                msg->data = count_++;
                pub_->publish(std::move(msg));

                if (count_ > 100000)
                {
                    timer_->cancel();
                    RCLCPP_INFO(this->get_logger(), "Sent 100000 messages");
                }
            };

        rclcpp::QoS qos = rclcpp::QoS(rclcpp::KeepLast(1000)).reliable();
        pub_ = this->create_publisher<std_msgs::msg::Int32>("chatter", qos);

        RCLCPP_INFO(this->get_logger(), "Sending messages...");
        timer_ = this->create_wall_timer(1us, publish_message);
    }

private:
    int32_t count_ = 1;
    rclcpp::Publisher<std_msgs::msg::Int32>::SharedPtr pub_;
    rclcpp::TimerBase::SharedPtr timer_;
};

Listener:

#include <cstdio>

#include "demo_nodes_cpp/visibility_control.h"
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/int32.hpp"

class Listener : public rclcpp::Node
{
public:
    DEMO_NODES_CPP_PUBLIC
    explicit Listener(const rclcpp::NodeOptions& options)
        : Node("listener", options),
          m_messageID(0),
          m_numReceivedMsgs(0)
    {
        setvbuf(stdout, NULL, _IONBF, BUFSIZ);
        auto callback =
            [this](const std_msgs::msg::Int32::SharedPtr msg) -> void
            {
                ++m_numReceivedMsgs;
                int32_t msgDiff = msg->data - m_messageID;
                m_messageID = msg->data;
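                // Uncomment to log gaps as they occur (referenced later in this thread):
                // RCLCPP_ERROR(this->get_logger(), "Lost %d msgs, curr: %d", msgDiff - 1, msg->data);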
            };

        rclcpp::QoS qos = rclcpp::QoS(rclcpp::KeepLast(1000)).reliable();
        sub_ = create_subscription<std_msgs::msg::Int32>("chatter", qos, callback);
        RCLCPP_INFO(this->get_logger(), "Ready to receive messages");
    }

    ~Listener()
    {
        RCLCPP_INFO(this->get_logger(), "Received %d messages", m_numReceivedMsgs);
        RCLCPP_INFO(this->get_logger(), "Lost %d messages", 100000 - m_numReceivedMsgs);
    }

private:
    rclcpp::Subscription<std_msgs::msg::Int32>::SharedPtr sub_;
    int32_t m_messageID;
    int32_t m_numReceivedMsgs;
};

Expected behavior

The subscriber receives all messages in order, or an error is reported to the user (e.g. via an exception or some other error-propagation mechanism).

Actual behavior

Some messages are published but never received on the subscriber side, without any notification to the user or thrown errors. The connection effectively behaves like a best_effort one.

Additional information

A full implementation example that reproduces the error can be found here: https://github.com/DensoADAS/high_freq_pub_example.git. It is basically the Talker/Listener demo from the demo_nodes_cpp package, with a different message type for easier counting and a higher publishing rate.

This problem occurs with various middleware implementations (with both the Eloquent binaries and the ros2 master) and usually gets worse the higher the publisher frequency is. Different QoS settings were also tested (different depths, KeepAll), but the problem still occurs (less frequently with KeepAll, but some messages are still dropped). The problem can already be observed at normal publishing rates, e.g. intervals of about 1 ms between published messages.

With the simple example above the following dropped message counts were observed:

Sent messages: 10000 (5 runs each):

Publisher interval 1us:
FastRTPS:       7541    7926    7192    8623    8255
OpenSplice:     4946    426     5181    3934    5253
RTI Connext:    8603    2755    1417    5355    5467

Publisher interval 1ms:
FastRTPS:       53      0       33      1474    61


Sent messages: 100000 (5 runs each):

Publisher interval 1us:
FastRTPS:       96052   94911   96321   97322   95244
OpenSplice:     9369    7710    15950   51295   12835
RTI Connext:    45810   22624   41902   30460   21139

DensoADAS avatar Oct 28 '19 15:10 DensoADAS

@DensoADAS

can you describe your platform detail? what hardware are you trying to use?

fujitatomoya avatar Oct 29 '19 00:10 fujitatomoya

Multiple machines were tested, all got the issue:

  1. OS: Ubuntu 18.04 CPU: Intel Core i5-6500 @ 3.20GHz RAM: 24GB
  2. OS: Ubuntu 18.04 CPU: Intel Xeon E5-2680 v4 @ 2.40GHz RAM: 16GB
  3. OS: Ubuntu 18.04 (virtual machine on Windows 10 host) CPU: Intel Core i7-6820HQ @ 2.70GHz RAM: 8GB

The software on all machines is up-to-date. The test was executed on all machines using the eloquent binaries and in the VM also with the latest ros2 master.

Tell me if you need more information.

DensoADAS avatar Oct 29 '19 06:10 DensoADAS

Thanks for reporting this issue! I was able to reproduce this behaviour by sending 10 msgs at frequencies ranging from 1us to 1s. The higher the frequency at which messages were sent, the more loss I observed. OpenSplice seems to have the most reliable delivery among the middleware implementations, with FastRTPS being the worst. This also matches the dropped-message counts reported above. I'm currently not sure what causes this problem but will look into it.

claireyywang avatar Nov 15 '19 18:11 claireyywang

I tried to reproduce the problem and the first thing I noticed is that the talker is starting to publish messages right away without waiting for the subscriber to be discovered and matched.

You can easily reproduce that by printing msg->data for the first received message in the listener - it is greater than 1 and the offset matches the missing message count exactly.

Adding a sleep in the talker after creating the publisher but before creating the timer (e.g. std::this_thread::sleep_for(std::chrono::seconds(1));) lets the discovery happen and match the two endpoints before the first message is published. With that change the listener receives all the messages for me.

Diff to sleep before publishing and print the first received msg->data:
diff --git a/src/listener.cpp b/src/listener.cpp
index 52b2bca..7f29d73 100644
--- a/src/listener.cpp
+++ b/src/listener.cpp
@@ -26,6 +26,10 @@ public:
         auto callback =
             [this](const std_msgs::msg::Int32::SharedPtr msg) -> void
             {
+                if (first) {
+                    first = false;
+                    fprintf(stderr, "%d\n", msg->data);
+                }
                 ++m_numReceivedMsgs;
                 int32_t msgDiff = msg->data - m_messageID;
                 m_messageID = msg->data;
@@ -51,6 +55,7 @@ public:
     }
 
 private:
+    bool first = true;
     rclcpp::Subscription<std_msgs::msg::Int32>::SharedPtr sub_;
     int32_t m_messageID;
     int32_t m_numReceivedMsgs;
diff --git a/src/talker.cpp b/src/talker.cpp
index f57cd3c..62c067e 100644
--- a/src/talker.cpp
+++ b/src/talker.cpp
@@ -41,6 +41,7 @@ public:
         rclcpp::QoS qos = rclcpp::QoS(rclcpp::KeepLast(1000)).reliable();
         pub_ = this->create_publisher<std_msgs::msg::Int32>("chatter", qos);
 
+        std::this_thread::sleep_for(std::chrono::seconds(1));
         RCLCPP_INFO(this->get_logger(), "Sending messages...");
         timer_ = this->create_wall_timer(1us, publish_message);
     }

Please comment if you can reproduce this and if there is any problem remaining.

dirk-thomas avatar Nov 16 '19 05:11 dirk-thomas

@dirk-thomas The sleep fixes the missing messages at the beginning, but the general problem still exists. Messages are dropped (seemingly at random) throughout the publishing process; you can check this by enabling the commented-out code in line 34 of the listener:

RCLCPP_ERROR(this->get_logger(), "Lost %d msgs, curr: %d", msgDiff - 1, msg->data);

With the sleep and 100000 sent messages I still get around 80000 dropped ones with FastRTPS and the eloquent binaries.

Tobi2001 avatar Nov 16 '19 12:11 Tobi2001

@claireyywang @dirk-thomas @Tobi2001

Confirmed this problem with Fast-RTPS. The higher the frequency, the more losses occur.

fujitatomoya avatar Nov 18 '19 07:11 fujitatomoya

I looked into FastRTPS, and its documentation mentions that in high-rate or large-data scenarios a couple of parameters need to be tuned to avoid network packet drops. Based on the documentation and presentation slides, it is possible to tune those parameters using an XML file.
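For reference, a sketch of the kind of profile such tuning uses (the socket-buffer elements below follow the Fast-RTPS participant XML schema; the sizes are illustrative, not values recommended in this thread):

<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <profiles>
        <participant profile_name="participant_default" is_default_profile="true">
            <rtps>
                <!-- Larger UDP socket buffers reduce drops under bursty high-rate traffic -->
                <sendSocketBufferSize>1048576</sendSocketBufferSize>
                <listenSocketBufferSize>4194304</listenSocketBufferSize>
            </rtps>
        </participant>
    </profiles>
</dds>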

claireyywang avatar Nov 19 '19 01:11 claireyywang

The sleep fixes the missing messages at the beginning, but the general problem still exists.

I can't reproduce the problem. With the sleep in place (and the error message commented in) the listener still reports the following for me with no error messages:

[INFO] [listener]: Received 100000 messages
[INFO] [listener]: Lost 0 messages

dirk-thomas avatar Nov 19 '19 17:11 dirk-thomas

@DensoADAS @Tobi2001 @fujitatomoya

@dirk-thomas and I did more testing and found out that by default (without any extra arguments) the fastrtps build type is Release, which causes the ~70% message drop. But if we build Debug instead, which is what @dirk-thomas used, performance improves significantly: the message drop is around 0-10%.

Here are the steps we took:

  cd ~/ros2_ws/build/ 
  rm -rf fastrtps  # delete fastrtps build folder from workspace
  cd ~/ros2_ws # go to the root of workspace
  colcon build --packages-select fastrtps --cmake-args -DCMAKE_BUILD_TYPE=Debug  # rebuild in debug type

Please let me know if you are able to get better performance after trying it.

claireyywang avatar Nov 19 '19 18:11 claireyywang

@richiprosima The behavior we are seeing here is pretty weird. When building FastRTPS with the build type Release it drops a very significant number of messages (~60%-70%). When building Debug it surprisingly behaves much better - it drops none (or sometimes up to 10%).

DensoADAS/high_freq_pub_example#1 contains the above example with the additional sleep to ensure the endpoints have matched.

I already tried to get rid of all conditional logic in FastRTPS (depending on __DEBUG and NDEBUG) but that didn't make any difference. Any insight would be more than welcome.

dirk-thomas avatar Nov 19 '19 20:11 dirk-thomas

@DensoADAS

I confirmed that CycloneDDS does not have any message drops. I have tested more than 10 times with CycloneDDS; can you confirm, as a double check?

[Subscriber]
root@6cc2c62d0dd0:~/ros2_colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub listener 
[INFO] [listener_high_freq]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener_high_freq]: Received 100000 messages
[INFO] [listener_high_freq]: Lost 0 messages

[Publisher]
root@6cc2c62d0dd0:~/ros2_colcon_ws# export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub talker
[INFO] [talker_high_freq]: Sending messages...
[INFO] [talker_high_freq]: Sent 100000 messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)

@claireyywang @dirk-thomas It clearly depends on the DDS implementation, so we need some help from each DDS vendor. Do you want me to create an issue for each RMW implementation, such as rmw_fastrtps?

fujitatomoya avatar Nov 21 '19 03:11 fujitatomoya

@fujitatomoya please try to reproduce the behavior with the different build types of FastRTPS mentioned above before adding another dimension to the problem.

dirk-thomas avatar Nov 21 '19 03:11 dirk-thomas

@dirk-thomas

When building FastRTPS with the build type Release it drops a very significant number of messages (~60%-70%). When building Debug it surprisingly behaves much better - it drops none (or sometimes up to 10%).

Confirmed almost the same behavior.

fujitatomoya avatar Nov 21 '19 04:11 fujitatomoya

Confirmed almost the same behavior.

Thanks for checking.

Since the current ticket indicates a problem with FastRTPS I moved the ticket to the rmw_fastrtps repository.

It clearly depends on the DDS implementation, so we need some help from each DDS vendor. Do you want me to create an issue for each RMW implementation, such as rmw_fastrtps?

If there are similar problems with other RMW implementations, please open a separate ticket in the corresponding RMW repo - or, if it applies to all RMW implementations, in one of the general ROS 2 repos.

The behavior we are seeing here is pretty weird. When building FastRTPS with the build type Release it drops a very significant number of messages (~60%-70%). When building Debug it surprisingly behaves much better - it drops none (or sometimes up to 10%).

@richiprosima Can you please comment here that you have seen this ticket and that you will look into the problem. The ticket should have enough information to reproduce the problem. This seems to be a pretty severe performance problem with the default build type (Release) of FastRTPS.

dirk-thomas avatar Nov 21 '19 04:11 dirk-thomas

I have tried CycloneDDS and I don't get any drops either on a real machine (Release mode). In a VM there are some, though definitely far fewer than with the other DDS implementations (in about 30% of runs I see drops of 0-2% of the messages).

Debug mode also works better in comparison for all DDS implementations, but with FastRTPS about 50% of the sent messages still cannot be received (the runtime increases severely, though, which might mask the actual problem).

DensoADAS avatar Nov 21 '19 14:11 DensoADAS

I was able to do some quick digging. Our first thought was that it is a problem with the QoS configuration, in particular with the history depth and the asynchronous publishing thread. By default in ROS 2, FastRTPS and Connext use an asynchronous thread: the user's sample is stored in the history, and another thread is in charge of sending it rather than the user's thread sending it directly. Because the example writes so fast, the history may be getting full and removing samples before the async thread has sent them. I ran the test on my machine using localhost:

  • Default ROS2 configuration:
[INFO] [listener]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener]: Received 89098 messages
[INFO] [listener]: Lost 10902 messages
  • Changing QoS to use sync thread:
[INFO] [listener]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener]: Received 100000 messages
[INFO] [listener]: Lost 0 messages

FYI: we are working on supporting large data with the synchronous thread; then we can use it by default in ROS 2.
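For context, selecting the synchronous mode in a publisher XML profile looks roughly like this (a sketch based on the Fast-RTPS publisher profile schema, not the exact profile used for the numbers above):

<publisher profile_name="publisher_default" is_default_profile="true">
    <qos>
        <publishMode>
            <kind>SYNCHRONOUS</kind>
        </publishMode>
    </qos>
</publisher>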

richiware avatar Nov 21 '19 14:11 richiware

@richiware

The QoS for the sync thread you mentioned in the previous comment, that is a Fast-RTPS native setting, right? (not a ROS 2 QoS)

Could you tell us how to set the Fast-RTPS QoS to enable the sync thread? I will confirm it on my side as well.

fujitatomoya avatar Nov 21 '19 15:11 fujitatomoya

Write a DEFAULT_FASTRTPS_PROFILES.xml file in the directory from which you launch the executables, with the following XML content:

<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <profiles>
        <publisher profile_name="publisher_default"
        is_default_profile="true">
            <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
        </publisher>
    </profiles>
</dds>

When launching the talker, use this command:

RMW_FASTRTPS_USE_QOS_FROM_XML=1 ./build/high_freq_pub/talker

richiware avatar Nov 21 '19 15:11 richiware

@richiware While the custom config might work around the problem, I would like to find out why a release build performs so much worse than a debug build.

dirk-thomas avatar Nov 21 '19 15:11 dirk-thomas

@dirk-thomas

My guess: in the release build you are filling the buffers faster than in the debug build.

JaimeMartin avatar Nov 21 '19 16:11 JaimeMartin

My guess: in the release build you are filling the buffers faster than in the debug build.

@JaimeMartin I don't understand how this statement makes any sense. Both builds run the exact same application with a fixed frequency. How can the debug build keep up with the rate and not lose any packets while the release build drops 60-70%?

dirk-thomas avatar Nov 21 '19 16:11 dirk-thomas

@dirk-thomas Because perhaps in the debug build you are not really sending at that frequency, as it takes more time to call write. I have read they are using sleep periods of 1 us.

We will have more details soon.

JaimeMartin avatar Nov 21 '19 16:11 JaimeMartin

perhaps in the debug build you are not really sending at that frequency, as it takes more time to call write.

@JaimeMartin I added a measurement of the time it takes to do the publishing in DensoADAS/high_freq_pub_example#3. For me the output is the following (a sketch of the measurement follows the numbers), and it doesn't indicate that the release build is publishing data faster at all:

  • Debug build: [INFO] [talker]: Sent 100000 messages in 13,772,113,709 ns
  • Release build: [INFO] [talker]: Sent 100000 messages in 15,871,665,372 ns
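For reference, a minimal sketch of how such a timing can be taken (illustrative only; the actual change is in the linked PR):

```c++
// Record when publishing starts and report the elapsed wall-clock time once
// the last message has been handed to publish().
auto start = std::chrono::steady_clock::now();
// ... publish the 100000 messages ...
auto end = std::chrono::steady_clock::now();
auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
RCLCPP_INFO(
    this->get_logger(), "Sent 100000 messages in %lld ns",
    static_cast<long long>(ns.count()));
```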

dirk-thomas avatar Nov 21 '19 16:11 dirk-thomas

@dirk-thomas Thanks for the experiment. We are working on this now; we will update you as soon as we have more information.

JaimeMartin avatar Nov 21 '19 16:11 JaimeMartin

@dirk-thomas @JaimeMartin @richiware

Confirmed that no messages are dropped with the following QoS profile:

<?xml version="1.0" encoding="utf-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <profiles>
        <publisher profile_name="publisher_default"
        is_default_profile="true">
            <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
        </publisher>
    </profiles>
</dds>

root@6cc2c62d0dd0:~/ros2_colcon_ws# RMW_FASTRTPS_USE_QOS_FROM_XML=1 ros2 run high_freq_pub talker 
[INFO] [talker_high_freq]: Sending messages...
[INFO] [talker_high_freq]: Sent 100000 messages

root@6cc2c62d0dd0:~/ros2_colcon_ws# RMW_FASTRTPS_USE_QOS_FROM_XML=1 ros2 run high_freq_pub listener 
[INFO] [listener_high_freq]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener_high_freq]: Received 100000 messages
[INFO] [listener_high_freq]: Lost 0 messages

fujitatomoya avatar Nov 22 '19 06:11 fujitatomoya

Dear all,

Regarding this problem, I have a different opinion: I don't think the 'drop' behavior is a bug.

'Reliable' is a QoS for the DDS layer. In the specification (Data Distribution Service, v1.4, page 98), RELIABLE means: "Specifies the Service will attempt to deliver all samples in its history. Missed samples may be retried." Based on my understanding, it guarantees that samples still in the history are delivered successfully - that is, once a sample is sent, its reception is confirmed. It does not guarantee that every written sample remains in the history until it has been delivered (samples may be dropped from the history first).

As long as samples are added to the history faster than DDS can send them, 'drop' behavior must occur unless HISTORY is set to KEEP_ALL.

The talker/listener test sample uses a history depth of 1000 but sends 10000 data samples, and this is bound to cause the problem.

On the Fast-RTPS publisher side, the publisher posts data into the publisher history, and another thread sends it. If Fast-RTPS sends data over the transport more slowly than ROS 2 publishes it, data may be lost (removed) due to the history-depth limitation:

```c++
bool PublisherHistory::add_pub_change(
        CacheChange_t* change,
        WriteParams& wparams,
        std::unique_lock<RecursiveTimedMutex>& lock,
        std::chrono::time_point<std::chrono::steady_clock> max_blocking_time)
{
    if (m_isHistoryFull)
    {
        bool ret = false;

        if (m_historyQos.kind == KEEP_ALL_HISTORY_QOS)
        {
            ret = this->mp_writer->try_remove_change(max_blocking_time, lock);
        }
        else if (m_historyQos.kind == KEEP_LAST_HISTORY_QOS)
        {
            ret = this->remove_min_change();
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  <== deletes published data that might not have been sent yet
        }
```

On the Fast-RTPS subscriber side, the subscriber puts received data into the subscriber history, and ROS 2 uses another thread to call the subscription callback. If the ROS 2 subscriber callback (ROS 2 taking data out of Fast-RTPS) is slower than the RTPS data reception, data may also be lost:

```c++
bool SubscriberHistory::received_change_keep_last_no_key(
        CacheChange_t* a_change,
        size_t /* unknown_missing_changes_up_to */)
{
    bool add = false;
    if (m_changes.size() < static_cast<size_t>(m_historyQos.depth))
    {
        add = true;
    }
    else
    {
        // Try to substitute the oldest sample.
        // As the history should be ordered following the presentation QoS,
        // we can always remove the first one.
        add = remove_change_sub(m_changes.at(0));
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  <== deletes subscriber data that ROS 2 might not have taken yet
    }

    if (add)
    {
        return add_received_change(a_change);
    }

    return false;
}
```

NOTE: Is this behavior actually a problem?

Using rclcpp::KeepAll() or rclcpp::KeepLast(100000) instead of rclcpp::KeepLast(1000) works well (a QoS sketch follows the output below):

root@bdk:~/workspace# ros2 run high_freq_pub listener
[INFO] [listener]: Ready to receive messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
[INFO] [listener]: Received 100000 messages
[INFO] [listener]: Lost 0 messages

root@bdk:~/workspace# ros2 run high_freq_pub talker
[INFO] [talker]: Sending messages...
[INFO] [talker]: Sent 100000 messages
^C[INFO] [rclcpp]: signal_handler(signal_value=2)
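In rclcpp terms that change is a one-liner against the example code above (a sketch, not the exact diff used for this test):

```c++
// KEEP_ALL: the history keeps every not-yet-acknowledged sample instead of
// overwriting the oldest one once the depth (here 1000) is reached.
rclcpp::QoS qos = rclcpp::QoS(rclcpp::KeepAll()).reliable();
pub_ = this->create_publisher<std_msgs::msg::Int32>("chatter", qos);
```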

Barry-Xu-2018 avatar Nov 22 '19 08:11 Barry-Xu-2018

Thank you very much @Barry-Xu-2018 for your thorough explanation.

I would like to add what happens when the asynchronous writing thread is not used: the thread adding the sample is blocked until the RTPS datagram is sent to the network, so each sample is added only after the previous one has been physically put into the socket output buffer, and then the publisher doesn't drop any messages.

MiguelCompany avatar Nov 22 '19 10:11 MiguelCompany

While all of this is correct, it doesn't explain why it works in Debug mode but fails in Release mode. That is the pending question to answer.

dirk-thomas avatar Nov 22 '19 14:11 dirk-thomas

@Barry-Xu-2018 @MiguelCompany

Using rclcpp::KeepAll() or rclcpp::KeepLast(100000) instead of rclcpp::KeepLast(1000) works well

Are you sure about this? This is actually what I tried before; I get the following error from rmw_fastrtps right after publishing starts:

root@6cc2c62d0dd0:~/ros2_colcon_ws# ros2 run high_freq_pub talker 
[INFO] [talker_high_freq]: Sending messages...
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what():  failed to publish message: cannot publish data, at /root/ros2_colcon_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_publish.cpp:52, at /root/ros2_colcon_ws/src/ros2/rcl/rcl/src/rcl/publisher.c:290

I did not have time to debug the Fast-RTPS implementation, but it seems the new change does not get added via create_new_change().

@DensoADAS

Could you please confirm whether you get the same runtime error with the following? https://github.com/DensoADAS/high_freq_pub_example/pull/4

fujitatomoya avatar Nov 25 '19 07:11 fujitatomoya

@fujitatomoya I can confirm the exception, which probably happens because the resource limit is exceeded. When catching the exception and resending the message (see this branch for an example), all messages are received by the subscriber.
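A minimal sketch of such a catch-and-resend loop (illustrative only; the linked branch contains the actual workaround):

```c++
// Illustrative retry loop: if publish() throws because the history/resource
// limit is exhausted, back off briefly and try the same message again.
std_msgs::msg::Int32 msg;
msg.data = count_++;
bool sent = false;
while (!sent) {
    try {
        pub_->publish(msg);
        sent = true;
    } catch (const rclcpp::exceptions::RCLError & e) {
        std::this_thread::sleep_for(std::chrono::microseconds(100));  // needs <thread>
    }
}
```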

The runtime with FastRTPS is quite high, though, in comparison to the other vendors:

Time until all messages (100000) are published
    FastRTPS: ~20min
    OpenSplice: ~14s
    CycloneDDS: ~13s

Also, the workaround using the XML file for FastRTPS does not really change anything for me. The only difference is that I get output like the following:

2019-11-25 09:47:58.770 [RTPS_HISTORY Error] Change payload size of '96' bytes is larger than the history payload size of '35' bytes and cannot be resized. -> Function add_change

DensoADAS avatar Nov 25 '19 09:11 DensoADAS