
When these processes start at the same time, many dropped packets are generated on the 127.0.0.1 network

TechVortexZ opened this issue 10 months ago · 8 comments

Is there an already existing issue for this?

  • [X] I have searched the existing issues

Expected behavior

  1. There are 20 processes and a total of 130 topics running on the same machine.
  2. QoS: both UDP and SHM are enabled, with udp_transport->interfaceWhiteList.push_back("127.0.0.1");. This means that discovery traffic uses 127.0.0.1 for UDP communication and user data uses SHM communication (see the sketch after this list).
  3. When these processes start at the same time, we expect no packet loss on 127.0.0.1, as observed with ifconfig lo.
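
For reference, a minimal sketch of the configuration described above (assuming the Fast DDS 2.x C++ API; the descriptor setup mirrors the reference code later in this thread):

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/transport/UDPv4TransportDescriptor.h>
#include <fastdds/rtps/transport/shared_mem/SharedMemTransportDescriptor.h>

using namespace eprosima::fastdds::dds;
using namespace eprosima::fastdds::rtps;

DomainParticipantQos participant_qos;
participant_qos.transport().use_builtin_transports = false;

// SHM transport carries user data between processes on the same host
auto shm_transport = std::make_shared<SharedMemTransportDescriptor>();
participant_qos.transport().user_transports.push_back(shm_transport);

// UDP transport restricted to loopback carries discovery traffic
auto udp_transport = std::make_shared<UDPv4TransportDescriptor>();
udp_transport->interfaceWhiteList.push_back("127.0.0.1");
participant_qos.transport().user_transports.push_back(udp_transport);

The QoS object is then passed to DomainParticipantFactory::get_instance()->create_participant().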

Current behavior

When these processes start at the same time, there is heavy packet loss on 127.0.0.1, as seen with ifconfig lo. [screenshots omitted]

We have tried many ways, but nothing has worked:

  1. Increase the buffer sizes of the network adapters:

     sudo sysctl -w net.core.wmem_max=209715200   # 200 MB
     sudo sysctl -w net.core.rmem_max=209715200   # 200 MB

  2. Increase the socket buffer sizes in the QoS (see the sketch after this list):

     "send_socket_buffer_size": 209715200,    // 200 MB
     "listen_socket_buffer_size": 209715200

  3. Increase the txqueuelen length:

     ip link set txqueuelen 10000 dev lo
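
For context, the socket buffer values in item 2 map onto the transport QoS roughly like this (a sketch, assuming the Fast DDS 2.x C++ API):

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

using namespace eprosima::fastdds::dds;

DomainParticipantQos participant_qos;
// 200 MB send/receive socket buffers, matching the values tried above
participant_qos.transport().send_socket_buffer_size = 209715200;
participant_qos.transport().listen_socket_buffer_size = 209715200;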

Can you help me solve this problem?

Steps to reproduce

above

Fast DDS version/commit

v2.12.0

Platform/Architecture

Ubuntu Focal 20.04 arm64

Transport layer

Default configuration, UDPv4 & SHM

Additional context

No response

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response

TechVortexZ commented Apr 08 '24

Hi @TechVortexZ, thanks for using Fast DDS. Consider that 20 processes and 130 topics are enough to make the network very busy, so the loss may be related to this. If the loss occurs mostly in the discovery phase, you can try changing the initial announcement period: decreasing it will allow participants to be discovered more quickly, while increasing it will reduce the frequency of metatraffic packets, leading to a less busy network. Please let us know if you get better performance with one of these solutions. Also, please note that version 2.12.x is end of life, so you may want to consider upgrading to our latest version, 2.14.x.
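
For reference, the initial announcement settings live under the wire protocol QoS (a sketch assuming the Fast DDS 2.x C++ API; count and period are example values):

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

using namespace eprosima::fastdds::dds;

DomainParticipantQos participant_qos;
auto& announcements =
        participant_qos.wire_protocol().builtin.discovery_config.initial_announcements;
announcements.count = 5;                                              // number of initial announcements
announcements.period = eprosima::fastrtps::Duration_t(0, 100000000);  // 100 ms between them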

elianalf commented Apr 08 '24


Hi @elianalf, we decreased the initial announcement period ("initial_announce_count": 5, "initial_announce_period": 100ms), but there are still lost packets.

When we set "avoid_builtin_multicast": false, there are no lost packets. Can you tell me the function of this parameter and why it solves this problem?

TechVortexZ commented Apr 09 '24

However, I noticed that the PDP message interval is not 100ms at startup, even though I set "initial_announce_period": 100ms. Why is this? [screenshot omitted]

TechVortexZ commented Apr 09 '24

Hi,

When we set "avoid_builtin_multicast": false, there are no lost packets. Can you tell me the function of this parameter and why it solves this problem?

The avoid_builtin_multicast=false setting enables the use of multicast also during the Endpoint Discovery Phase (EDP). It reduces the number of packets sent during EDP because each multicast datagram reaches all participants at once, thereby reducing the traffic. You could also try re-enabling it with avoid_builtin_multicast=true and setting the TTL parameter in UDPv4TransportDescriptor to 0. This way you can be sure that your traffic stays local. To do that, you will also need to set use_builtin_transports=false and add a SharedMemTransportDescriptor and a UDPv4TransportDescriptor to the user transports:

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/transport/UDPv4TransportDescriptor.h>
#include <fastdds/rtps/transport/shared_mem/SharedMemTransportDescriptor.h>
using namespace eprosima::fastdds::dds;
using namespace eprosima::fastdds::rtps;
DomainParticipantQos participant_qos;
participant_qos.wire_protocol().builtin.avoid_builtin_multicast = true;
participant_qos.transport().use_builtin_transports = false;
auto shm_transport = std::make_shared<SharedMemTransportDescriptor>();
participant_qos.transport().user_transports.push_back(shm_transport);
auto udp_transport = std::make_shared<UDPv4TransportDescriptor>();
udp_transport->TTL = 0;  // keep all UDP traffic on the local host
participant_qos.transport().user_transports.push_back(udp_transport);

However, I noticed that the PDP message interval is not 100ms at startup, even though I set "initial_announce_period": 100ms. Why is this?

I would need more information about the screenshot. From the information I have, I can tell you that initial_announce_period sets the period for each individual participant; perhaps the timestamps you are looking at come from different participants, which is why the difference is not 100ms.

elianalf commented Apr 09 '24

Hi @elianalf, thanks for your reply. I set avoid_builtin_multicast=true and udp_transport->TTL = 0;, and also enabled UDP and SHM as in the reference code you provided, but there are still lost packets.

I would need more information about the screenshot. From the information I have, I can tell you that initial_announce_period set the specific period for each participant, maybe the timestamps that you are looking at are from different participants, so the difference is not 100ms.

Here are more screenshots to illustrate the PDP messages sent by the same participant.

[screenshots omitted]

TechVortexZ commented Apr 10 '24

Hi,

I set avoid_builtin_multicast=true and udp_transport->TTL = 0;, and also enabled UDP and SHM as in the reference code you provided, but there are still lost packets.

If your application only needs to work on localhost and you obtain better performance with avoid_builtin_multicast=false, then that is a possible solution. The variable is set to true by default because disabling multicast during EDP can be safer on big networks.
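
A minimal sketch of that setting (assuming the Fast DDS 2.x C++ API):

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

using namespace eprosima::fastdds::dds;

DomainParticipantQos participant_qos;
// false: allow multicast during EDP as well, cutting down unicast metatraffic
participant_qos.wire_protocol().builtin.avoid_builtin_multicast = false;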

Here are more screenshots to illustrate the PDP messages sent by the same participant.

Not all of these packets are initial announcements. Each participant sends an initial announcement every initial_announce_period, but every time it discovers a participant it starts sending Data(p) packets to each multicast locator and to all known participants' unicast locators. So between two initial announcements there may be many other Data(p) packets. That is why the frequency of the packets you highlight is higher.

elianalf commented Apr 11 '24

but every time it discovers a participant it starts sending Data(p) packets to each multicast locator and to all known participants' unicast locators. So between two initial announcements there may be many other Data(p) packets. That is why the frequency of the packets you highlight is higher.

Hi @elianalf, thanks for your reply. Your answer above is right.

I want to ask one last question. I found an article on the Fast DDS website: https://www.eprosima.com/index.php/resources-all/scalability/fast-rtps-discovery-mechanisms-analysis. One of its conclusions is that the SDP (Simple Discovery Protocol) causes network congestion:

Because of all the previous, it is concluded that the SDP produces network congestion in those cases where a high number of participants are involved in the communication. This leads to a higher packet loss and therefore to a reduction of the overall performance. The protocol implementation is open to optimizations, such as eliminating the duplicate announcements when new participants are discovered (which could lead to a PDP traffic reduction of around 28%), or limiting the announcement reply to a discovered participant to just that new participant (which could cut another 25% of the traffic in the testing scenarios).

It says that the Fast DDS implementation is open to optimizations that reduce duplicate announcements. What are these optimizations?

TechVortexZ commented Apr 15 '24

Hi, the article refers to the Discovery Server mechanism. For any other information, I would recommend referring to the documentation rather than the website, as it is more detailed and constantly updated.

elianalf commented Apr 15 '24