CPU overloads during the PDP phase with multiple Domain Participants using Simple Discovery
Is there an already existing issue for this?
- [X] I have searched the existing issues
Expected behavior
CPU consumption should not be significantly affected by the number of spawned domain participants.
Current behavior
The CPU becomes overloaded when spawning many domain participants (170 in my test).
Steps to reproduce
I'm spawning several domain participants in different threads, using Simple Discovery as the discovery mechanism. I'm using the following code (spawning 170 domain participants in this case):
```cpp
#include <chrono>
#include <stdexcept>
#include <stop_token>
#include <string>
#include <thread>
#include <vector>

#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/domain/DomainParticipantListener.hpp>
#include <fastdds/rtps/transport/UDPv4TransportDescriptor.hpp>

eprosima::fastdds::dds::DomainParticipant* create_participant(const std::string& name)
{
    // Configure participant QoS
    eprosima::fastdds::dds::DomainParticipantQos participant_qos;

    // Use Simple Discovery
    participant_qos.wire_protocol().builtin.discovery_config.discoveryProtocol =
            eprosima::fastdds::rtps::DiscoveryProtocol::SIMPLE;

    // Configure discovery settings
    participant_qos.wire_protocol().builtin.discovery_config.leaseDuration =
            eprosima::fastdds::dds::Duration_t(3, 1);
    participant_qos.wire_protocol().builtin.discovery_config.leaseDuration_announcementperiod =
            eprosima::fastdds::dds::Duration_t(1, 2);

    // Increase the number of port mutation tries, so that many participants
    // can find a free port on the same host (default is 100u)
    participant_qos.wire_protocol().builtin.mutation_tries = 250u;

    // Set participant name
    participant_qos.name(name);

    // Use only the UDPv4 transport
    auto udp_transport = std::make_shared<eprosima::fastdds::rtps::UDPv4TransportDescriptor>();
    participant_qos.transport().user_transports.push_back(udp_transport);
    participant_qos.transport().use_builtin_transports = false;

    // Create the participant
    eprosima::fastdds::dds::DomainParticipant* participant =
            eprosima::fastdds::dds::DomainParticipantFactory::get_instance()->create_participant(
                0,
                participant_qos,
                nullptr,
                eprosima::fastdds::dds::StatusMask::none());
    if (!participant)
    {
        throw std::runtime_error("Error: could not create participant");
    }
    return participant;
}

void ddsparticipant_thread(std::stop_token st, const std::string name)
{
    // Create the domain participant and keep it alive until stop is requested
    eprosima::fastdds::dds::DomainParticipant* participant = create_participant(name);
    while (!st.stop_requested())
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(2000));
    }

    // Clean up the participant on shutdown
    eprosima::fastdds::dds::DomainParticipantFactory::get_instance()->delete_participant(participant);
}

int main()
{
    // Number of participants to spawn
    const int num_participants = 170;

    // Spawn one participant per thread
    std::vector<std::jthread> threads;
    for (int i = 0; i < num_participants; ++i)
    {
        threads.emplace_back(ddsparticipant_thread, "participant_" + std::to_string(i));
    }

    while (true)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
    return 0;
}
```
Below is a screenshot of the CPU consumption when spawning 70 and 170 domain participants.
As a workaround, I'm already using the Discovery Server mechanism. However, some of the available tools for Fast DDS, such as DDS-Record-Replay or Fast-DDS-spy, do not support Discovery Server. More generally, I was surprised to see this CPU overload, so I would like to better understand why it is happening.
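For reference, this is roughly how I switch a participant to Discovery Server client mode, if I recall the 3.x API correctly (the server address and port below are placeholders for our actual server):

```cpp
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/rtps/common/Locator.hpp>
#include <fastdds/utils/IPLocator.hpp>

eprosima::fastdds::dds::DomainParticipantQos client_qos()
{
    eprosima::fastdds::dds::DomainParticipantQos qos;

    // Act as a Discovery Server CLIENT instead of using Simple Discovery
    qos.wire_protocol().builtin.discovery_config.discoveryProtocol =
            eprosima::fastdds::rtps::DiscoveryProtocol::CLIENT;

    // Locator of the Discovery Server (placeholder address and port)
    eprosima::fastdds::rtps::Locator_t server_locator;
    eprosima::fastdds::rtps::IPLocator::setIPv4(server_locator, "127.0.0.1");
    server_locator.port = 11811;
    qos.wire_protocol().builtin.discovery_config.m_DiscoveryServers.push_back(server_locator);

    return qos;
}
```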
Fast DDS version/commit
v3.1.0
Platform/Architecture
Other. Please specify in Additional context section.
Transport layer
UDPv4
Additional context
The test is executed inside a docker image with Ubuntu Jammy Jellyfish 22.04 amd64.
The CPU is a 13th Gen Intel i7-13700H; more details are reported below (from the `lscpu` command).
XML configuration file
No response
Relevant log output
No response
Network traffic capture
No response
Hi @MMarcus95 ,
Thank you for reporting the issue. The behavior you are describing is already known, and we are actively working on a solution. The problem is due to an excess of discovery messages sent across the entire network (including to participants that are already matched) whenever a new participant spawns. This leads to an exponential increase in CPU usage with each additional participant.
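For context, a large part of that traffic is the burst of initial PDP announcements that every new participant multicasts on startup. As a rough illustration (example values, and only a partial mitigation, not the actual fix), that burst can be reduced and spaced out via QoS:

```cpp
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

eprosima::fastdds::dds::DomainParticipantQos tuned_qos()
{
    eprosima::fastdds::dds::DomainParticipantQos qos;
    auto& discovery = qos.wire_protocol().builtin.discovery_config;

    // Send fewer initial PDP announcements (example value; default is 5)...
    discovery.initial_announcements.count = 3u;
    // ...and space them further apart (500 ms here; default is 100 ms)
    discovery.initial_announcements.period = eprosima::fastdds::dds::Duration_t(0, 500000000);

    return qos;
}
```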
We are currently testing the fix to address this inefficiency, and once it has been successfully validated it will be included in the next release.
In the meantime, using the Discovery Server mechanism is an effective workaround, as it significantly reduces the amount of discovery traffic. Additionally, as a general recommendation, try to minimize the number of participants, as each participant inherently consumes resources due to its associated threads (see the thread listing table in the Fast DDS documentation). Is there any particular reason for having this many participants in your setup?
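For instance, several endpoints can share a single participant instead of each getting its own. A rough sketch (topic and type names are placeholders, and type registration is omitted):

```cpp
#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/publisher/DataWriter.hpp>
#include <fastdds/dds/publisher/Publisher.hpp>
#include <fastdds/dds/topic/Topic.hpp>

using namespace eprosima::fastdds::dds;

void shared_participant_example()
{
    // One participant per process pays the discovery and thread cost only once...
    DomainParticipant* participant =
            DomainParticipantFactory::get_instance()->create_participant(0, PARTICIPANT_QOS_DEFAULT);
    Publisher* publisher = participant->create_publisher(PUBLISHER_QOS_DEFAULT);

    // ...while hosting as many topics and writers as needed.
    // NOTE: each type must be registered on the participant beforehand,
    // and all returned pointers should be checked against nullptr.
    Topic* state_topic = participant->create_topic("robot_state", "StateType", TOPIC_QOS_DEFAULT);
    Topic* cmd_topic = participant->create_topic("robot_cmd", "CmdType", TOPIC_QOS_DEFAULT);
    DataWriter* state_writer = publisher->create_datawriter(state_topic, DATAWRITER_QOS_DEFAULT);
    DataWriter* cmd_writer = publisher->create_datawriter(cmd_topic, DATAWRITER_QOS_DEFAULT);
    static_cast<void>(state_writer);
    static_cast<void>(cmd_writer);
}
```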
Thank you for your patience and stay tuned for the upcoming release.
Hi @EugenioCollado,
thank you for the detailed explanation. Here at the Dynamic Legged Systems (DLS) lab we are developing a distributed, modular software framework for controlling robots (see A Practical Real-Time Distributed Software Framework for Mobile Robots).
In the current implementation we use several domain participants, each addressing a different aspect of the framework's behavior. Despite the domain separation, the CPU overload issue arose, and we switched to the Discovery Server. However, we would like to move to a more distributed approach using Simple Discovery. We could try to reduce the number of domain participants in the future, but the CPU overload could still arise when using multiple robots.
That said, I'm really eager to see this fixed in the next releases!
Thanks for all your effort.
Thank you for sharing details about your project at the DLS lab; it sounds fascinating! If you have any further needs, questions, or would like to be informed when the fix is ready, please don’t hesitate to reach out via email to [email protected].
Thank you for your valuable input, and we wish you continued success with your framework development!
@MMarcus95 This might have been fixed by #5604, which has already been backported to 3.1.x
@MiguelCompany Hi, currently I am encountering the same problem on the 2.6.10 branch. Do you have any plans to fix this problem on the 2.6.x branch? Looking forward to your reply, thank you!
Hi @slp12138, 2.6 is a maintenance branch, meaning it only receives patches for critical issues and security fixes, so this will not be backported. I recommend using the 2.14 branch, or contacting commercial support for tailored assistance.
Closing this issue as the fix has already been addressed in https://github.com/eProsima/Fast-DDS/pull/5604.