
CPU overloads during the PDP phase with multiple Domain Participants using Simple Discovery

Open MMarcus95 opened this issue 1 year ago • 4 comments

Is there an already existing issue for this?

  • [X] I have searched the existing issues

Expected behavior

CPU consumption should not be significantly affected by the number of spawned domain participants.

Current behavior

A CPU overload happens when spawning several domain participants.

Steps to reproduce

I'm spawning several domain participants in different threads, using Simple Discovery as the discovery mechanism, with the following code (170 domain participants in this case):

#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/domain/DomainParticipantListener.hpp>
#include <fastdds/rtps/transport/UDPv4TransportDescriptor.hpp>

#include <chrono>
#include <stdexcept>
#include <stop_token>
#include <string>
#include <thread>
#include <vector>


eprosima::fastdds::dds::DomainParticipant* create_participant(const std::string& name){
    // Configure participant QoS
    eprosima::fastdds::dds::DomainParticipantQos participant_qos;
    // Use simple discovery
    participant_qos.wire_protocol().builtin.discovery_config.discoveryProtocol = eprosima::fastdds::rtps::DiscoveryProtocol::SIMPLE;
    // Configure discovery settings
    participant_qos.wire_protocol().builtin.discovery_config.leaseDuration = eprosima::fastdds::dds::Duration_t(3, 1);
    participant_qos.wire_protocol().builtin.discovery_config.leaseDuration_announcementperiod = eprosima::fastdds::dds::Duration_t(1, 2);
    // Increase the number of port-mutation attempts so that many participants
    // can still find a free port on the same host (default is 100u)
    participant_qos.wire_protocol().builtin.mutation_tries = 250u;
    // Set participant name
    participant_qos.name(name);
    // Use only UDPv4 transport
    auto udp_transport = std::make_shared<eprosima::fastdds::rtps::UDPv4TransportDescriptor>();
    participant_qos.transport().user_transports.push_back(udp_transport);
    participant_qos.transport().use_builtin_transports = false;
    // Create the participant
    eprosima::fastdds::dds::DomainParticipant *participant = eprosima::fastdds::dds::DomainParticipantFactory::get_instance()->create_participant(
        0,
        participant_qos,
        nullptr,
        eprosima::fastdds::dds::StatusMask::none()
    );
    if (!participant)
        throw std::runtime_error("Error: could not create participant");

    return participant;
}

void ddsparticipant_thread(std::stop_token st, const std::string name)
{
    // Create domain participant
    eprosima::fastdds::dds::DomainParticipant* participant = create_participant(name);

    while (!st.stop_requested())
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(2000));
    }

    // Clean up the participant when the thread is asked to stop
    eprosima::fastdds::dds::DomainParticipantFactory::get_instance()->delete_participant(participant);
}

int main()
{
    // Number of participants to spawn
    const int num_participants = 170;

    // Spawn participants
    std::vector<std::jthread> threads;
    for (int i = 0; i < num_participants; ++i)
    {
        threads.push_back(std::jthread(ddsparticipant_thread, "participant_" + std::to_string(i)));
    }

    while (true)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }

    return 0;
}

Here is a screenshot of the CPU consumption when spawning 70 and 170 domain participants: cpu_70_170dp

As a workaround, I'm already using the Discovery Server mechanism. However, some of the available tools for Fast DDS, such as DDS-Record-Replay or Fast-DDS-spy, do not support Discovery Server. More generally, I was surprised to see this CPU overload, so I would like to better understand why it is happening.

Fast DDS version/commit

v3.1.0

Platform/Architecture

Other. Please specify in Additional context section.

Transport layer

UDPv4

Additional context

The test is executed inside a docker image with Ubuntu Jammy Jellyfish 22.04 amd64.

The CPU is an Intel 13th Gen i7-13700H; more details follow (from the `lscpu` command): Screenshot from 2024-12-19 14-26-07

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response

MMarcus95 avatar Dec 19 '24 18:12 MMarcus95

Hi @MMarcus95 ,

Thank you for reporting the issue. The behavior you are describing is already known, and we are actively working on a solution. The problem is due to an excess of discovery messages sent across the entire network (including to participants that are already matched) whenever a new participant spawns. This leads to an exponential increase in CPU usage with each additional participant.

We are currently testing the fix to address this inefficiency, and once it has been successfully validated it will be included in the next release.

In the meantime, using a mechanism such as Discovery Server is an effective workaround, as it significantly reduces the amount of discovery traffic. Additionally, as a general recommendation, try to minimize the number of participants, as each participant inherently consumes resources due to its associated threads (check out the following table). Is there any particular reason for having this many participants in your setup?
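For reference, a Discovery Server client can also be configured via an XML profile instead of code. The sketch below is an assumption based on the Fast DDS 3.x XML schema; element names and the server address/port are placeholders to verify against the documentation for your exact version:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <profiles>
        <participant profile_name="discovery_server_client" is_default_profile="true">
            <rtps>
                <builtin>
                    <discovery_config>
                        <!-- Act as a Discovery Server client -->
                        <discoveryProtocol>CLIENT</discoveryProtocol>
                        <!-- Address and port of the (placeholder) server -->
                        <discoveryServersList>
                            <locator>
                                <udpv4>
                                    <address>127.0.0.1</address>
                                    <port>11811</port>
                                </udpv4>
                            </locator>
                        </discoveryServersList>
                    </discovery_config>
                </builtin>
            </rtps>
        </participant>
    </profiles>
</dds>
```

Alternatively, exporting the `ROS_DISCOVERY_SERVER` environment variable (e.g. `ROS_DISCOVERY_SERVER=127.0.0.1:11811`), which Fast DDS honors even outside ROS 2, switches participants to Discovery Server client mode without touching code or XML.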

Thank you for your patience and stay tuned for the upcoming release.

EugenioCollado avatar Dec 23 '24 13:12 EugenioCollado

Hi @EugenioCollado,

thank you for the detailed explanation. Here at the Dynamic Legged Systems (DLS) lab we are developing a distributed, modular software framework for controlling robots (see A Practical Real-Time Distributed Software Framework for Mobile Robots).

In the current implementation we need several domain participants, each addressing a different aspect of the framework's behavior. Despite the domain separation, the CPU overload issue arose, so we switched to Discovery Server. However, we would prefer a more distributed approach using Simple Discovery. We could try to reduce the number of domain participants in the future; even so, the CPU overload could still arise when using multiple robots.

That said, I'm really eager to see this fixed in the next releases!

Thanks for all your effort.

MMarcus95 avatar Jan 10 '25 15:01 MMarcus95

Thank you for sharing details about your project at the DLS lab; it sounds fascinating! If you have any further needs, questions, or would like to be informed when the fix is ready, please don’t hesitate to reach out via email to [email protected].

Thank you for your valuable input, and we wish you continued success with your framework development!

EugenioCollado avatar Jan 13 '25 07:01 EugenioCollado

@MMarcus95 This might have been fixed by #5604, which has already been backported to 3.1.x

MiguelCompany avatar Mar 07 '25 07:03 MiguelCompany

@MiguelCompany Hi, currently I am encountering the same problem on the 2.6.10 branch. Do you have any plans to fix this problem on the 2.6.x branch? Looking forward to your reply, thank you!

slp12138 avatar Aug 08 '25 10:08 slp12138

Hi @slp12138, 2.6 is a maintenance branch, meaning it only receives patches for critical issues and security fixes, so this will not be backported. I recommend using the 2.14 branch or contacting commercial support for tailored help.

cferreiragonz avatar Aug 18 '25 05:08 cferreiragonz

Closing this issue as the fix has already been addressed in https://github.com/eProsima/Fast-DDS/pull/5604.

rsanchez15 avatar Sep 22 '25 05:09 rsanchez15