
ROS2 CLI tests failing when run against rmw_fastrtps_cpp

Open hidmic opened this issue 4 years ago • 4 comments

Bug report

Required Info:

  • Operating System:
    • Ubuntu Focal 20.04 amd64
  • Installation type:
    • From source
  • Version or commit hash:
    • Current master
  • DDS implementation:
    • Fast-DDS
  • Client library (if applicable):
    • rclpy

Steps to reproduce issue

I have not been able to reproduce this issue locally, but it has been happening since late last week (https://ci.ros2.org/view/nightly/job/nightly_linux_repeated/2279).

Expected behavior

CLI tests pass.

Actual behavior

CLI tests fail with something like:

[RTPS_TRANSPORT_SHM Error] Failed to create segment 56b71f66bffcdbb2: boost::interprocess_exception::library_error -> Function compute_per_allocation_extra_size
[RTPS_MSG_OUT Error] boost::interprocess_exception::library_error -> Function init

Additional information

Logs are coming from within Fast-DDS.

hidmic avatar May 10 '21 17:05 hidmic

@EduPonz @MiguelCompany I'd really value your input here. The problem isn't obvious to me.

hidmic avatar May 10 '21 17:05 hidmic

@hidmic I've been analyzing this.

What is happening? Since CI is killing some participants in an uncontrolled way, the shared memory space is most likely being exhausted. Fast DDS logs the error but keeps working, though in that case communication will only go through UDP.

The errors are logged in case the user has configured the participant to only use the SHM transport. If that is the case and there is no SHM space available, the participant creation will fail.

What can be done? We suggest running fastdds shm clean before each test execution, which will remove any unused shared memory files. I don't know if that is possible for every test, but for the CLI ones it could be done from test_cli.py.
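A minimal sketch of how that suggestion could be wired into a Python test setup. Only the `fastdds shm clean` invocation comes from the comment above; the helper name and its placement are illustrative, not the actual test_cli.py code:

```python
import shutil
import subprocess

def clean_shm_segments():
    """Remove unused Fast DDS shared memory files before a test run.

    Runs `fastdds shm clean` (the cleanup suggested above) and returns
    True on success, or False if the `fastdds` CLI is not on PATH or
    the command failed. The helper name is hypothetical.
    """
    if shutil.which("fastdds") is None:
        return False
    result = subprocess.run(["fastdds", "shm", "clean"],
                            capture_output=True, text=True)
    return result.returncode == 0
```

In a pytest-based suite this could be called from a fixture or a setup hook so every test starts from a clean shared memory state.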

I haven't studied the test thoroughly, but if it only checks the contents of stdout, another option would be to make Fast DDS output the logs to stderr. That can be achieved with the following XML configuration:

<?xml version="1.0" encoding="UTF-8" ?>
<dds>
    <log>
        <!--
        Clear consumers
        -->
        <use_default>FALSE</use_default>

        <!--
        StdoutErrConsumer will output Warning and Error messages to stderr by default
        -->
        <consumer>
            <class>StdoutErrConsumer</class>
        </consumer>
    </log>
</dds>
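For the XML above to take effect, Fast DDS has to be told where the profiles file lives, which is typically done through the FASTRTPS_DEFAULT_PROFILES_FILE environment variable. A hedged sketch of passing that to a test process from Python (the file path is illustrative):

```python
import os

def fastdds_test_env(profiles_xml_path):
    """Build an environment dict that points Fast DDS at a profiles file.

    FASTRTPS_DEFAULT_PROFILES_FILE is read by Fast DDS when creating a
    participant; the caller passes this dict as the `env` argument of
    subprocess.Popen. The path argument here is illustrative.
    """
    env = dict(os.environ)
    env["FASTRTPS_DEFAULT_PROFILES_FILE"] = profiles_xml_path
    return env
```

With this in place, the error messages would land on stderr and would no longer pollute the stdout the CLI tests are asserting on.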

As a last resort, we could change the code on Fast DDS to lower those messages from error to warning (the latter is not reported by default), but we'd prefer to leave them as they are.

MiguelCompany avatar May 11 '21 06:05 MiguelCompany

@MiguelCompany thank you for the thorough analysis and investigation, much appreciated. I agree it doesn't make sense to change Fast-DDS, but before changing tests I'd like to understand why:

CI is killing some participants in an uncontrolled way

Most test processes exit normally and these test failures only show in repeated CI jobs, which makes me wonder if this isn't bad resource management somewhere in the stack. Can you confirm that destroy_participant() does in fact take out the participant? If that's a given, then we might be facing an issue up the stack. In that case, is there any tool or procedure you'd recommend to track down undead participants?

hidmic avatar May 11 '21 17:05 hidmic

Can you confirm that destroy_participant() does in fact take out the participant?

Yes, if these lines don't show an error, the participant's resources should have been cleaned up correctly.

To check for an undead participant, you could ls /dev/shm/fastrtps*. I have seen that, sometimes, the ROS 2 daemon stays alive for a while and is then killed with SIGKILL, so destroy_participant is never called at all.
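The `ls /dev/shm/fastrtps*` check could also be done from the test harness itself, for example to fail loudly when a previous test leaked segments. A small sketch (the function name is hypothetical; only the `/dev/shm/fastrtps*` pattern comes from the comment above):

```python
import glob

def leftover_fastrtps_segments():
    """List shared memory files left behind by Fast DDS participants.

    Equivalent to `ls /dev/shm/fastrtps*`: an empty list means every
    participant cleaned up after itself; a non-empty list points at an
    undead participant (or one that was SIGKILLed mid-flight).
    """
    return sorted(glob.glob("/dev/shm/fastrtps*"))
```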

A similar problem arises with test timeouts, where ctest may also SIGKILL the process.
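The failure mode described here, that SIGKILL skips all process-level cleanup, is easy to demonstrate in isolation. The sketch below is not DDS code: it stands in for `destroy_participant()` with a Python `atexit` handler, since both are ordinary user-space cleanup that an uncatchable SIGKILL bypasses:

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

def sigkill_skips_cleanup():
    """Show that SIGKILL bypasses registered cleanup handlers.

    A child process registers an atexit handler that writes a marker
    file, then sleeps; we SIGKILL it before it exits. Because SIGKILL
    cannot be caught, the handler never runs -- analogous to
    destroy_participant() never running in a SIGKILLed DDS process.
    Returns True when the marker file is missing (cleanup was skipped).
    """
    marker = os.path.join(tempfile.mkdtemp(), "cleaned")
    child = subprocess.Popen([
        sys.executable, "-c",
        "import atexit, sys, time\n"
        "atexit.register(lambda: open(sys.argv[1], 'w').write('cleaned'))\n"
        "time.sleep(30)\n",
        marker,
    ])
    time.sleep(1)  # give the child time to register its handler
    child.send_signal(signal.SIGKILL)  # uncatchable: atexit never fires
    child.wait()
    return not os.path.exists(marker)
```

This is why the leaked `/dev/shm` segments accumulate: nothing inside the killed process gets a chance to remove them.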

MiguelCompany avatar May 19 '21 14:05 MiguelCompany

We haven't seen this in quite a while, so I'm going to close this out. We'll open a new issue if we come across it again.

clalancette avatar Jun 13 '23 19:06 clalancette