Windows 10 shared memory related crash/hang
Bug report
Required Info:
-
Operating System: Windows 10
-
Installation type: Binaries
-
Version or commit hash: Humble (Patch Release 4, 43083e5)
-
DDS implementation: Fast-RTPS
-
Client library (if applicable): rclcpp
Steps to reproduce issue
Sometimes when we run our launchfile one or more of our nodes will immediately die or seemingly hang. It seems to be caused by using Fast-RTPS shared memory on Windows...we are able to get around some of these issues by doing fastdds.bat shm clean but today we got the error pasted below and aren't sure how to proceed.
If anyone has any advice to make this reproducible and/or knows how to fix it that would be great. We're hoping that a new Windows Humble patch release with Fast-RTPS/Fast-DDS version bumps (which should solve this) could also help us out, but any advice in the meantime is appreciated!
Additional information
Error message:
[robot_controller.EXE-2] A subdirectory or file C:\ProgramData\eprosima\fastrtps_interprocess already exists.
[robot_controller.EXE-2]
[robot_controller.EXE-2] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[robot_controller.EXE-2] This error state is being overwritten:
[robot_controller.EXE-2]
[robot_controller.EXE-2] '__create_participant failed to create participant, at C:\ci\ws\src\ros2\rmw_fastrtps\rmw_fastrtps_shared_cpp\src\participant.cpp:105, at C:\ci\ws\src\ros2\rcl\rcl\src\rcl\node.c:264'
[robot_controller.EXE-2]
[robot_controller.EXE-2] with this new error message:
[robot_controller.EXE-2]
[robot_controller.EXE-2] 'rcl node's rmw handle is invalid, at C:\ci\ws\src\ros2\rcl\rcl\src\rcl\node.c:416'
[robot_controller.EXE-2]
[robot_controller.EXE-2] rcutils_reset_error() should be called after error handling to avoid this.
[robot_controller.EXE-2] <<<
[robot_controller.EXE-2] [ERROR] [1695302197.677919300] [rcl]: Failed to fini publisher for node: 1
[ERROR] [robot_controller.EXE-2]: process has died [pid 2704, exit code 3221226505, cmd 'C:\ws\resources\ros2_backend\ros_nodes\install\lib\robodk_yaskawa\robot_controller.EXE --ros-args -r __node:=robot_controller --params-file C:\Users\daddy\AppData\Local\Temp\launch_params_a68uie0v --params-file C:\Users\daddy\AppData\Local\Temp\launch_params_c4v5b50e --params-file C:\Users\daddy\AppData\Local\Temp\launch_params_4v7q3bh_'].
Fast-RTPS profile:
<?xml version="1.0" encoding="UTF-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<transport_descriptors>
<!-- We still want the default UDP transport to be present -->
<transport_descriptor>
<transport_id>udp_transport</transport_id>
<type>UDPv4</type>
</transport_descriptor>
<!-- We want a customized version of the SHM transport -->
<transport_descriptor>
<transport_id>shm_transport</transport_id>
<type>SHM</type>
<!--
Configure the size in bytes of the shared memory segment.
-->
<segment_size>100000000</segment_size>
<!--
The following value should be lower than or equal to segment_size.
The default value is shown here.
-->
<maxMessageSize>100000000</maxMessageSize>
</transport_descriptor>
</transport_descriptors>
<!-- Default publisher profile -->
<publisher profile_name="default publisher profile" is_default_profile="true">
<qos>
<publishMode>
<kind>ASYNCHRONOUS</kind>
</publishMode>
</qos>
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>
<participant profile_name="default publisher profile" is_default_profile="true">
<rtps>
<!--
As customized versions of the default transports are used, the
built-in ones should be disabled
-->
<useBuiltinTransports>false</useBuiltinTransports>
<!-- Link the Transport Layer to the Participant -->
<userTransports>
<transport_id>shm_transport</transport_id>
</userTransports>
</rtps>
</participant>
<!-- Default subscriber profile -->
<subscriber profile_name="default subscriber profile" is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</subscriber>
<participant profile_name="default subscriber profile" is_default_profile="true">
<rtps>
<!--
As customized versions of the default transports are used, the
built-in ones should be disabled
-->
<useBuiltinTransports>false</useBuiltinTransports>
<!-- Link the Transport Layer to the Participant -->
<userTransports>
<transport_id>shm_transport</transport_id>
</userTransports>
</rtps>
</participant>
<!-- Request subscriber profile for services -->
<subscriber profile_name="service">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</subscriber>
<participant profile_name="service">
<rtps>
<!--
As customized versions of the default transports are used, the
built-in ones should be disabled
-->
<useBuiltinTransports>false</useBuiltinTransports>
<!-- Link the Transport Layer to the Participant -->
<userTransports>
<transport_id>shm_transport</transport_id>
</userTransports>
</rtps>
</participant>
<!-- Request publisher profile for clients -->
<publisher profile_name="client">
<qos>
<publishMode>
<kind>ASYNCHRONOUS</kind>
</publishMode>
</qos>
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>
<participant profile_name="client">
<rtps>
<!--
As customized versions of the default transports are used, the
built-in ones should be disabled
-->
<useBuiltinTransports>false</useBuiltinTransports>
<!-- Link the Transport Layer to the Participant -->
<userTransports>
<transport_id>shm_transport</transport_id>
</userTransports>
</rtps>
</participant>
</profiles>
</dds>
There is a humble patch release pending to update Fast DDS to v2.6.6, which has several improvements related to SHM and resilience to crashing nodes.
@headlee We've now released Humble Patch release 5: https://github.com/ros2/ros2/releases/tag/release-humble-20230925 . Can you check to see if the problem you reported is fixed? Thanks.
The new patch release has definitely helped. We still see some intermittent issues but it's unclear at this point if they are due to fastrtps, ros daemon on Windows, or something else.
Is there any downside to us deleting the contents of the fastrtps_interprocess folder (C:\ProgramData\eprosima\fastrtps_interprocess) before every launch of our app? That has fixed issues in the past for us.
Is there any downside to us deleting the contents of the fastrtps_interprocess folder (
C:\ProgramData\eprosima\fastrtps_interprocess) before every launch of our app? That has fixed issues in the past for us.
@headlee No downsides. You could still do that.
Is there any downside to us deleting the contents of the fastrtps_interprocess folder (
C:\ProgramData\eprosima\fastrtps_interprocess) before every launch of our app? That has fixed issues in the past for us.@headlee No downsides. You could still do that.
Great, thanks!
One more sort of related question: do you know if it's possible to change the directory that eprosima uses to store the shm files on Windows? I found datasharingqospolicy in the docs but I don't know if that's for normal shm or just the DataReader/DataWriter delivery. Haven't had a chance to test it out yet.
do you know if it's possible to change the directory that eprosima uses to store the shm files on Windows?
@headlee I'm afraid it cannot be changed. It is fixed here