
rmw_publish can block

Open cleitner opened this issue 5 years ago • 0 comments

Bug report

Required Info:

  • Operating System:
    • Debian and Ubuntu
  • Installation type:
    • From source
  • Version or commit hash:
    • Eloquent and Foxy HEAD
  • DDS implementation:
    • Fast-RTPS and CycloneDDS
  • Client library (if applicable):
    • Tested with rclpy; the problem lies in RMW

Steps to reproduce issue

The fast (1 kHz) publisher and slow (10 Hz) subscriber (https://gist.github.com/cleitner/93decfa79a99a8a3b59a795df02b99e7, https://gist.github.com/cleitner/e5eb35f6bcf6639425c06c03dd13fbff) expose problems with buffer bloat and, more relevant here, issues with the rmw_publish interface.

As described in #176, the RMW implementations can cause rmw_publish to block when history is set to KEEP_ALL.

Calling the two scripts with

$ ./pub_back_pressure.py
$ ./sub_back_pressure.py

will sometimes exhibit blocking times > 20 ms in rmw_publish and, more often, RMW_RET_ERROR when the internal RESOURCE_LIMITS setting is hit.

...
Failed to publish at 10032 with Failed to publish: cannot publish data, at .../src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_publish.cpp:53, at .../src/ros2/rcl/rcl/src/rcl/publisher.c:290
Failed to publish at 10032 with Failed to publish: cannot publish data, at .../src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_publish.cpp:53, at .../src/ros2/rcl/rcl/src/rcl/publisher.c:290
Slowed down at 10032 for 41.456 ms
Slowed down at 10035 for 101.639 ms
Failed to publish at 10039 with Failed to publish: cannot publish data, at .../src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_publish.cpp:53, at .../src/ros2/rcl/rcl/src/rcl/publisher.c:290
Failed to publish at 10032 with Failed to publish: cannot publish data, at .../src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_publish.cpp:53, at .../src/ros2/rcl/rcl/src/rcl/publisher.c:290
...
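
For readers without the gists at hand, log lines of this shape could be produced by timing each publish call. The following is only a sketch and not the code from the gists: `timed_publish` is a hypothetical helper, the 20 ms threshold is taken from the text above, and it assumes the publish callable raises an exception on failure.

```python
import time


def timed_publish(publish, msg, index, threshold_s=0.020):
    """Call publish(msg); return a log line if it failed or was slow, else None.

    `publish` is any callable (assumed to raise on failure), `index` is the
    sample counter used in the log line.
    """
    start = time.perf_counter()
    try:
        publish(msg)
    except Exception as exc:  # assumption: a failed publish surfaces as an exception
        return f"Failed to publish at {index} with {exc}"
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if elapsed_ms > threshold_s * 1000.0:
        return f"Slowed down at {index} for {elapsed_ms:.3f} ms"
    return None
```

With rclpy, `publish` would be the bound `publish` method of a publisher object; any other callable works for experimenting.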

With

$ RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./pub_back_pressure.py
$ RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./sub_back_pressure.py

the publisher blocks with (seemingly) random timeouts, but no errors:

...
Slowed down at 3205 for 21.131 ms
Slowed down at 3207 for 21.206 ms
Slowed down at 3209 for 209.976 ms
Slowed down at 3210 for 275.831 ms
Slowed down at 3211 for 173.350 ms
Slowed down at 3212 for 97.646 ms
Slowed down at 3213 for 43.757 ms
...

The subscriber crashes with OoM because CycloneDDS doesn't seem to limit the incoming buffer.
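
As a rough illustration of why an unbounded incoming buffer ends in OoM here, the backlog can be estimated from the two rates of the reproduction scripts. This is back-of-the-envelope arithmetic only, assuming no sample is ever dropped or rejected; `backlog_samples` is a hypothetical helper and not part of any RMW or DDS API.

```python
def backlog_samples(pub_hz, sub_hz, seconds):
    """Samples queued at the subscriber if none are dropped or rejected."""
    return max(0, (pub_hz - sub_hz) * seconds)


# Rates from the reproduction scripts: 1 kHz publisher, 10 Hz subscriber.
print(backlog_samples(1000, 10, 60))  # 59400 samples after one minute
```

The backlog grows linearly for as long as the publisher keeps running, so memory use is bounded only by the size of each sample and the available RAM.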

Expected behavior

At least consistent behavior regarding returning errors or blocking.

IMHO the correct result would be an indication that the call would have needed to block, akin to the taken parameter of rmw_take. This would allow pure polling operation of rmw_publish, e.g. in a cyclic realtime context.

No crashing in the subscriber with default settings.

Actual behavior

rmw_publish blocks when some unknown internal (and external) limit is hit. Fast-RTPS also returns RMW_RET_ERROR without having experienced a real error.

The CycloneDDS subscriber can be crashed with OoM if the publisher exceeds the subscriber's processing bandwidth.

Additional information

The write function of the DDS DataWriter interface has an OUT_OF_RESOURCES return code, and the RELIABILITY QoS specifies a max_blocking_time parameter, which would preferably be set to 0.

The interface of rmw_publish could be made symmetric to rmw_take:

rmw_ret_t
rmw_publish(
  const rmw_publisher_t * publisher,
  const void * ros_message,
  bool * published,
  rmw_publisher_allocation_t * allocation);
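
To illustrate the intended caller-side semantics, here is a minimal pure-Python sketch that does not use any real rmw or DDS API. `BoundedWriter`, `try_publish` and `drain_one` are hypothetical names; the bounded queue merely stands in for a writer history with a resource limit, and the boolean return value plays the role of the proposed `published` flag.

```python
import queue


class BoundedWriter:
    """Hypothetical stand-in for a writer with a fixed-size history."""

    def __init__(self, depth):
        self._history = queue.Queue(maxsize=depth)

    def try_publish(self, msg):
        """Never blocks. Returns True if msg was queued, False if the history is full."""
        try:
            self._history.put_nowait(msg)
        except queue.Full:
            return False  # not an error: the caller simply has to try again later
        return True

    def drain_one(self):
        """Simulates the subscriber side acknowledging the oldest sample."""
        return self._history.get_nowait()
```

A cyclic realtime loop could then call `try_publish` once per cycle and decide for itself whether to drop, replace or retry the sample when it returns `False`, instead of being blocked for an unknown time.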

cleitner, Mar 30 '20 15:03