rclpy icon indicating copy to clipboard operation
rclpy copied to clipboard

memory leak

Open obi-t4 opened this issue 2 years ago • 6 comments

Bug report

Required Info:

  • Operating System:
    • Ubuntu20.04
  • Installation type:
    • binary and source, either is OK
  • Version or commit hash:
    • HEAD of galactic branch
  • DDS implementation:
    • rmw_cyclonedds_cpp (ros2 doctor --report says)
  • Client library (if applicable):
    • rclpy

Steps to reproduce issue

The following code is almost same as examples/rclpy/services/minimal_client/examples_rclpy_minimal_client/client.py

In this modification, main is called in the while loop and memory usage is observed. Obviously everything is terminated cleanly, but the rss increases monotonically.

rosbridge is facing this memory leak issue. https://github.com/RobotWebTools/rosbridge_suite/pull/774 https://github.com/RobotWebTools/rosbridge_suite/issues/772

from example_interfaces.srv import AddTwoInts

import rclpy


def main(args=None):
    rclpy.init(args=args)
    node = rclpy.create_node('minimal_client')
    cli = node.create_client(AddTwoInts, 'add_two_ints')

    req = AddTwoInts.Request()
    req.a = 41
    req.b = 1
    while not cli.wait_for_service(timeout_sec=1.0):
        node.get_logger().info('service not available, waiting again...')

    future = cli.call_async(req)
    rclpy.spin_until_future_complete(node, future)

    result = future.result()
    node.get_logger().info(
        'Result of add_two_ints: for %d + %d = %d' %
        (req.a, req.b, result.sum))

    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    import time
    import psutil

    while True:
        main()
        rss = psutil.Process().memory_info().rss
        print(rss)
        time.sleep(0.5)

Expected behavior

rss (=usage of memory) should not be increased.

Actual behavior

rss (=usage of memory) is increased.

Additional information

The following code also reproduces this issue. So this is not only for service call problem.

def main(args=None):
    rclpy.init(args=args)
    node = rclpy.create_node('minimal_client')
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    import time
    import psutil

    while True:
        main()
        rss = psutil.Process().memory_info().rss
        print(rss)
        time.sleep(0.5)

obi-t4 avatar Nov 01 '22 13:11 obi-t4

A couple of observations about this example:

  1. rclpy.init and rclpy.shutdown aren't really meant to be run in a loop like that. I mean, you can do it, but it isn't totally expected. So I would pull those out of the loop and put them in the top-level.
  2. Because this is in a garbage collected language, you can't completely take the RSS as-is. At the very least, you need to call gc.collect during the loop there to attempt to make Python collect as much as it can.

All of that said, I tried this out on the latest Rolling with those changes, and it still looks like there is some leaking going on during the create_node/destroy_node pair. In a brief look with valgrind nothing obvious popped up, but there may be something deeper going on.

clalancette avatar Nov 01 '22 13:11 clalancette

Okay, this actually happens on current rolling. as far as i checked, this is RMW agnostic problem since we can make it happen with Fast-DDS and Cyclonedds.

Just FYI, similar problem for rclcpp.

  • https://github.com/ros2/rclcpp/issues/1925
  • https://github.com/ros2/rclcpp/issues/1254

fujitatomoya avatar Nov 01 '22 18:11 fujitatomoya

CC: @iuhilnehc-ynos @Barry-Xu-2018

fujitatomoya avatar Nov 01 '22 18:11 fujitatomoya

I don't think each call with "create_node" and "destroy_node" in the loop should get the same RSS(RES) value at the beginning. What I expect is it can get a steady value for RSS after a while.

I tested with https://github.com/fujitatomoya/ros2_test_prover/commit/cdd3397fe516571b9fe53894a63b3df588e2ac52, and found that using rmw_fastrtps_cpp can keep a steady value after a few minutes (Unfortunately, there is a tiny limitation about re-creating a participant, which is another topic.).

No matter how long I run the test with rmw_cyclonedds_cpp, the RSS value is always increased. After applying the following two patches for rmw_cyclonedds_cpp and cyclonedds, the RSS used in the test can keep a steady value after a few minutes too.

I think this issue should be separately reported in the rmw_cyclonedds_cpp and cyclonedds, just a workaround patch below

  • rmw_cyclonedds_cpp
diff --git a/rmw_cyclonedds_cpp/src/rmw_node.cpp b/rmw_cyclonedds_cpp/src/rmw_node.cpp
index 9b712ac..781e22a 100644
--- a/rmw_cyclonedds_cpp/src/rmw_node.cpp
+++ b/rmw_cyclonedds_cpp/src/rmw_node.cpp
@@ -1322,6 +1322,17 @@ void
 rmw_context_impl_t::clean_up()
 {
   discovery_thread_stop(common);
+  dds_guid_t guid;
+  rmw_gid_t gid;
+  auto sub = static_cast<CddsSubscription *>(common.sub->data);
+  dds_get_guid (sub->enth, &guid);
+  convert_guid_to_gid(guid, gid);
+  common.graph_cache.remove_entity(gid, true);
+  auto pub = static_cast<CddsPublisher *>(common.pub->data);
+  dds_get_guid (pub->enth, &guid);
+  convert_guid_to_gid(guid, gid);
+  common.graph_cache.remove_entity(gid, false);
+  common.graph_cache.remove_participant(common.gid);
   common.graph_cache.clear_on_change_callback();
   if (common.graph_guard_condition) {
     destroy_guard_condition(common.graph_guard_condition);
  • cyclonedds
diff --git a/src/core/ddsi/src/q_thread.c b/src/core/ddsi/src/q_thread.c
index 6de01fa1..f49b1885 100644
--- a/src/core/ddsi/src/q_thread.c
+++ b/src/core/ddsi/src/q_thread.c
@@ -329,12 +329,14 @@ dds_return_t create_thread (struct thread_state1 **ts1, const struct ddsi_domain
 
 static void reap_thread_state (struct thread_state1 *ts1, bool in_thread_states_fini)
 {
+  bool lazily_created_thread = false;
   ddsrt_mutex_lock (&thread_states.lock);
   switch (ts1->state)
   {
+    case THREAD_STATE_LAZILY_CREATED:
+      lazily_created_thread = true;
     case THREAD_STATE_INIT:
     case THREAD_STATE_STOPPED:
-    case THREAD_STATE_LAZILY_CREATED:
       ts1->state = THREAD_STATE_ZERO;
       break;
     case THREAD_STATE_ZERO:
@@ -360,6 +362,9 @@ static void reap_thread_state (struct thread_state1 *ts1, bool in_thread_states_
       assert (0);
   }
   ddsrt_mutex_unlock (&thread_states.lock);
+  if (in_thread_states_fini && lazily_created_thread) {
+    ddsrt_thread_cleanup_pop(1);
+  }
 }
 
 dds_return_t join_thread (struct thread_state1 *ts1)

NOTE: we can decrease the waiting time to time.sleep(0.1) in the loop to get a steady RSS value quickly.

iuhilnehc-ynos avatar Nov 28 '22 03:11 iuhilnehc-ynos

I tested with https://github.com/fujitatomoya/ros2_test_prover/commit/cdd3397fe516571b9fe53894a63b3df588e2ac52, and found that using rmw_fastrtps_cpp can keep a steady value after a few minutes

confirmed, i guess i was impatient to wait ... sorry.

I think this issue should be separately reported in the rmw_cyclonedds_cpp and cyclonedds, just a workaround patch below

removing itself from graph cache entity makes sense. i think this is just a bug. but i am not sure about cyclonedds fix.

@eboasson would you mind taking a look at above patch?

@iuhilnehc-ynos can you create PR after the confirmation?

fujitatomoya avatar Dec 01 '22 21:12 fujitatomoya

I tested with fujitatomoya/ros2_test_prover@cdd3397, and found that using rmw_fastrtps_cpp can keep a steady value after a few minutes

confirmed, i guess i was impatient to wait ... sorry.

I think this issue should be separately reported in the rmw_cyclonedds_cpp and cyclonedds, just a workaround patch below

removing itself from graph cache entity makes sense. i think this is just a bug. but i am not sure about cyclonedds fix.

That was indeed simply a missing "remove", all the others are balanced (from a quick check of the code), so the RMW fix is I believe entirely correct.

@eboasson would you mind taking a look at above patch?

I do get it and I see why this works, but I need to spend some more mental effort on it before declaring it correct 🙂

eboasson avatar Dec 02 '22 08:12 eboasson