rclpy
rclpy copied to clipboard
memory leak
Bug report
Required Info:
- Operating System:
- Ubuntu20.04
- Installation type:
- binary and source, either is OK
- Version or commit hash:
- HEAD of galactic branch
- DDS implementation:
- rmw_cyclonedds_cpp (
ros2 doctor --report
says)
- rmw_cyclonedds_cpp (
- Client library (if applicable):
- rclpy
Steps to reproduce issue
The following code is almost same as examples/rclpy/services/minimal_client/examples_rclpy_minimal_client/client.py
In this modification, main is called in the while loop and memory usage is observed. Obviously everything is terminated cleanly, but the rss increases monotonically.
rosbridge is facing this memory leak issue. https://github.com/RobotWebTools/rosbridge_suite/pull/774 https://github.com/RobotWebTools/rosbridge_suite/issues/772
from example_interfaces.srv import AddTwoInts
import rclpy
def main(args=None):
rclpy.init(args=args)
node = rclpy.create_node('minimal_client')
cli = node.create_client(AddTwoInts, 'add_two_ints')
req = AddTwoInts.Request()
req.a = 41
req.b = 1
while not cli.wait_for_service(timeout_sec=1.0):
node.get_logger().info('service not available, waiting again...')
future = cli.call_async(req)
rclpy.spin_until_future_complete(node, future)
result = future.result()
node.get_logger().info(
'Result of add_two_ints: for %d + %d = %d' %
(req.a, req.b, result.sum))
node.destroy_node()
rclpy.shutdown()
if __name__ == '__main__':
import time
import psutil
while True:
main()
rss = psutil.Process().memory_info().rss
print(rss)
time.sleep(0.5)
Expected behavior
rss (=usage of memory) should not be increased.
Actual behavior
rss (=usage of memory) is increased.
Additional information
The following code also reproduces this issue. So this is not only for service call problem.
def main(args=None):
rclpy.init(args=args)
node = rclpy.create_node('minimal_client')
node.destroy_node()
rclpy.shutdown()
if __name__ == '__main__':
import time
import psutil
while True:
main()
rss = psutil.Process().memory_info().rss
print(rss)
time.sleep(0.5)
A couple of observations about this example:
-
rclpy.init
andrclpy.shutdown
aren't really meant to be run in a loop like that. I mean, you can do it, but it isn't totally expected. So I would pull those out of the loop and put them in the top-level. - Because this is in a garbage collected language, you can't completely take the RSS as-is. At the very least, you need to call
gc.collect
during the loop there to attempt to make Python collect as much as it can.
All of that said, I tried this out on the latest Rolling with those changes, and it still looks like there is some leaking going on during the create_node
/destroy_node
pair. In a brief look with valgrind nothing obvious popped up, but there may be something deeper going on.
Okay, this actually happens on current rolling. as far as i checked, this is RMW agnostic problem since we can make it happen with Fast-DDS and Cyclonedds.
Just FYI, similar problem for rclcpp
.
- https://github.com/ros2/rclcpp/issues/1925
- https://github.com/ros2/rclcpp/issues/1254
CC: @iuhilnehc-ynos @Barry-Xu-2018
I don't think each call with "create_node" and "destroy_node" in the loop should get the same RSS(RES) value at the beginning. What I expect is it can get a steady value for RSS after a while.
I tested with https://github.com/fujitatomoya/ros2_test_prover/commit/cdd3397fe516571b9fe53894a63b3df588e2ac52, and found that using rmw_fastrtps_cpp
can keep a steady value after a few minutes (Unfortunately, there is a tiny limitation about re-creating a participant, which is another topic.).
No matter how long I run the test with rmw_cyclonedds_cpp
, the RSS value is always increased.
After applying the following two patches for rmw_cyclonedds_cpp
and cyclonedds
, the RSS used in the test can keep a steady value after a few minutes too.
I think this issue should be separately reported in the rmw_cyclonedds_cpp
and cyclonedds
, just a workaround patch below
- rmw_cyclonedds_cpp
diff --git a/rmw_cyclonedds_cpp/src/rmw_node.cpp b/rmw_cyclonedds_cpp/src/rmw_node.cpp
index 9b712ac..781e22a 100644
--- a/rmw_cyclonedds_cpp/src/rmw_node.cpp
+++ b/rmw_cyclonedds_cpp/src/rmw_node.cpp
@@ -1322,6 +1322,17 @@ void
rmw_context_impl_t::clean_up()
{
discovery_thread_stop(common);
+ dds_guid_t guid;
+ rmw_gid_t gid;
+ auto sub = static_cast<CddsSubscription *>(common.sub->data);
+ dds_get_guid (sub->enth, &guid);
+ convert_guid_to_gid(guid, gid);
+ common.graph_cache.remove_entity(gid, true);
+ auto pub = static_cast<CddsPublisher *>(common.pub->data);
+ dds_get_guid (pub->enth, &guid);
+ convert_guid_to_gid(guid, gid);
+ common.graph_cache.remove_entity(gid, false);
+ common.graph_cache.remove_participant(common.gid);
common.graph_cache.clear_on_change_callback();
if (common.graph_guard_condition) {
destroy_guard_condition(common.graph_guard_condition);
- cyclonedds
diff --git a/src/core/ddsi/src/q_thread.c b/src/core/ddsi/src/q_thread.c
index 6de01fa1..f49b1885 100644
--- a/src/core/ddsi/src/q_thread.c
+++ b/src/core/ddsi/src/q_thread.c
@@ -329,12 +329,14 @@ dds_return_t create_thread (struct thread_state1 **ts1, const struct ddsi_domain
static void reap_thread_state (struct thread_state1 *ts1, bool in_thread_states_fini)
{
+ bool lazily_created_thread = false;
ddsrt_mutex_lock (&thread_states.lock);
switch (ts1->state)
{
+ case THREAD_STATE_LAZILY_CREATED:
+ lazily_created_thread = true;
case THREAD_STATE_INIT:
case THREAD_STATE_STOPPED:
- case THREAD_STATE_LAZILY_CREATED:
ts1->state = THREAD_STATE_ZERO;
break;
case THREAD_STATE_ZERO:
@@ -360,6 +362,9 @@ static void reap_thread_state (struct thread_state1 *ts1, bool in_thread_states_
assert (0);
}
ddsrt_mutex_unlock (&thread_states.lock);
+ if (in_thread_states_fini && lazily_created_thread) {
+ ddsrt_thread_cleanup_pop(1);
+ }
}
dds_return_t join_thread (struct thread_state1 *ts1)
NOTE: we can decrease the waiting time to time.sleep(0.1)
in the loop to get a steady RSS value quickly.
I tested with https://github.com/fujitatomoya/ros2_test_prover/commit/cdd3397fe516571b9fe53894a63b3df588e2ac52, and found that using rmw_fastrtps_cpp can keep a steady value after a few minutes
confirmed, i guess i was impatient to wait ... sorry.
I think this issue should be separately reported in the rmw_cyclonedds_cpp and cyclonedds, just a workaround patch below
removing itself from graph cache entity makes sense. i think this is just a bug. but i am not sure about cyclonedds fix.
@eboasson would you mind taking a look at above patch?
@iuhilnehc-ynos can you create PR after the confirmation?
I tested with fujitatomoya/ros2_test_prover@cdd3397, and found that using rmw_fastrtps_cpp can keep a steady value after a few minutes
confirmed, i guess i was impatient to wait ... sorry.
I think this issue should be separately reported in the rmw_cyclonedds_cpp and cyclonedds, just a workaround patch below
removing itself from graph cache entity makes sense. i think this is just a bug. but i am not sure about cyclonedds fix.
That was indeed simply a missing "remove", all the others are balanced (from a quick check of the code), so the RMW fix is I believe entirely correct.
@eboasson would you mind taking a look at above patch?
I do get it and I see why this works, but I need to spend some more mental effort on it before declaring it correct 🙂