
When using large maps, the lifecycle node kills amcl_node because of missing heartbeat

Open glpuga opened this issue 1 year ago • 3 comments

Bug description

Related to #468 , which discusses the beluga_amcl node becoming unresponsive for long periods of time when receiving large maps.

This issue covers a complementary failure that seems to be triggered by that behavior: because the node becomes unresponsive, the Nav2 lifecycle manager loses the bond connection to the node, kills it after bond_timeout seconds (a lifecycle manager parameter), and then restarts the beluga_amcl node.

Notice that this happens regardless of the value of bond_timeout. Even setting it to a very large value (e.g. 100 seconds) and providing the node with a 600 m x 200 m map at 0.05 m/pixel resolution (which takes 22 seconds to initialize), the following sequence happens.

  • The lifecycle manager and the amcl node get initialized.
  • The lifecycle manager and the amcl node connect to each other via bond.
  • amcl receives the map and becomes unresponsive for approx. 22 seconds.
  • amcl completes initialization after processing the likelihood map and becomes fully operational.
  • bond_timeout seconds after the creation of the bond, the lifecycle manager complains: "[lifecycle_manager]: CRITICAL FAILURE: SERVER amcl IS DOWN after not receiving a heartbeat for NNNN ms".
  • The lifecycle manager restarts the amcl node.

Image

~~The fact that the lifecycle manager does not detect the amcl node returning back to operational status may not be an issue within beluga_amcl, and might be in the lifecycle manager itself, but beluga_amcl certainly triggers this failure. The same failure mode does not happen if everything else is the same but nav2_amcl is used instead of beluga_amcl.~~

See below, the problem is in Beluga itself.

Platform (please complete the following information):

  • OS: Seen in ROS Humble.
  • Beluga version: 2.0.2

How to reproduce

Run any system using Beluga and a large map. 200 m x 600 m at 0.05 m/pixel resolution is large enough.

Expected behavior

The node should not go unresponsive for so long.

Actual behavior

The node freezes; the lifecycle manager complains and eventually kills and restarts it.

Additional context

glpuga avatar Feb 04 '25 19:02 glpuga

I'm still digging into this issue.

glpuga avatar Feb 04 '25 19:02 glpuga

The issue is in Beluga, not in the Lifecycle manager nor the bond_core library.

The reason the lifecycle manager node kills Beluga regardless of the configured heartbeat_timeout value is that, after freezing during likelihood map creation, Beluga stops publishing messages on the /bond topic. This remains true even after Beluga becomes responsive again and starts working normally.

That happens because the Beluga side of the bond connection hardcodes the heartbeat_timeout to just 4 seconds. While Beluga is frozen it is obviously not sending Status messages through /bond, but, more importantly, it is also not processing the heartbeat messages received through the same topic from the lifecycle manager node.

Because of that, the heartbeat timeout timer in Beluga fires after 4 seconds, while Beluga is still busy generating the likelihood map. When this timer event gets processed right after the freeze ends, it kills the Beluga side of the bond connection.

Beluga then stops sending Status messages, despite being active, and the lifecycle manager eventually times out as well (depending on the value of its heartbeat_timeout parameter) and kills the node.
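To make the sequence concrete, here is a minimal timing-model sketch of the failure mode (plain Python, not the real bond_core API; the 4 s Beluga-side timeout and 22 s freeze are the values discussed above, the 100 s manager timeout is illustrative):

```python
BELUGA_HEARTBEAT_TIMEOUT_S = 4.0   # hard-coded on the Beluga side of the bond
MANAGER_BOND_TIMEOUT_S = 100.0     # generous lifecycle manager bond_timeout
FREEZE_DURATION_S = 22.0           # likelihood map construction time

def beluga_bond_survives_freeze() -> bool:
    """While frozen, Beluga processes no heartbeats; if the freeze outlasts
    its own 4 s timeout, the expired timer is handled right after the freeze
    ends and kills Beluga's side of the bond."""
    return FREEZE_DURATION_S < BELUGA_HEARTBEAT_TIMEOUT_S

def manager_eventually_kills_node() -> bool:
    """Once Beluga's side of the bond is dead, it stops publishing Status
    messages, so the manager times out no matter how large bond_timeout is."""
    if beluga_bond_survives_freeze():
        return False  # heartbeats resume after the freeze; node survives
    return True       # manager declares the node down after bond_timeout

print(beluga_bond_survives_freeze())   # False: 22 s freeze > 4 s timeout
print(manager_eventually_kills_node()) # True: node gets killed and restarted
```

This is why raising bond_timeout on the manager side alone cannot help: the bond is already dead on Beluga's side.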

Multiple Beluga MCL variants are affected (e.g. NDT).

This will affect any system that takes longer than 4 seconds to generate the likelihood map, which in my experience does not require a really large map: roughly 150 m x 150 m on my i9, and probably much less on a slower machine.
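For a back-of-the-envelope sense of scale, the cell counts of the maps mentioned above (likelihood-field construction cost grows with the number of occupancy-grid cells):

```python
def grid_cells(width_m: float, height_m: float, resolution_m: float) -> int:
    """Number of occupancy-grid cells for a map of the given size."""
    return round(width_m / resolution_m) * round(height_m / resolution_m)

print(grid_cells(600, 200, 0.05))  # 48000000 cells: the map that froze for 22 s
print(grid_cells(150, 150, 0.05))  # 9000000 cells: roughly the 4 s threshold here
```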

Solutions:

  • Short term, raise the default timeout in Beluga to 30 seconds. The timeout will still have to be raised in the lifecycle manager configuration for things to work, but at least the problem becomes fixable by the user without rebuilding Beluga (currently it is not).
  • Long term, fix https://github.com/Ekumen-OS/beluga/issues/468
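For reference, the manager-side timeout is the lifecycle manager's bond_timeout parameter; a sketch of what raising it might look like in a Nav2-style params file (node and manager names are illustrative, not taken from any particular setup):

```yaml
lifecycle_manager_localization:
  ros__parameters:
    autostart: true
    node_names: ["map_server", "amcl"]
    bond_timeout: 30.0   # seconds; Nav2's default is 4.0
```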

glpuga avatar Feb 07 '25 15:02 glpuga

This potential problem was attenuated by the fix for #468, which improved the performance of the algorithm that computes the likelihood map from the occupancy grid of the environment.

The performance depends on the map, however, and for some maps the time it takes to build the likelihood map will still far exceed the 4-second timeout identified as the problem in this ticket above.

This is true for both Beluga and Nav2, since both of them use the same hard-coded timeout value.

An example map which would cause both Beluga and Nav2 to trigger this problem is a tiled map like this: Image

After discussing this, we agreed that the best solution is to make the timeout value a ROS parameter. This is a WIP.

glpuga avatar Feb 13 '25 21:02 glpuga

This was addressed in https://github.com/Ekumen-OS/beluga/pull/473

glpuga avatar Aug 03 '25 21:08 glpuga