ros2cli
Nodes missing from `ros2 node list` after relaunch
Bug report
Required Info:
- Operating System: Ubuntu 20.04
- Installation type: Foxy binaries
- Version or commit hash: ros-foxy-navigation2 0.4.5-1focal.20201210.084248
- DDS implementation: Fast-RTPS (default)
- Client library (if applicable): n/a
Steps to reproduce issue
1. From the workspace root, launch (e.g.) a TurtleBot3 simulation:
export TURTLEBOT3_MODEL=burger
export GAZEBO_MODEL_PATH=$GAZEBO_MODEL_PATH:$(pwd)/src/turtlebot3/turtlebot3_simulations/turtlebot3_gazebo/models
ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py
Then, in a second terminal, launch the navigation:
export TURTLEBOT3_MODEL=burger
ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true
Print the node list:
ros2 node list
Close (ctrl-c) the navigation and the simulation.
2. From the same respective terminals, relaunch the simulation:
ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py
and the navigation:
ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true
Print the node list again (2nd time):
ros2 node list
Close (ctrl-c) the navigation and the simulation. Stop the ros2 daemon:
ros2 daemon stop
3. From the same respective terminals, relaunch the simulation:
ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py
and the navigation:
ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true
Print the node list again (3rd time):
ros2 node list
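For reference, a minimal sketch that scripts one launch/check/teardown cycle, so the three node lists can be diffed afterwards (the 20-second settle time and the output file naming are my own additions, not part of the steps above):
#!/usr/bin/env bash
# cycle.sh RUN_INDEX -- one simulation + navigation cycle.
# Assumes TURTLEBOT3_MODEL and GAZEBO_MODEL_PATH are exported as above.
# Run three times and diff the nodes_run_*.txt files, stopping the
# daemon between runs 2 and 3.
ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py &
SIM_PID=$!
ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true &
NAV_PID=$!
sleep 20                                   # give discovery time to settle
ros2 node list | sort > "nodes_run_$1.txt"
kill -INT "$NAV_PID" "$SIM_PID"            # equivalent of ctrl-c
wait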
Expected behavior
The node list should be the same all three times (up to some hash in the /transform_listener_impl_... nodes).
Actual behavior
The second time, the following nodes are missing (the remainder is practically the same):
/controller_server
/controller_server_rclcpp_node
/global_costmap/global_costmap
/global_costmap/global_costmap_rclcpp_node
/global_costmap_client
/local_costmap/local_costmap
/local_costmap/local_costmap_rclcpp_node
/local_costmap_client
/planner_server
/planner_server_rclcpp_node
The third time, after stopping the daemon, it works as expected again.
Note that everything else works fine; in the above navigation use case, the nodes are fully functional.
Additional information
This issue was raised here: ros-planning/navigation2#2145.
I'm seeing something similar with gazebo + ros2_control as well.
The interesting thing is that if I run ros2 node list, I get 0 nodes. If I run ros2 node list --no-daemon, I get the list of nodes. Restarting the daemon with ros2 daemon stop; ros2 daemon start also shows all nodes.
I think that this is expected behavior for the ros2 daemon; it is well described in what-is-ros2-daemon.
Is it? I understood it as a cache of nodes and their subs/pubs/services, etc., that should be transparent to use. But this cache is getting outdated and only restarting the daemon fixes it.
I could understand it keeping some nodes "alive" in the cache, as it takes some time of them being unresponsive before eliminating them. But I am starting new nodes and they do not show up in any commands that use the daemon, even after waiting several minutes. I have to restart the daemon or use the --no-daemon flag.
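A quick way to see the divergence is to compare the daemon's cached view with a fresh discovery pass (a sketch; --spin-time only applies to the non-daemon call):
# Empty output means the daemon's cache matches a fresh discovery pass
diff <(ros2 node list | sort) <(ros2 node list --no-daemon --spin-time 5 | sort)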
Ah, I see. You are saying:
But this cache is getting outdated and only restarting the daemon fixes it.
problem-1: old cache entries can be seen and will not be cleaned?
But I am starting new nodes and they do not show up in any commands that use the daemon, even after waiting several minutes.
problem-2: the cache does not get updated?
Am I understanding correctly?
Exactly, I've seen both issues.
problem-1: Cache (daemon) retaining nodes killed long ago. problem-2: Cache (daemon) not adding new nodes.
I'm trying to find reproducible examples; currently I can make it happen 100% of the time, but only on a complex setup involving ros2_control with 2 controllers and launching and stopping navigation2.
There may also be underlying rmw issues causing problem-2, since I've seen that rviz2 would not list the topics from the newly spawned nodes, and even though I haven't looked in depth, I believe rviz2 has no relation to ros2cli.
Probably related to https://github.com/ros2/rmw_fastrtps/issues/509.
Could be related to https://github.com/ros2/rmw_fastrtps/pull/514 if the communication is localhost?
I'm seeing this bug on a project with five nodes, FastRTPS, native Ubuntu install.
I'm using ros2 launch files; everything comes up nicely the first couple of times, but eventually ros2 node list stops seeing all of the nodes (which are definitely running). At the same time, ros2 param stops being able to interact with the hidden nodes, and ros2 topic list stops showing all of the topics.
rqt is a bit weird; there were a few times when it seemed able to find a different collection of topics and nodes than the CLI tools.
ros2 daemon stop; ros2 daemon start has saved my day.
@BrettRD if your problem is related to https://github.com/ros2/rmw_fastrtps/pull/514, it would be really appreciated if you could try the https://github.com/ros2/ros2/tree/foxy branch to check whether you still meet the problem.
@fujitatomoya I'm currently running ros2 from apt, and this is pretty tedious to replicate with any confidence, so I'd like a sanity check on a procedure.
I'll try the following:
- rebuild the workspace from scratch (rm -rf install/ build/), using ROS from /opt/ros/foxy/setup.bash
- reset the ros2 daemon
- launch and tear down the application a bunch of times and count how many launches it takes before ros2 node list misses nodes (a rough sketch of this loop follows below)
That sets an order-of-magnitude baseline for how long to test the new branch. Then install ROS from source:
- clear the workspace (rm -rf install/ build/)
- load a new terminal without ros2 from apt
- clone the ros2 repos into a folder in src
- rebuild with colcon (including the ros2 source packages)
- load the local setup (. install/setup.bash), which should reference the latest local foxy
- reset the ros2 daemon
- repeat the launch and teardown until it drops nodes (confirming it is not fixed) or until I get bored (inconclusive but reassuring)
Does that sound about right?
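Roughly, the baseline loop I have in mind (the launch file and the expected node count are placeholders for my application):
#!/usr/bin/env bash
# Relaunch the app repeatedly and record when `ros2 node list` first
# misses nodes. my_pkg/my_app.launch.py and EXPECTED=5 are placeholders.
EXPECTED=5
for i in $(seq 1 50); do
    ros2 launch my_pkg my_app.launch.py &
    LAUNCH_PID=$!
    sleep 10                               # let all nodes come up
    SEEN=$(ros2 node list | wc -l)
    echo "iteration $i: $SEEN nodes"
    kill -INT "$LAUNCH_PID"
    wait "$LAUNCH_PID" 2>/dev/null
    if [ "$SEEN" -lt "$EXPECTED" ]; then
        echo "node list dropped nodes on iteration $i"
        break
    fi
done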
I think that sounds okay, and the whole procedure is https://docs.ros.org/en/foxy/Installation/Linux-Development-Setup.html. I usually use an ubuntu:20.04 docker container as the base.
I have a result! -- Not fixed.
I built from source (55 minutes build time, after tracking down additional deps), and my build does contain ros2/rmw_fastrtps#514. I did not source /opt/ros/foxy/setup.bash, so I'm using foxy latest.
In order to trigger this bug, I have to SIGINT ros2 launch before all the nodes are up, loading and closing fast enough to see duplicate nodes (which age out normally).
Once this bug is triggered, I can load the same 5-node launch file and ros2 node list will list a random subset of the nodes from the launch file, but always the same number, until you run ros2 daemon stop; then everything goes back to normal. Other nodes like rqt and ros2 topic echo are listed fine.
I can retrigger this bug, and the size of the subset gets smaller by one node each time. I can keep triggering it until no nodes from that launch file get listed, and eventually reloading rqt doesn't list them either.
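To make the trigger concrete, the sequence is roughly the following (the 2-second window and the launch file are illustrative; the point is to interrupt before all five nodes are up):
ros2 launch my_pkg my_app.launch.py &      # placeholder launch file
sleep 2                                    # shorter than full startup
kill -INT $!                               # SIGINT before all nodes are up
wait
ros2 launch my_pkg my_app.launch.py &      # relaunch normally
sleep 10
ros2 node list                             # subset only, until `ros2 daemon stop`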
Recently I've met this bug in my project, and here is what I found:
- This bug still exists in the apt version of 20221012 of foxy (with rmw_fastrtps_cpp).
- ros2 daemon stop and ros2 daemon start can update the node list effectively, but it does not take effect every time; you need to try a couple of times.
- Without the ros2 daemon operation, ros2 lifecycle set may return an error with "node not found"; maybe this command depends on the output of ros2 node list.
And I have some questions: @nielsvd @BrettRD @v-lopez
- I'm not sure why the rmw could cause this problem; would changing the rmw solve this issue? @fujitatomoya I've found it happening with rmw_cyclonedds in the compiled version: https://github.com/ZhenshengLee/ros2_jetson/issues/10
- All of ros2cli depends on rclpy; might using rclcpp be a way to work around this issue?
- Will this issue be resolved in a future release of ROS 2, like Galactic or Humble?
I'm not sure why the rmw could cause this problem; would changing the rmw solve this issue?
The discovery protocol is implemented in the RMW implementation, so changing the rmw would solve the problem.
All of ros2cli depends on rclpy; might using rclcpp be a way to work around this issue?
No, I do not think so; related to the previous comment, discovery depends on the underlying rmw implementation.
Will this issue be resolved in a future release of ROS 2, like Galactic or Humble?
I cannot reproduce this issue with my local environment and the rolling branch.
@fujitatomoya thank you for your quick reply.
The discovery protocol is implemented in the RMW implementation, so changing the rmw would solve the problem.
Thanks for the tips, I will give it a try.
No, I do not think so; related to the previous comment, discovery depends on the underlying rmw implementation.
OK, so rclcpp would not bypass the issue.
I cannot reproduce this issue with my local environment and the rolling branch.
According to @v-lopez, only a complex launch causes this node-list problem:
I'm trying to find reproducible examples; currently I can make it happen 100% of the time, but only on a complex setup involving ros2_control with 2 controllers and launching and stopping navigation2.
I have not noticed this bug in Galactic, but I encountered it immediately again when I used Humble. I have seen https://github.com/ZhenshengLee/ros2_jetson/issues/10 in Galactic.
@iuhilnehc-ynos @llapx can you check if we can see this problem with rolling, if you have bandwidth?
I think there is no easily reproducible procedure currently, but we can check with https://github.com/ros2/ros2cli/issues/582#issue-784108824.
I have not noticed this bug in Galactic, but I encountered it immediately again when I used Humble.
@BrettRD the primary difference between Galactic and Humble/Foxy is the default rmw used.
problem-1: Cache (daemon) retaining nodes killed long ago. problem-2: Cache (daemon) not adding new nodes.
since I've seen that rviz2 would not list the topics from the newly spawned nodes, and even though I haven't looked in depth, I believe rviz2 has no relation to ros2cli.
From my test (https://github.com/ros2/ros2cli/issues/779#issuecomment-1315117834) and the comment from @v-lopez above, rviz2 will bypass the node-missing issue.
I believe the root cause is not in the rmw layer, so changing the rmw will not bypass the issue, and rclcpp/rviz2 will not see this problem.
@fujitatomoya OK, I'll take a look.
I have tested it on ros:rolling (docker), building turtlebot3 and navigation2 from source (ros:rolling does not provide the nav2 packages); after testing many times, it works well.
This issue is not easy to reproduce.
But it must still be there, because I can reproduce this issue with rolling a few times (the reproducible steps are similar to https://github.com/ros2/ros2cli/issues/582#issue-784108824). After stopping the ros2 daemon in step 2 of https://github.com/ros2/ros2cli/issues/582#issue-784108824, we can immediately get the correct result of the node list.
1. ros2 daemon stop (stop the ros2 daemon if it was running before)
2. ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False
3. ros2 node list | wc -l (showing 31 means it is currently good)
4. Ctrl+C to stop step 2, then re-launch it and re-check step 3
Notice that the navigation demo runs well even if the ros2 node list is incorrect.
- I can't use rmw_cyclonedds_cpp to reproduce this issue.
- For rmw_fastrtps_cpp, as Ctrl+C on ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False can't make all processes exit normally, the shared-memory files used by Fast-DDS are not cleaned up successfully. I don't know whether that is the root cause making the ros2 daemon no longer update node_listener -> rmw_dds_common::GraphCache::update_participant_entities.
- Some information about the ros2 daemon:
top info of the ros2 daemon:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3648025 chenlh 20 0 667912 79412 47136 R 99.7 0.2 4:02.62 python3 # almost 100% CPU usage
3648022 chenlh 20 0 667912 79412 47136 S 0.3 0.2 0:03.56 python3
3647989 chenlh 20 0 667912 79412 47136 S 0.0 0.2 0:00.40 python3
3648019 chenlh 20 0 667912 79412 47136 S 0.0 0.2 0:00.00 python3
3648020 chenlh 20 0 667912 79412 47136 S 0.0 0.2 0:00.00 python3
3648021 chenlh 20 0 667912 79412 47136 S 0.0 0.2 0:00.01 python3
3648023 chenlh 20 0 667912 79412 47136 S 0.0 0.2 0:00.08 python3
3648024 chenlh 20 0 667912 79412 47136 S 0.0 0.2 0:00.00 python3
3648026 chenlh 20 0 667912 79412 47136 S 0.0 0.2 0:00.00 python3
3648027 chenlh 20 0 667912 79412 47136 S 0.0 0.2 0:00.05 python3
3648028 chenlh 20 0 667912 79412 47136 S 0.0 0.2 0:00.00 python3
3648029 chenlh 20 0 667912 79412 47136 S 0.0 0.2 0:00.02 python3
- thread info of the ros2 daemon, to find that thread 3648025 is Id 8:
(gdb) info thread
Id Target Id Frame
* 1 Thread 0x7faf51f801c0 (LWP 3647989) "python3" 0x00007faf52099d7f in __GI___poll (fds=0x7faf513bbae0, nfds=1, timeout=7200000)
at ../sysdeps/unix/sysv/linux/poll.c:29
2 Thread 0x7faf4c282640 (LWP 3648019) "python3" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393,
expected=0, futex_word=0x7faf50ceb000 <(anonymous namespace)::g_signal_handler_sem>) at ./nptl/futex-internal.c:57
3 Thread 0x7faf4ba81640 (LWP 3648020) "python3" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7faf4ba80de0, op=137,
expected=0, futex_word=0x55e32f872ae0) at ./nptl/futex-internal.c:57
4 Thread 0x7faf4b280640 (LWP 3648021) "python3" __futex_abstimed_wait_common64 (private=290346745, cancel=true, abstime=0x7faf4b27fc10, op=137,
expected=0, futex_word=0x55e32feb7760) at ./nptl/futex-internal.c:57
5 Thread 0x7faf4a9f8640 (LWP 3648022) "python3" __futex_abstimed_wait_common64 (private=1326168272, cancel=true, abstime=0x7faf4a9f7c10, op=137,
expected=0, futex_word=0x55e32ff19bcc) at ./nptl/futex-internal.c:57
6 Thread 0x7faf4a1f7640 (LWP 3648023) "python3" 0x00007faf520a8934 in __libc_recvfrom (fd=17, buf=0x55e32ff1c570, len=65500, flags=0, addr=...,
addrlen=0x7faf4a1f6a0c) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
7 Thread 0x7faf499f6640 (LWP 3648024) "python3" 0x00007faf520a8934 in __libc_recvfrom (fd=18, buf=0x55e32ff2cd90, len=65500, flags=0, addr=...,
addrlen=0x7faf499f5a0c) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
8 Thread 0x7faf491e8640 (LWP 3648025) "python3" 0x00007faf500de664 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
9 Thread 0x7faf489e7640 (LWP 3648026) "python3" 0x00007faf520a8934 in __libc_recvfrom (fd=20, buf=0x55e32ff40070, len=65500, flags=0, addr=...,
addrlen=0x7faf489e6a0c) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
10 Thread 0x7faf481d9640 (LWP 3648027) "python3" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7faf481d8940,
op=265, expected=0, futex_word=0x7faf470c9110) at ./nptl/futex-internal.c:57
11 Thread 0x7faf478f8640 (LWP 3648028) "python3" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x55e32ff54a28) at ./nptl/futex-internal.c:57
12 Thread 0x7faf46d57640 (LWP 3648029) "python3" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x7faf30000c04) at ./nptl/futex-internal.c:57
The backtrace for thread Id 8:
(gdb) thread 8
[Switching to thread 8 (Thread 0x7faf491e8640 (LWP 3648025))]
#0 0x00007faf500df636 in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) bt
#0 0x00007faf500df636 in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#1 0x00007faf4f6b4163 in eprosima::fastdds::rtps::SharedMemManager::find_segment (this=0x55e32fd29aa0, id=...)
at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemManager.hpp:1282
#2 0x00007faf4f6b22f1 in eprosima::fastdds::rtps::SharedMemManager::Listener::pop (this=0x55e32ff2ccf0)
at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemManager.hpp:711
#3 0x00007faf4f6b58fb in eprosima::fastdds::rtps::SharedMemChannelResource::Receive (this=0x55e32fe3b100, remote_locator=...)
at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemChannelResource.hpp:182
#4 0x00007faf4f6b556e in eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation (this=0x55e32fe3b100, input_locator=...)
at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemChannelResource.hpp:133
#5 0x00007faf4f6d0579 in std::__invoke_impl<void, void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> (
__f=@0x55e32ff3fa78: (void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastdds::rtps::SharedMemChannelResource * const, eprosima::fastrtps::rtps::Locator_t)) 0x7faf4f6b54e4 <eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation(eprosima::fastrtps::rtps::Locator_t)>, __t=@0x55e32ff3fa70: 0x55e32fe3b100) at /usr/include/c++/11/bits/invoke.h:74
#6 0x00007faf4f6d00e2 in std::__invoke<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> (
__fn=@0x55e32ff3fa78: (void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastdds::rtps::SharedMemChannelResource * const, eprosima::fastrtps::rtps::Locator_t)) 0x7faf4f6b54e4 <eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation(eprosima::fastrtps::rtps::Locator_t)>) at /usr/include/c++/11/bits/invoke.h:96
#7 0x00007faf4f6cfeb3 in std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> >::_M_invoke<0ul, 1ul, 2ul> (this=0x55e32ff3fa58)
at /usr/include/c++/11/bits/std_thread.h:253
#8 0x00007faf4f6cf952 in std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> >::operator() (this=0x55e32ff3fa58)
at /usr/include/c++/11/bits/std_thread.h:260
#9 0x00007faf4f6cf218 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> > >::_M_run (this=0x55e32ff3fa50)
at /usr/include/c++/11/bits/std_thread.h:211
#10 0x00007faf501c42b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007faf52015b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007faf520a7a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
https://github.com/eProsima/Fast-DDS/blob/7e12e8fe2cebf27c621263fa544f94b099504808/src/cpp/rtps/transport/shared_mem/SharedMemChannelResource.hpp#L128-L136
void perform_listen_operation(
        Locator input_locator)
{
    Locator remote_locator;
    while (alive())
    {
        // Blocking receive.
        std::shared_ptr<SharedMemManager::Buffer> message;
        // NOTE: Receive is expected to block when there is no data, but when
        // it returns nullptr, this loop just retries it again and again.
        if (!(message = Receive(remote_locator)))
        {
            continue;
        }
Receive fails to pop the message because find_segment throws an exception inside. I don't know whether it's a bug or not, because I can't reproduce this issue the first time after clearing the related shm files (/dev/shm/*fastrtps*).
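For anyone following along, inspecting and clearing those files looks like this (a sketch; the exact segment names vary between Fast-DDS versions):
ls /dev/shm/*fastrtps*    # leftover Fast-DDS shared-memory segments
fastdds shm clean         # remove orphaned segments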
Could be related to https://github.com/eProsima/Fast-DDS/issues/2790
@iuhilnehc-ynos a couple of questions.
can't make all processes exit normally
Can you point out which node or processes cannot exit normally? Is it receiving an exception, or a core crash?
I can't reproduce this issue the first time after clearing the related shm files /dev/shm/*fastrtps*.
I think this is a good clue that we have found.
- Is it always the same node that cannot be listed, or a random one?
- If we add fastdds shm clean to the procedure, does the problem stop happening?
Can you point out which node or processes cannot exit normally? Is it receiving an exception, or a core crash?
Pressing Ctrl+C for ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False has different behavior each time, but most errors are from rviz2 and component_container_isolated, which might be killed by ros2 launch.
Is it always the same node that cannot be listed, or a random one?
It shows a random node list, but when the issue happens, the node list is almost the same as the previous one while running tb3_simulation_launch.py again, except that node names with new IDs are refreshed, such as the launch node /launch_ros_{a_new_pid}.
If we add fastdds shm clean to the procedure, does the problem stop happening?
No. I tried using fastdds shm clean, but it is not enough, because shared-memory files for data communication are also used in the ros2 daemon node. I must stop the ros2 daemon.
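Putting those observations together, the recovery sequence looks like this (the order matters, because the daemon itself keeps shared-memory segments open):
ros2 daemon stop          # release the daemon's own shm segments first
fastdds shm clean         # now the orphaned Fast-DDS files can be removed
ros2 daemon start
ros2 node list            # reflects the live graph again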
BTW: I think it's not difficult to reproduce this issue. Please don't be gentle with tb3_simulation_launch.py (press Ctrl+C at any time to stop it, and rerun it immediately). I have confirmed this issue with both humble and rolling.
I hope you guys can reproduce this issue on your machines; otherwise, nobody can help confirm it, even if I have a workaround patch :smile:.
@JLBuenoLopez-eProsima @MiguelCompany any thoughts? I believe it is clear that the shared-memory files or caches used by the ros2 daemon are related to the issue.
I had an issue calling ros2 node list from another terminal using a Python script. On occasion, there would be missing nodes on the first call, but subsequent calls would populate the node list correctly.
I tried other methods such as stopping and restarting the daemon, and that seemed to work, but I felt apprehensive about that workaround as I don't fully understand the consequences. What I found worked was adding the --spin-time parameter to the call: ros2 node list --spin-time 5. That always seemed to populate the node list correctly. I hope this helps others.
What does --spin-time do?
--spin-time SPIN_TIME Spin time in seconds to wait for discovery (only applies when not using an already running daemon)
I tried other methods such as stopping and restarting the daemon, and that seemed to work, but I felt apprehensive about that workaround as I don't fully understand the consequences.
The downside could be the discovery time for any other nodes running on that host system. The daemon caches and advertises the ROS 2 network graph; while the daemon is running, other ROS 2 processes on the same host can query it for connectivity without waiting for the entire discovery.
What does --spin-time do?
We can use this option to wait for the ROS 2 network graph to update until the specified timeout expires, but this option is only valid when the daemon is not running or when the --no-daemon option is specified.
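So, to force a fresh discovery pass regardless of whether the daemon is running, the two options can be combined:
# Bypass the daemon and wait up to 5 seconds for discovery
ros2 node list --no-daemon --spin-time 5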