
Nodes missing from `ros2 node list` after relaunch

Open nielsvd opened this issue 4 years ago • 23 comments

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • Foxy binaries
  • Version or commit hash:
    • ros-foxy-navigation2 0.4.5-1focal.20201210.084248
  • DDS implementation:
    • Fast-RTPS (default)
  • Client library (if applicable):
    • n/a

Steps to reproduce issue

1

From the workspace root, launch (e.g.) a TurtleBot3 simulation:

export TURTLEBOT3_MODEL=burger
export GAZEBO_MODEL_PATH=$GAZEBO_MODEL_PATH:$(pwd)/src/turtlebot3/turtlebot3_simulations/turtlebot3_gazebo/models
ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py

Then, in a second terminal, launch the navigation:

export TURTLEBOT3_MODEL=burger
ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true

Print the node list:

ros2 node list

Close (ctrl-c) the navigation and the simulation.

2

From the same respective terminals, relaunch the simulation:

ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py

and the navigation:

ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true

Print the node list again (2nd time):

ros2 node list

Close (ctrl-c) the navigation and the simulation. Stop the ros2 daemon:

ros2 daemon stop
3

From the same respective terminals, relaunch the simulation:

ros2 launch turtlebot3_gazebo turtlebot3_world.launch.py

and the navigation:

ros2 launch turtlebot3_navigation2 navigation2.launch.py use_sim_time:=true

Print the node list again (3rd time):

ros2 node list

Expected behavior

The node list should be the same all three times (up to some hash in the /transform_listener_impl_... nodes).
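As a sketch of how to compare the runs while ignoring that hash (the file names are illustrative, not part of the original report):

ros2 node list | sed 's|/transform_listener_impl_.*|/transform_listener_impl_<hash>|' | sort > nodes_run1.txt
# ...relaunch, then capture the next run...
ros2 node list | sed 's|/transform_listener_impl_.*|/transform_listener_impl_<hash>|' | sort > nodes_run2.txt
diff nodes_run1.txt nodes_run2.txt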

Actual behavior

The second time, the following nodes are missing (the remainder is practically the same):

/controller_server
/controller_server_rclcpp_node
/global_costmap/global_costmap
/global_costmap/global_costmap_rclcpp_node
/global_costmap_client
/local_costmap/local_costmap
/local_costmap/local_costmap_rclcpp_node
/local_costmap_client
/planner_server
/planner_server_rclcpp_node

The third time, after stopping the daemon, it works as expected again.

Note that everything else works fine; in the navigation use case above, the nodes are fully functional.

Additional information

This issue was raised here: ros-planning/navigation2#2145.

nielsvd avatar Jan 12 '21 10:01 nielsvd

I'm seeing something similar with gazebo + ros2_control as well.

The interesting thing is that if I do: ros2 node list I get 0 nodes.

If I do ros2 node list --no-daemon I get the list of nodes.

Restarting the daemon with ros2 daemon stop; ros2 daemon start also shows all nodes.

v-lopez avatar Feb 09 '21 14:02 v-lopez

I think that this is expected behavior for the ros2 daemon; it is well described in what-is-ros2-daemon.

fujitatomoya avatar Feb 10 '21 00:02 fujitatomoya

Is it? I understood it as a cache of nodes and their subs/pubs/services, etc., that should be transparent to the user. But this cache is getting outdated and only restarting the daemon fixes it.

I could understand that it keeps some nodes as "alive" in the cache, as it takes some time of them being unresponsive before eliminating them. But I am starting new nodes and they do not show up on any commands that use the daemon, even after waiting several minutes. I have to restart the daemon or use the --no-daemon flag.
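For reference, a quick check (a sketch added here, not part of the original comment) is to diff the daemon-backed list against a fresh discovery; a non-empty diff means the daemon cache is stale:

diff <(ros2 node list | sort) <(ros2 node list --no-daemon --spin-time 5 | sort)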

v-lopez avatar Feb 10 '21 08:02 v-lopez

Ah, I see. You are saying:

But this cache is getting outdated and only restarting the daemon fixes it.

problem-1: old cache can be seen, and will not be cleaned?

But I am starting new nodes and they do not show up on any commands that use the daemon, even after waiting several minutes.

problem-2: cache does not get updated?

Am I understanding correctly?

fujitatomoya avatar Feb 10 '21 08:02 fujitatomoya

Exactly, I've seen both issues.

problem-1: Cache (daemon) retaining nodes killed long ago.
problem-2: Cache (daemon) not adding new nodes.

I'm trying to find reproducible examples, currently I can make it happen 100% of the time, but on a complex setup involving ros2_control with 2 controllers and launching and stopping navigation2.

There may also be underlying rmw issues causing problem-2, since I've seen that rviz2 would not list the topics from the newly spawned nodes, and even though I haven't looked in depth, I believe rviz2 has 0 relation with ros2cli.

v-lopez avatar Feb 10 '21 08:02 v-lopez

Probably related to https://github.com/ros2/rmw_fastrtps/issues/509.

audrow avatar Mar 04 '21 18:03 audrow

Could it be related to https://github.com/ros2/rmw_fastrtps/pull/514 if the communication is localhost?

fujitatomoya avatar Mar 05 '21 00:03 fujitatomoya

I'm seeing this bug on a project with five nodes, FastRTPS, native Ubuntu install.

I'm using ros2 launch files, everything comes up nicely the first couple of times, but eventually ros2 node list stops seeing all of the nodes (which are definitely running). At the same time, ros2 param stops being able to interact with the hidden nodes, and ros2 topic list stops showing all of the topics.

rqt is a bit weird; there were a few times when it seemed to find a different collection of topics and nodes than the CLI tools.

ros2 daemon stop; ros2 daemon start has saved my day.

BrettRD avatar Mar 09 '21 05:03 BrettRD

@BrettRD

If your problem is related to https://github.com/ros2/rmw_fastrtps/pull/514, it would be really appreciated if you could try the https://github.com/ros2/ros2/tree/foxy branch to check whether you still hit the problem.

fujitatomoya avatar Mar 09 '21 05:03 fujitatomoya

@fujitatomoya I'm currently running ros2 from apt, and this is pretty tedious to replicate with any confidence, so I'd like a sanity check on a procedure.

I'll try the following:

  • rebuild the workspace from scratch (rm -rf install/ build/) using ROS from /opt/ros/foxy/setup.bash
  • reset the ros2 daemon
  • launch and tear down the application a bunch and count how many times it launches before ros2 node list misses nodes

That sets an order-of-magnitude baseline for how long to test the new branch

Then install ROS from source:

  • clear the workspace (rm -rf install/ build/)
  • load a new terminal without ros2 from apt
  • clone the ros2 repos into a folder in src
  • rebuild with colcon (including the ros2 source packages)
  • load the local setup (. install/setup.bash), which should reference the local Foxy latest
  • reset the ros2 daemon
  • repeat the launch and teardown until it drops nodes (confirming it is not fixed) or until I get bored (inconclusive but reassuring)

Does that sound about right?

BrettRD avatar Mar 09 '21 07:03 BrettRD

I think that sounds okay; the whole procedure is described at https://docs.ros.org/en/foxy/Installation/Linux-Development-Setup.html. I usually use an ubuntu:20.04 docker container as the base.

fujitatomoya avatar Mar 09 '21 08:03 fujitatomoya

I have a result! -- Not fixed.

I built from source (55 minutes build time, after tracking down additional deps), and my build does contain ros2/rmw_fastrtps#514. I did not source /opt/ros/foxy/setup.bash, so I'm using foxy latest.

To trigger this bug, I have to SIGINT ros2 launch before all the nodes are up, loading and closing fast enough to see duplicate nodes (which age out normally).

Once this bug is triggered, I can load the same 5-node launch file and ros2 node list will list a random subset of the nodes from the launchfile, but always the same number until you ros2 daemon stop, then everything goes back to normal. Other nodes like rqt and ros2 topic echo are listed fine.

I can retrigger this bug, and the size of the subset gets smaller by one node each time. I can keep triggering it until no nodes from that launch file get listed, and eventually reloading rqt doesn't list them either.

BrettRD avatar Mar 12 '21 05:03 BrettRD

Recently I've met this bug in my project, and here is what I found:

  • This bug still exists in the 20221012 apt version of Foxy (with rmw_fastrtps_cpp).
  • ros2 daemon stop and ros2 daemon start can refresh the node list, but it does not take effect every time; you need to try a couple of times.
  • Without the ros2 daemon stop/start operation, ros2 lifecycle set may return a "node not found" error; this command may depend on the output of ros2 node list.

And I have the questions: @nielsvd @BrettRD @v-lopez

  1. I'm not sure why the rmw could cause this problem; would changing the rmw solve this issue? @fujitatomoya I've found it happen with rmw_cyclonedds in a compiled version: https://github.com/ZhenshengLee/ros2_jetson/issues/10
  2. All of ros2cli depends on rclpy; might using rclcpp be a workaround to bypass this issue?
  3. Has this issue been resolved in a later release of ROS 2, like Galactic or Humble?

ZhenshengLee avatar Nov 14 '22 10:11 ZhenshengLee

I'm not sure why the rmw could cause this problem; would changing the rmw solve this issue?

The discovery protocol is implemented in the RMW implementation, so changing the rmw would solve the problem.
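As a hedged sketch of what switching the rmw looks like in practice (the package name assumes the Foxy apt binaries; the daemon keeps the rmw it was started with, so it has to be restarted as well):

sudo apt install ros-foxy-rmw-cyclonedds-cpp    # assumption: Foxy apt binaries
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
ros2 daemon stop                                # so the daemon restarts with the new rmw on the next query
ros2 node list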

All of ros2cli depends on rclpy; might using rclcpp be a workaround to bypass this issue?

No, I do not think so. As in the previous comment, discovery depends on the underlying rmw implementation.

Has this issue been resolved in a later release of ROS 2, like Galactic or Humble?

I cannot reproduce this issue with my local environment and the rolling branch.

fujitatomoya avatar Nov 14 '22 22:11 fujitatomoya

@fujitatomoya thank you for your quick reply.

The discovery protocol is implemented in the RMW implementation, so changing the rmw would solve the problem.

Thanks for your tips, I will give it a try.

No, I do not think so. As in the previous comment, discovery depends on the underlying rmw implementation.

OK, so rclcpp would not bypass the issue.

I cannot reproduce this issue with my local environment and the rolling branch.

According to @v-lopez, only a complex launch setup triggers this node-list problem:

I'm trying to find reproducible examples, currently I can make it happen 100% of the time, but on a complex setup involving ros2_control with 2 controllers and launching and stopping navigation2.

ZhenshengLee avatar Nov 15 '22 01:11 ZhenshengLee

I have not noticed this bug in Galactic, but I encountered it immediately again when I used Humble. I have seen https://github.com/ZhenshengLee/ros2_jetson/issues/10 in Galactic.

BrettRD avatar Nov 15 '22 04:11 BrettRD

@iuhilnehc-ynos @llapx can you check if we can see this problem with rolling, if you have bandwidth?

I think there is no easy reproduction procedure currently, but we can check with https://github.com/ros2/ros2cli/issues/582#issue-784108824.

fujitatomoya avatar Nov 15 '22 17:11 fujitatomoya

I have not noticed this bug in Galactic, but I encountered it immediately again when I used Humble.

@BrettRD the primary difference between Galactic and Humble/Foxy is the default rmw used.

ZhenshengLee avatar Nov 16 '22 00:11 ZhenshengLee

problem-1: Cache (daemon) retaining nodes killed long ago.
problem-2: Cache (daemon) not adding new nodes.

since I've seen that rviz2 would not list the topics from the newly spawned nodes, and even though I haven't looked in depth, I believe rviz2 has 0 relation with ros2cli.

From my test https://github.com/ros2/ros2cli/issues/779#issuecomment-1315117834 and the comment from @v-lopez, it appears that rviz2 bypasses the node-missing issue.

I believe the root cause is not in the rmw layer, so changing the rmw will not bypass the issue, and rclcpp/rviz2 will not see this problem.

ZhenshengLee avatar Nov 16 '22 02:11 ZhenshengLee

@fujitatomoya

OK, I'll take a look.

llapx avatar Nov 16 '22 05:11 llapx

I have tested it on ros:rolling (docker) and built turtlebot3 and navigation2 from source (ros:rolling does not provide the nav2 packages); after testing many times, it works well.

llapx avatar Nov 18 '22 02:11 llapx

This issue is not easy to reproduce.

But it must still be there, because I can reproduce this issue a few times with rolling (the reproduction steps are similar to https://github.com/ros2/ros2cli/issues/582#issue-784108824). After stopping the ros2 daemon, as in step 2 of https://github.com/ros2/ros2cli/issues/582#issue-784108824, we can immediately get the correct node list.

1. ros2 daemon stop (stop the ros2 daemon if it was running before)
2. ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False
3. ros2 node list | wc -l (31 currently indicates a correct result)
4. ctrl+c to stop step 2, then re-launch it and re-check step 3

Notice that the navigation demo runs well even if the ros2 node list is incorrect.
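For repeated testing, a rough automation of that launch/check/interrupt cycle might look like this sketch (the sleep and iteration count are arbitrary, and nav2_bringup is assumed to be installed):

for i in 1 2 3 4 5; do
  ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False &
  launch_pid=$!
  sleep 60                                  # let the stack come up
  echo "run $i: $(ros2 node list | wc -l) nodes visible via the daemon"
  kill -INT "$launch_pid"                   # equivalent to pressing ctrl+c
  wait "$launch_pid"
done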


iuhilnehc-ynos avatar Nov 18 '22 06:11 iuhilnehc-ynos

  1. I can't reproduce this issue with rmw_cyclonedds_cpp.

  2. For rmw_fastrtps_cpp: since ctrl+c on ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False can't make all processes exit normally, the shared-memory files used by Fast-DDS are not cleaned up successfully. I don't know whether that is the root cause of the ros2 daemon no longer updating node_listener -> rmw_dds_common::GraphCache::update_participant_entities.

  3. Some information about the ros2 daemon:

  • top info of ros2 daemon
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3648025 chenlh    20   0  667912  79412  47136 R  99.7   0.2   4:02.62 python3       # almost 100% CPU usage
3648022 chenlh    20   0  667912  79412  47136 S   0.3   0.2   0:03.56 python3
3647989 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.40 python3
3648019 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648020 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648021 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.01 python3
3648023 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.08 python3
3648024 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648026 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648027 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.05 python3
3648028 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.00 python3
3648029 chenlh    20   0  667912  79412  47136 S   0.0   0.2   0:00.02 python3
  • thread info of ros2 daemon

Thread 3648025 corresponds to Id 8:

(gdb) info thread
  Id   Target Id                                     Frame 
* 1    Thread 0x7faf51f801c0 (LWP 3647989) "python3" 0x00007faf52099d7f in __GI___poll (fds=0x7faf513bbae0, nfds=1, timeout=7200000)
    at ../sysdeps/unix/sysv/linux/poll.c:29
  2    Thread 0x7faf4c282640 (LWP 3648019) "python3" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, 
    expected=0, futex_word=0x7faf50ceb000 <(anonymous namespace)::g_signal_handler_sem>) at ./nptl/futex-internal.c:57
  3    Thread 0x7faf4ba81640 (LWP 3648020) "python3" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7faf4ba80de0, op=137, 
    expected=0, futex_word=0x55e32f872ae0) at ./nptl/futex-internal.c:57
  4    Thread 0x7faf4b280640 (LWP 3648021) "python3" __futex_abstimed_wait_common64 (private=290346745, cancel=true, abstime=0x7faf4b27fc10, op=137, 
    expected=0, futex_word=0x55e32feb7760) at ./nptl/futex-internal.c:57
  5    Thread 0x7faf4a9f8640 (LWP 3648022) "python3" __futex_abstimed_wait_common64 (private=1326168272, cancel=true, abstime=0x7faf4a9f7c10, op=137, 
    expected=0, futex_word=0x55e32ff19bcc) at ./nptl/futex-internal.c:57
  6    Thread 0x7faf4a1f7640 (LWP 3648023) "python3" 0x00007faf520a8934 in __libc_recvfrom (fd=17, buf=0x55e32ff1c570, len=65500, flags=0, addr=..., 
    addrlen=0x7faf4a1f6a0c) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
  7    Thread 0x7faf499f6640 (LWP 3648024) "python3" 0x00007faf520a8934 in __libc_recvfrom (fd=18, buf=0x55e32ff2cd90, len=65500, flags=0, addr=..., 
    addrlen=0x7faf499f5a0c) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
  8    Thread 0x7faf491e8640 (LWP 3648025) "python3" 0x00007faf500de664 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
  9    Thread 0x7faf489e7640 (LWP 3648026) "python3" 0x00007faf520a8934 in __libc_recvfrom (fd=20, buf=0x55e32ff40070, len=65500, flags=0, addr=..., 
    addrlen=0x7faf489e6a0c) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
  10   Thread 0x7faf481d9640 (LWP 3648027) "python3" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7faf481d8940, 
    op=265, expected=0, futex_word=0x7faf470c9110) at ./nptl/futex-internal.c:57
  11   Thread 0x7faf478f8640 (LWP 3648028) "python3" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x55e32ff54a28) at ./nptl/futex-internal.c:57
  12   Thread 0x7faf46d57640 (LWP 3648029) "python3" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7faf30000c04) at ./nptl/futex-internal.c:57

The backtrace for thread Id 8:

(gdb) thread 8
[Switching to thread 8 (Thread 0x7faf491e8640 (LWP 3648025))]
#0  0x00007faf500df636 in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) bt
#0  0x00007faf500df636 in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#1  0x00007faf4f6b4163 in eprosima::fastdds::rtps::SharedMemManager::find_segment (this=0x55e32fd29aa0, id=...)
    at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemManager.hpp:1282
#2  0x00007faf4f6b22f1 in eprosima::fastdds::rtps::SharedMemManager::Listener::pop (this=0x55e32ff2ccf0)
    at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemManager.hpp:711
#3  0x00007faf4f6b58fb in eprosima::fastdds::rtps::SharedMemChannelResource::Receive (this=0x55e32fe3b100, remote_locator=...)
    at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemChannelResource.hpp:182
#4  0x00007faf4f6b556e in eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation (this=0x55e32fe3b100, input_locator=...)
    at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemChannelResource.hpp:133
#5  0x00007faf4f6d0579 in std::__invoke_impl<void, void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> (
    __f=@0x55e32ff3fa78: (void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastdds::rtps::SharedMemChannelResource * const, eprosima::fastrtps::rtps::Locator_t)) 0x7faf4f6b54e4 <eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation(eprosima::fastrtps::rtps::Locator_t)>, __t=@0x55e32ff3fa70: 0x55e32fe3b100) at /usr/include/c++/11/bits/invoke.h:74
#6  0x00007faf4f6d00e2 in std::__invoke<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> (
    __fn=@0x55e32ff3fa78: (void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastdds::rtps::SharedMemChannelResource * const, eprosima::fastrtps::rtps::Locator_t)) 0x7faf4f6b54e4 <eprosima::fastdds::rtps::SharedMemChannelResource::perform_listen_operation(eprosima::fastrtps::rtps::Locator_t)>) at /usr/include/c++/11/bits/invoke.h:96
#7  0x00007faf4f6cfeb3 in std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> >::_M_invoke<0ul, 1ul, 2ul> (this=0x55e32ff3fa58)
    at /usr/include/c++/11/bits/std_thread.h:253
#8  0x00007faf4f6cf952 in std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> >::operator() (this=0x55e32ff3fa58)
    at /usr/include/c++/11/bits/std_thread.h:260
#9  0x00007faf4f6cf218 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::SharedMemChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::SharedMemChannelResource*, eprosima::fastrtps::rtps::Locator_t> > >::_M_run (this=0x55e32ff3fa50)
    at /usr/include/c++/11/bits/std_thread.h:211
#10 0x00007faf501c42b3 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007faf52015b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007faf520a7a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

https://github.com/eProsima/Fast-DDS/blob/7e12e8fe2cebf27c621263fa544f94b099504808/src/cpp/rtps/transport/shared_mem/SharedMemChannelResource.hpp#L128-L136

    void perform_listen_operation(
            Locator input_locator)
    {
        Locator remote_locator;

        while (alive())
        {
            // Blocking receive.
            std::shared_ptr<SharedMemManager::Buffer> message;

            if (!(message = Receive(remote_locator)))
                            // ^ expected: `Receive` blocks when there is no data; instead it returns a nullptr message and the loop retries again and again.
            {
                continue;
            }

Receive fails to pop the message because find_segment throws an exception inside.

I don't know whether it's a bug or not because I can't reproduce this issue the first time after clearing the related shm files /dev/shm/*fastrtps*.
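For context, the leftover segments can be inspected and cleaned roughly like this (a sketch; fastdds shm clean removes only zombie segments, so segments still held open by a running process such as the ros2 daemon remain, which matches the later observation that the daemon has to be stopped):

ls /dev/shm/ | grep fastrtps    # data/port segments left behind by Fast DDS
fastdds shm clean               # Fast DDS CLI: remove zombie segments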

iuhilnehc-ynos avatar Nov 21 '22 10:11 iuhilnehc-ynos

Could be related to https://github.com/eProsima/Fast-DDS/issues/2790.

fujitatomoya avatar Dec 01 '22 21:12 fujitatomoya

@iuhilnehc-ynos a couple of questions.

can't make all processes exit normally

Can you point out which node or processes cannot exit normally? Are they receiving an exception, or is it a core crash?

I can't reproduce this issue the first time after clearing the related shm files /dev/shm/*fastrtps*.

I think this is a good step that we have found.

  • Is it always the same node that cannot be listed, or a random node?
  • If we add a fastdds shm clean step to this procedure, does the problem go away?

fujitatomoya avatar Dec 01 '22 21:12 fujitatomoya

Can you point out which node or processes cannot exit normally? Are they receiving an exception, or is it a core crash?

Pressing ctrl+c on ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False behaves differently each time, but most of the errors come from rviz2 and component_container_isolated, which might be killed by ros2 launch.

Is it always the same node that cannot be listed, or a random node?

It shows a random node list, but once the issue happens, the node list stays almost the same as the previous one while running tb3_simulation_launch.py again; only some node names with new IDs are refreshed, such as the launch node /launch_ros_{a_new_pid}.

  • If we add a fastdds shm clean step to this procedure, does the problem go away?

No. I tried fastdds shm clean, but it is not enough, because shared-memory files used for data communication are still held open by the ros2 daemon node. I must stop the ros2 daemon.

BTW: I think it's not difficult to reproduce this issue. Don't be gentle with tb3_simulation_launch.py (press ctrl+c at any time to stop it, then rerun it immediately). I have confirmed this issue with both Humble and Rolling.

iuhilnehc-ynos avatar Dec 02 '22 06:12 iuhilnehc-ynos

I hope you guys can reproduce this issue on your machines; otherwise, nobody can help confirm it even if I have a workaround patch :smile:.

iuhilnehc-ynos avatar Dec 02 '22 09:12 iuhilnehc-ynos

@JLBuenoLopez-eProsima @MiguelCompany any thoughts? I believe it is clear that the shared-memory files or caches used by the ros2 daemon are related to the issue.

fujitatomoya avatar Dec 02 '22 17:12 fujitatomoya

I had an issue calling ros2 node list from another terminal using a Python script. On occasion, there would be missing nodes on the first call, but subsequent calls would populate the node list correctly.

I tried other methods, such as stopping and restarting the daemon, and that seemed to work, but I felt apprehensive about that workaround as I don't fully understand the consequences. What I found worked was adding the --spin-time parameter to the call: ros2 node list --spin-time 5. That always seemed to populate the node list correctly. I hope this helps others.

What does --spin-time do?

--spin-time SPIN_TIME Spin time in seconds to wait for discovery (only applies when not using an already running daemon)

billyliuschill avatar Mar 13 '23 18:03 billyliuschill

I tried other methods, such as stopping and restarting the daemon, and that seemed to work, but I felt apprehensive about that workaround as I don't fully understand the consequences.

The downside could be discovery time for any other nodes running on that host system. The daemon caches and advertises the ROS 2 network graph, so while the daemon is running, other ROS 2 nodes on the same host can query it for connectivity without waiting for the entire discovery.

What does --spin-time do?

We can use this option to wait for the ROS 2 network graph to be updated until the specified timeout expires, but it only takes effect when the daemon is not running or the --no-daemon option is specified.
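Putting the two answers together, a small sketch of the options for forcing a fresh graph query (the 5-second timeout is arbitrary):

# option 1: bypass the (possibly stale) daemon for a single query
ros2 node list --no-daemon --spin-time 5

# option 2: restart the daemon so its cache is rebuilt
ros2 daemon stop && ros2 daemon start
ros2 node list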

fujitatomoya avatar Mar 13 '23 20:03 fujitatomoya