ros2_control
Handle on waiting
That was easier than anticipated, fixes: #1482 and #1200
Note: I didn't touch the other except KeyboardInterrupt: statement, as it actually works once controllers are successfully loaded.
Does it make sense to add this to the unspawner script as well?
Good question! That doesn't have the same check in it.
Come to think of it, since asking for permission instead of forgiveness is an antipattern (and actually breaks things: https://github.com/ros-controls/ros2_control/issues/1200), I would opt for modifying this PR so that spawner and unspawner both retry-and-log instead of the more expensive check-and-try.
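To illustrate the distinction, here is a minimal sketch of the retry-and-log approach, with a hypothetical `probe` callable standing in for the actual service call (this is not the spawner's real code):

```python
import logging
import time

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("spawner_sketch")


def retry_and_log(probe, attempts=5, delay=0.0):
    """Retry-and-log: just attempt the call and warn on each failure,
    instead of doing a separate (and racy) existence check beforehand.

    `probe` is a hypothetical callable standing in for the service call;
    it returns True on success and False otherwise.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        logger.warning("Could not contact service (attempt %d/%d)",
                       attempt, attempts)
        time.sleep(delay)
    return False


# Usage: a probe that succeeds on the third try, emulating a controller
# manager that comes up a bit late.
calls = {"n": 0}

def late_service():
    calls["n"] += 1
    return calls["n"] >= 3

print(retry_and_log(late_service))  # True once the service "appears"
```

The point is that there is no time-of-check-to-time-of-use window: the call itself is the check.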
I need some assistance here. CI is failing on:
Could not find a package configuration file provided by "test_msgs" with
any of the following names:
test_msgsConfig.cmake
But I didn't change anything related to test_msgs here?
This was some issue with the rolling CI due to the transition to Ubuntu noble. The issues related to your changes are
[ RUN ] TestLoadController.spawner_without_manager_errors
...
[WARN] [1713556932.603576757] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers
[WARN] [1713556942.609051609] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers
[WARN] [1713556952.614319281] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers
[WARN] [1713556962.619689181] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers
[WARN] [1713556972.625026280] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers
I didn't have a massive amount of time to look into things the last couple of days but I wanted to finally fix #1182 today. My result is in #1501. This PR basically solves the first problem from #1182, as well, so +1 from me. But seeing the
Could not contact service /controller_manager/list_controllers
indicates that there is still something missing.
Edit: Actually, looking at the test it is obvious that this fails since this PR explicitly changes the spawner not to fail if it can't find the CM while the test expects it to fail. If that's the desired behavior that test probably simply has to be removed.
I dropped the test, as it was only checking that invalid inputs produced no result. It could be rewritten to check whether a warning was issued, but that would mean parsing stdout or stderr in tests, which is also a bit frowned upon.
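For reference, if a warning-based test were ever wanted, Python's unittest can capture logger records directly, without parsing stdout/stderr. A sketch under assumptions: the logger name, message, and `spawn_with_warning` helper below are illustrative stand-ins, not the spawner's actual code:

```python
import logging
import unittest


def spawn_with_warning():
    # Hypothetical stand-in for the spawner path that now warns instead
    # of failing when the controller manager is unavailable.
    logging.getLogger("spawner_ctrl_1").warning(
        "Could not contact service /controller_manager/list_controllers"
    )
    return None  # no result, as the dropped test expected


class TestSpawnerWarning(unittest.TestCase):
    def test_warns_when_cm_unavailable(self):
        # assertLogs captures records from the named logger, so no
        # stream parsing is needed.
        with self.assertLogs("spawner_ctrl_1", level="WARNING") as captured:
            spawn_with_warning()
        self.assertIn("Could not contact service", captured.output[0])
```

Run with `python -m unittest`. Whether this is worth keeping in the test suite is a separate question; it only shows the mechanism.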
Discussing in the working group today, we agreed on the following things:
- The spawners should allow setting a definite timeout in order to make launchfiles actually fail if there's something wrong. Regarding the initial service wait we could use my proposal from https://github.com/ros-controls/ros2_control/pull/1501/commits/e25ea4fe862844810ec89182bc9d5edfe9c8f6e9. With that users can still define a timeout of 0 in order to make it wait indefinitely. This way this PR would also be directly backportable to Iron and Humble.
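The agreed timeout semantics can be sketched as follows. This is a simplified illustration, not the proposal from #1501: `wait_for_service_ready` and `is_ready` are hypothetical names, and the real implementation would use the ROS 2 service client API:

```python
import time


def wait_for_service_ready(is_ready, timeout):
    """Illustrative helper for the agreed semantics:
    timeout > 0  -> give up after `timeout` seconds, so a launch file
                    can actually fail when something is wrong;
    timeout == 0 -> wait indefinitely (retry forever).

    `is_ready` is a hypothetical stand-in for the actual service check.
    """
    deadline = None if timeout == 0 else time.monotonic() + timeout
    while not is_ready():
        if deadline is not None and time.monotonic() >= deadline:
            return False  # definite failure: the caller can exit non-zero
        time.sleep(0.01)
    return True


# Usage: a finite timeout against a service that never appears returns
# False instead of hanging, while timeout=0 waits until ready.
print(wait_for_service_ready(lambda: False, timeout=0.05))  # False
print(wait_for_service_ready(lambda: True, timeout=0))      # True
```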
- We'll first merge this and then I'll adjust #1501 to only contain the missing changes.
@Timple Would you like to update this PR according to the first point please?
Regarding backporting, keeping the original timeout with the option of setting it to indefinitely makes sense!
However, I strongly prefer the indefinite wait to be the default in future releases. We've had flaky simulation and hardware startups because the default timeout was close to the boot time of the nodes, and flaky behavior seems the worst of all possible scenarios. Would that be alright?
I think @bmagyar might have an opinion on that.
In the end that boils down to the default value of the timeout. I agree that a larger default timeout than 10 seconds might make a lot of sense, but in the spirit of making it fail by default and not leave launch files hanging indefinitely it might be good to still have one.
Yes, but defaults are very important. I fail to grasp how a node printing once and failing amid the startup noise is a clearer signal than an error/warning being printed repeatedly.
The only scenario I can think of is some kind of catch mechanism that restarts or reboots. But the result will very likely be exactly the same, just with much harder debugging, as everything keeps restarting and printing a lot.
If I'm missing a scenario here, or proper scenarios were discussed at the working group, let me know!
One scenario is having launch files in an autostart job, e.g. a systemd unit, where the shell output will never be seen by the user. In this case a failed unit is much more obvious to find than a unit repeating a log output.
That's a valid use case, provided the logs of the autostart jobs are evaluated before the ROS logs.
Although I'd rather treat that specific case as the exception than accept flaky simulation and hardware startups.
We should keep the timeout as it was and just add a new option here.
This pull request is in conflict. Could you fix it @Timple?
Closing in favour of #1562