ros2_control icon indicating copy to clipboard operation
ros2_control copied to clipboard

Handle on waiting

Open Timple opened this issue 1 year ago • 14 comments

That was easier than anticipated, fixes: #1482 and #1200

Timple avatar Apr 16 '24 09:04 Timple

Note, I didn't touch the other except KeyboardInterrupt: statement as it actually does work when controllers are successfully loaded.

Timple avatar Apr 16 '24 09:04 Timple

Does it make sense to add this to the unspawner script as well?

christophfroehlich avatar Apr 16 '24 09:04 christophfroehlich

Good question! That doesn't have the same check in it.

Coming to think of this, since asking for permission instead of forgiveness is an antipattern (and actually breaks stuff: https://github.com/ros-controls/ros2_control/issues/1200).

I would opt for modifying this PR to align spawner and unspawner to both retry-and-log instead of the more expensive check-and-try.

Timple avatar Apr 16 '24 12:04 Timple

I need some assistance here. CI is failing on:

    Could not find a package configuration file provided by "test_msgs" with
    any of the following names:
  
      test_msgsConfig.cmake

But I didn't change anything related to test_msgs here?

Timple avatar Apr 19 '24 08:04 Timple

But I didn't change anything related to test_msgs here?

This was some issue with the rolling CI due to the transition to Ubuntu noble. The issues related to your changes are

[ RUN      ] TestLoadController.spawner_without_manager_errors
...
[WARN] [1713556932.603576757] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers
[WARN] [1713556942.609051609] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers
[WARN] [1713556952.614319281] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers
[WARN] [1713556962.619689181] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers
[WARN] [1713556972.625026280] [spawner_ctrl_1]: Could not contact service /controller_manager/list_controllers

christophfroehlich avatar Apr 19 '24 20:04 christophfroehlich

I didn't have a massive amount of time to look into things the last couple of days but I wanted to finally fix #1182 today. My result is in #1501. This PR basically solves the first problem from #1182, as well, so +1 from me. But seeing the

Could not contact service /controller_manager/list_controllers

indicates that there is still something missing.

Edit: Actually, looking at the test it is obvious that this fails since this PR explicitly changes the spawner not to fail if it can't find the CM while the test expects it to fail. If that's the desired behavior that test probably simply has to be removed.

fmauch avatar Apr 23 '24 11:04 fmauch

I dropped the test. As it was testing if false inputs gave no result. It could be written if a warning was issued, but that would mean parsing stdout or stderr in tests. Which is also a bit frowned upon.

Timple avatar Apr 23 '24 17:04 Timple

Discussing in the working group today, we agreed on the following things:

  • The spawners should allow setting a definite timeout in order to make launchfiles actually fail if there's something wrong. Regarding the initial service wait we could use my proposal from https://github.com/ros-controls/ros2_control/pull/1501/commits/e25ea4fe862844810ec89182bc9d5edfe9c8f6e9. With that users can still define a timeout of 0 in order to make it wait indefinitely. This way this PR would also be directly backportable to Iron and Humble.
  • We'll first merge this and then I'll adjust #1501 to only contain the missing changes.

@Timple Would you like to update this PR according to the first point please?

fmauch avatar Apr 24 '24 18:04 fmauch

Regarding backporting, keeping the original timeout with the option of setting it to indefinitely makes sense!

However, I strongly opt for this to be the default in future releases. We've had flaky simulation and hardware startups because the default timeout was close to the boot time of the nodes. Flaky behavior seems the worst of all possible scenarios. Would this be alright?

Timple avatar Apr 24 '24 19:04 Timple

I think @bmagyar might have an opinion on that.

In the end that boils down to the default value of the timeout. I agree that a larger default timeout than 10 seconds might make a lot of sense, but in the spirit of making it fail by default and not leave launch files hanging indefinitely it might be good to still have one.

fmauch avatar Apr 24 '24 19:04 fmauch

Yes, but defaults are very important. I fail to grasp how a node printing once and failing in the startup noise is a more clear message than an error/warning being repeatedly printed.

Only scenario I can think of is some kind of catch mechanism that restarts/reboots. But very likely the result will be exactly the same, but debugging much harder as everything keeps restarting and printing a lot.

If I'm missing a scenario here, or proper scenarios were discussed at the working group, let me know!

Timple avatar Apr 24 '24 19:04 Timple

One scenario is having launchfiles in an autostart job, e.g. a systemd unit where the shell output will never been seen by the user. In this case a failed unit is much more obvious to find than a unit repeating a log output.

fmauch avatar Apr 25 '24 10:04 fmauch

That's a valid use case. If the logs of the autostart jobs are evaluated before the ROS logs.

Although I'd rather have that specific case as exception than flaky simulation and hardware startups.

Timple avatar Apr 25 '24 15:04 Timple

We should keep timeout at it was and just add new option here.

destogl avatar May 08 '24 17:05 destogl

This pull request is in conflict. Could you fix it @Timple?

mergify[bot] avatar Jun 05 '24 17:06 mergify[bot]

Closing in favour of #1562

bmagyar avatar Jun 05 '24 17:06 bmagyar