rclcpp icon indicating copy to clipboard operation
rclcpp copied to clipboard

Asynchronously shutting down while using `rclcpp::spin` `rclcpp:spin_some` in another thread throws an exception

Open ivanpauno opened this issue 4 years ago • 36 comments

This nightly failure is an example of the problem:

  • https://ci.ros2.org/view/nightly/job/nightly_osx_repeated/1968/testReport/junit/(root)/projectroot/test_lifecycle_service_client/

What is happening is that one thread is creating an executor, while other is trying to shutdown. If shutdown is called before the creation of the Executor, an exception is thrown:

https://github.com/ros2/rclcpp/blob/87bb9f9758dce239d37256ae91d58ce6325f405b/rclcpp_lifecycle/test/test_lifecycle_service_client.cpp#L218-L231

ivanpauno avatar May 27 '20 19:05 ivanpauno

@hidmic @wjwwood @jacobperron does this look like an user error (i.e. the test is not well written) or should be improve how shutdown works?

ivanpauno avatar May 27 '20 19:05 ivanpauno

I'd say user error.

hidmic avatar May 27 '20 20:05 hidmic

Seems like user error; calling shutdown before spin. What's an alternative, have spin return if the context is invalid?

jacobperron avatar May 27 '20 21:05 jacobperron

Seems like user error; calling shutdown before spin. What's an alternative, have spin return if the context is invalid?

The thing is that spin_some is called in a loop, "protected" in a rclcpp::ok check. If the context is shutdown in the middle of the call to rclcpp::ok and the construction of the executor in rclcpp::spin_some, then the error happens.

IMO, it's a behavior error, and not an user one (I would expect that if I do something after checking for rclcpp::ok, there should be no problem).

IMO shutdown shouldn't affect executors/publishers/subscriptions/nodes/... etc, it should only make rclcpp::ok() return false and wake up executors.

In that way, we could shutdown asynchronously without this error (or other bunch of possible of similar errors).

ivanpauno avatar May 27 '20 21:05 ivanpauno

I also would classify it as "user error", but not necessarily the user's fault. It's confusing and so the existing story could be improved.

This has been discussed in the past:

  • https://github.com/ros2/rclcpp/issues/812
  • https://github.com/ros2/rclcpp/issues/610
  • https://github.com/ros2/ros2/issues/369#issuecomment-518779552

And perhaps others, though I couldn't find them. I know @sloretz and I spoke about it in person at length around the time https://github.com/ros2/rmw/pull/154 was merged.

And this might be a duplicate?

  • https://github.com/ros2/rclcpp/issues/1111

Originally I thought about shutdown like KeyboardInterrupt in Python, but since this is C++ and the user isn't always inside on of our functions we cannot ensure it is raised everywhere. Instead we have (some of ) our functions raise and then the user can check rclcpp::ok() to know when to interrupt their own work.

The thinking was that if the user wrote something like this:

while (true) {
  rclcpp::spin_some(...);
}

Then they wouldn't get stuck in a loop due the mistake (imo) of not checking rclcpp::ok(). I think this fail fast approach will result in the fewest unexplained hangs for users.

However, that means to avoid tracebacks and crashes they need to catch and handle these errors (like you do with KeyboardInterrupt in Python), and it's also hard to know which things will work and which will not. For example, publishing will work, and destroying a publisher will work, but making a new publisher will not. This was essentially a compromise to enable things like "during shutdown I might want to send a message (publish) to notify someone", for example.

The rules not being clear on this is something that Apex folks struggled with, and it's clearly been an issue for us and our users. So I'd be open to reevaluating this decision. We could instead say that shutdown changes nothing but a flag in the context, and everything continues as normal and it's up to the user to notice and react to it.

wjwwood avatar May 27 '20 23:05 wjwwood

IMO, it's a behavior error, and not an user one (I would expect that if I do something after checking for rclcpp::ok, there should be no problem).

This is a bad assumption, imo. It's like acquiring a lock, releasing it, and then assuming afterwards the resource is safe. I understand how it's an easy mistake to make, but imo it is a mistake. rclcpp::ok() just checks a state, it doesn't hold a resource or block other actions, and it doesn't indicate that to be the case.

wjwwood avatar May 27 '20 23:05 wjwwood

Transferred the issue to rclcpp, as I opened it in rmw by mistake.

ivanpauno avatar May 28 '20 12:05 ivanpauno

This has been discussed in the past:

  • https://github.com/ros2/rclcpp/issues/812
  • https://github.com/ros2/rclcpp/issues/610
  • https://github.com/ros2/ros2/issues/369#issuecomment-518779552

And this might be a duplicate?

  • https://github.com/ros2/rclcpp/issues/1111

Thanks for pointing out the duplicates. I will try to consolidate all of them in a new issue --or edit an old one--, and close the others.

Then they wouldn't get stuck in a loop due the mistake (imo) of not checking rclcpp::ok(). I think this fail fast approach will result in the fewest unexplained hangs for users.

However, that means to avoid tracebacks and crashes they need to catch and handle these errors (like you do with KeyboardInterrupt in Python), and it's also hard to know which things will work and which will not. For example, publishing will work, and destroying a publisher will work, but making a new publisher will not. This was essentially a compromise to enable things like "during shutdown I might want to send a message (publish) to notify someone", for example.

The rules not being clear on this is something that Apex folks struggled with, and it's clearly been an issue for us and our users. So I'd be open to reevaluating this decision. We could instead say that shutdown changes nothing but a flag in the context, and everything continues as normal and it's up to the user to notice and react to it.

I think that both "exception like" and "notification like" behaviors are acceptable and have their pro/cons. But IMO, we're doing neither of both correctly. If we prefer throwing an exception when the user tries "doing something" after shutdown, I think we should everywhere throw the same exception. e.g.: In the case I pointed out, we're just throwing an RCLError with a custom message, how do I distinguish that from an actual error? I also think that we should add the try/catch in our examples/demos.

ivanpauno avatar May 28 '20 12:05 ivanpauno

@ivanpauno

I also think that we should add the try/catch in our examples/demos.

Just FYI, I think that I already did it with https://github.com/ros2/examples/pull/270.

fujitatomoya avatar May 29 '20 07:05 fujitatomoya

@ivanpauno

I also think that we should add the try/catch in our examples/demos.

Just FYI, I think that I already did it with ros2/examples#270.

That's exactly what I don't like of how shutdown works right now, you only can try/catch a generic RCLError. There should be a rclcpp::exceptions::Shutdown error (or similar).

ivanpauno avatar May 29 '20 13:05 ivanpauno

We already have in rcl return values like RCL_RET_NODE_INVALID, RCL_RET_PUBLISHER_INVALID, etc.

After shutdown, you usually get one of those errors (e.g. https://github.com/ros2/rcl/blob/069d1f07290019b62a8297c5f95a80b58e63682a/rcl/src/rcl/publisher.c#L286). Can we add a RCL_RET_IS_SHUTDOWN or RCL_RET_CONTEXT_INVALID in rcl and return that when appropriate?

We could then add in rclcpp a ShutdownError derived from RCLError and make sure we throw that one in throw_from_rcl_error. In rclpy, we should make sure to always raise KeybordInterrupt in those cases.

I think that with those changes, we will be able to handle an asynchronous shutdown gracefully.

This approach requires careful reviewing of all functions in rcl layer, to make sure they are returning the appropriate error (and careful handling of those errors in the upper client layer).

Note: It's impossible to recognize an already shutdown context of one that was never init, that should be clear in error messages.

ivanpauno avatar Jun 22 '20 14:06 ivanpauno

i think that is feasible. one question, what is the difference between RCL_RET_IS_SHUTDOWN and RCL_RET_CONTEXT_INVALID? i was thinking RCL_RET_CONTEXT_INVALID is equal to RCL_RET_IS_SHUTDOWN. we should add specific state that tells us if shutdown is called or not? could you enlighten me a little bit?

fujitatomoya avatar Jun 23 '20 04:06 fujitatomoya

i think that is feasible. one question, what is the difference between RCL_RET_IS_SHUTDOWN and RCL_RET_CONTEXT_INVALID?

Sorry, I think my comment wasn't clear.

Those are two names suggestions for the same new return code. RCL_RET_CONTEXT_INVALID is named similarly to the other return codes. RCL_RET_IS_SHUTDOWN is a more clear name for this case IMO.

ivanpauno avatar Jun 23 '20 14:06 ivanpauno

@ivanpauno

thanks for the clarification, we will take care of this.

fujitatomoya avatar Jun 24 '20 01:06 fujitatomoya

@ivanpauno

thanks for the clarification, we will take care of this.

Sorry, but up to now I'm the only one in favor of that change, we need to settle in what we want first. I would like to hear other opinions before starting a change that may be rejected.

@wjwwood @hidmic @jacobperron Does a custom return code in rcl and a custom exception in rclcpp for shutdown sound like a good idea to solve the issue? Do you have any other alternative?

ivanpauno avatar Jun 24 '20 14:06 ivanpauno

Introducing a new common rclcpp exception to make things consistent sounds like an incremental improvement to me +1

As a drastically different alternative, we could remove the exception and have a notification based approach by having spin return when it encounters an error (it could return a bool or int code). I don't know how technically feasible this is, e.g.

while (rclcpp::spin_some()) {
  // ...
}

And existing code written like the following should still work (but without exceptions):

while (rclcpp::ok()) {
  rclcpp::spin_some(...);
}

Of course, as @wjwwood points out, users can still make the error of writing something like

while (true) {
  rclcpp::spin_some(...);
}

But, we can warn them about not using the return code at compile time at least.

IMO, having spin return silently (or with a return code) on a shutdown would match user expectation more than throwing an exception. Especially, as our examples are written using rclcpp::ok().

jacobperron avatar Jun 24 '20 18:06 jacobperron

As a drastically different alternative, we could remove the exception and have a notification based approach by having spin return when it encounters an error (it could return a bool or int code). I don't know how technically feasible this is, e.g.

I think that the notification based approach is fine. It could even be configurable in InitOptions (I'm not sure if that's a good idea or not).

IMHO, changing spin* signatures won't help. Our loops currently look like:

while (rclcpp::ok()) {
  ...
  rclcpp::spin*(...);
  ...
}

That is perfectly suitable for a notification based approach, having a return value in spin* functions won't add much value. If spin* functions get notified and stop handling new work, at some point you will reach rclcpp::ok again. You can even have extra rclcpp::ok checks in the middle of the loop to break it if that's suitable to your application.

ivanpauno avatar Jun 24 '20 19:06 ivanpauno

We could instead say that shutdown changes nothing but a flag in the context, and everything continues as normal and it's up to the user to notice and react to it.

Quoting myself, I think this might be the right path to take. Just make it so that shutdown does not affect the function of anything, it simply changes the state of rclcpp::Context::ok()/rclcpp::ok().

This runs the risk of the user not noticing shutdown, but we can't keep them from all pitfalls.

wjwwood avatar Jun 25 '20 17:06 wjwwood

Quoting myself, I think this might be the right path to take. Just make it so that shutdown does not affect the function of anything, it simply changes the state of rclcpp::Context::ok()/rclcpp::ok().

This runs the risk of the user not noticing shutdown, but we can't keep them from all pitfalls.

:+1: I like that option.

ivanpauno avatar Jun 25 '20 17:06 ivanpauno

SGTM

jacobperron avatar Jun 25 '20 22:06 jacobperron

i guess we got to the consensus.

fujitatomoya avatar Jun 28 '20 20:06 fujitatomoya

i guess we got to the consensus.

Yeap, I will check in tomorrow's meeting if everybody agrees.

Doing this change will involve coordinated work in rcl, rclpy and rclcpp to make sure that shutdown doesn't do more than triggering guard conditions once, invalidating the context, and calling shutdown callbacks. The only place where we should check if a context is valid or not is in rclcpp::ok(), rclpy.ok() (e.g. publish should continue working, etc).

Getting the refactor right might be a bit tricky, if you're planning to do I can help with the reviews.

ivanpauno avatar Jun 29 '20 20:06 ivanpauno

I think we should consider and address the bigger picture: how should a user perform multiple init / shutdown cycles cleanly? Atm the shutdown() call isn't blocking and you can't call init() right afterwards without being subject to a race or warnings like "logging already initialized".

dirk-thomas avatar Jun 30 '20 17:06 dirk-thomas

shutdown() call isn't blocking and you can't call init() right afterwards without being subject to a race or warnings like "logging already initialized".

i think that we can, and shutdown() call blocks with https://github.com/ros2/rclcpp/blob/a8cd93623964a089ee132d65e8b5eda84c15740d/rclcpp/include/rclcpp/context.hpp#L345

maybe i am misunderstanding, could you elaborate?

fujitatomoya avatar Jul 01 '20 00:07 fujitatomoya

I think we should consider and address the bigger picture: how should a user perform multiple init / shutdown cycles cleanly?

It should be safe to do it after rclcpp::ok() returns false.

Atm the shutdown() call isn't blocking and you can't call init() right afterwards without being subject to a race or warnings like "logging already initialized".

We're currently finishing loggers at shutdown (see here), and I think that part of the idea is to stop doing that and doing it when the context is destructed. If we do that change, init should make sure to handle the problem correctly (it's possible with some extra lines here).

We could also provide a reinit method that takes no arguments. It will make sure that the context is valid again, and it uses the same init options passed before (i.e. it only switches a flag).

examples:

rclcpp::init(argc, argv, init_options);
while (rclcpp::ok()) {
  executor.spin_all(max_duration);
  ... handle more work here
}
rclcpp::reinit();
auto context = rclcpp::Context::make_shared();
context->init(argc, argv, init_options);
while (context->is_valid()) {
  executor.spin_all(max_duration);
  ... handle more work here
}
context->init(other_argc, other_argv, other_init_options);

We should make sure both examples above work.


If the user is trying to call rclcpp::Context::init from more than one thread, it will be a problem. But I don't think we should provide such thread safety guarantee.

ivanpauno avatar Jul 01 '20 14:07 ivanpauno

It should be safe to do it after rclcpp::ok() returns false.

It's not though... It's only safe to do that after the context is destructed right?

I haven't tried it recently, but you should be able to init/shutdown contexts in parallel as long as you're using different ones each time. Maybe the global logging setup doesn't make that work anymore.

wjwwood avatar Jul 01 '20 15:07 wjwwood

It's not though... It's only safe to do that after the context is destructed right?

I think it's currently safe, as shutdown is finishing the logger and init will be able to setup it again correctly (supposing only one context). If we move logging finishing to context destruction, we will need to handle that case in init.

I haven't tried it recently, but you should be able to init/shutdown contexts in parallel as long as you're using different ones each time. Maybe the global logging setup doesn't make that work anymore.

It should work if one context initializes the logging system and the other doesn't. I think it would be great to move the global context configuration to one per Context, but that doesn't seem to be exactly related (I will create an issue and add it to the Galactic project board).


I think we should add test for all these cases when doing the change, so it will be clear what works and what doesn't. Both cases can conceptually work correctly with the change in shutdown behavior.

ivanpauno avatar Jul 01 '20 16:07 ivanpauno

Although not required, it would be nice if the solution to this issue (or version of it) could be backported to Foxy.

jacobperron avatar Jul 10 '20 20:07 jacobperron

A few comments here and there:

If the context is shutdown in the middle of the call to rclcpp::ok and the construction of the executor in rclcpp::spin_some, then the error happens.

IMO, it's a behavior error, and not an user one (I would expect that if I do something after checking for rclcpp::ok, there should be no problem).

I see. It's a valid point. Perhaps spinning a entity associated with a shutdown context shouldn't throw an exception?

Quoting myself, I think this might be the right path to take. Just make it so that shutdown does not affect the function of anything, it simply changes the state of rclcpp::Context::ok()/rclcpp::ok().

I guess the question is, what's the point of having a third shutdown state besides the usual init-fini cycle? If we only intend it to be a generic notification with no side-effects of its own, why making it a separate state? Why restricting it to a single event? Why do we call it shutdown?

As I see it, to shutdown means a context and all entities associated with it are soon to be finalized. That certainly bans entity creation and restricts the amount of work you can do (spinning is a good example of what should be allowed, or otherwise the added state is somewhat useless). And while we don't do anything meaningful about it right now, if anyone ever wants to:

  • Hint the system that an entity is on its way out
    • To avoid spurious service requests (way faster than a timeout).
    • To exclude them from discovery entirely (in DDS, or whatever transport protocol rmw uses)
  • Reschedule entity tasks to speed up termination.
  • Free/redistribute resource pools early

we'll need the concept.

I think we should consider and address the bigger picture: how should a user perform multiple init / shutdown cycles cleanly?

It should be safe to do it after rclcpp::ok() returns false.

It's not though... It's only safe to do that after the context is destructed right?

I think it's currently safe, as shutdown is finishing the logger and init will be able to setup it again correctly (supposing only one context).

I will say that's because rclcpp::Context::init silently destroys the underlying rcl context. Below rclcpp, it's init-shutdown-fini and only then a context can be reused.

Note: It's impossible to recognize an already shutdown context of one that was never init, that should be clear in error messages.

This is true because rcl uses an atomic flag to invalidate contexts, instead of relying on rmw which also has to track shutdown contexts (to comply with rmw_context_fini() API). Finalized and shutdown context states could be perfectly distinguishable.

hidmic avatar Jul 20 '20 20:07 hidmic

As I see it, to shutdown means a context and all entities associated with it are soon to be finalized.

That's what I imagined as well, but I think there's a need for another use case too, but whether or not that should be part of the same mechanism, I'm not sure. For example, we need a "you should stop, cleanup, and exit your node's main functions". That's what shutdown() meant in ROS 1 for the most part.

wjwwood avatar Jul 20 '20 21:07 wjwwood