rclcpp
rclcpp copied to clipboard
Asynchronously shutting down while using `rclcpp::spin` `rclcpp:spin_some` in another thread throws an exception
This nightly failure is an example of the problem:
- https://ci.ros2.org/view/nightly/job/nightly_osx_repeated/1968/testReport/junit/(root)/projectroot/test_lifecycle_service_client/
What is happening is that one thread is creating an executor, while other is trying to shutdown.
If shutdown is called before the creation of the Executor
, an exception is thrown:
https://github.com/ros2/rclcpp/blob/87bb9f9758dce239d37256ae91d58ce6325f405b/rclcpp_lifecycle/test/test_lifecycle_service_client.cpp#L218-L231
@hidmic @wjwwood @jacobperron does this look like an user error (i.e. the test is not well written) or should be improve how shutdown works?
I'd say user error.
Seems like user error; calling shutdown before spin. What's an alternative, have spin return if the context is invalid?
Seems like user error; calling shutdown before spin. What's an alternative, have spin return if the context is invalid?
The thing is that spin_some
is called in a loop, "protected" in a rclcpp::ok
check.
If the context is shutdown in the middle of the call to rclcpp::ok
and the construction of the executor in rclcpp::spin_some
, then the error happens.
IMO, it's a behavior error, and not an user one (I would expect that if I do something after checking for rclcpp::ok
, there should be no problem).
IMO shutdown shouldn't affect executors/publishers/subscriptions/nodes/... etc, it should only make rclcpp::ok()
return false and wake up executors.
In that way, we could shutdown asynchronously without this error (or other bunch of possible of similar errors).
I also would classify it as "user error", but not necessarily the user's fault. It's confusing and so the existing story could be improved.
This has been discussed in the past:
- https://github.com/ros2/rclcpp/issues/812
- https://github.com/ros2/rclcpp/issues/610
- https://github.com/ros2/ros2/issues/369#issuecomment-518779552
And perhaps others, though I couldn't find them. I know @sloretz and I spoke about it in person at length around the time https://github.com/ros2/rmw/pull/154 was merged.
And this might be a duplicate?
- https://github.com/ros2/rclcpp/issues/1111
Originally I thought about shutdown like KeyboardInterrupt in Python, but since this is C++ and the user isn't always inside on of our functions we cannot ensure it is raised everywhere. Instead we have (some of ) our functions raise and then the user can check rclcpp::ok()
to know when to interrupt their own work.
The thinking was that if the user wrote something like this:
while (true) {
rclcpp::spin_some(...);
}
Then they wouldn't get stuck in a loop due the mistake (imo) of not checking rclcpp::ok()
. I think this fail fast approach will result in the fewest unexplained hangs for users.
However, that means to avoid tracebacks and crashes they need to catch and handle these errors (like you do with KeyboardInterrupt in Python), and it's also hard to know which things will work and which will not. For example, publishing will work, and destroying a publisher will work, but making a new publisher will not. This was essentially a compromise to enable things like "during shutdown I might want to send a message (publish) to notify someone", for example.
The rules not being clear on this is something that Apex folks struggled with, and it's clearly been an issue for us and our users. So I'd be open to reevaluating this decision. We could instead say that shutdown changes nothing but a flag in the context, and everything continues as normal and it's up to the user to notice and react to it.
IMO, it's a behavior error, and not an user one (I would expect that if I do something after checking for rclcpp::ok, there should be no problem).
This is a bad assumption, imo. It's like acquiring a lock, releasing it, and then assuming afterwards the resource is safe. I understand how it's an easy mistake to make, but imo it is a mistake. rclcpp::ok()
just checks a state, it doesn't hold a resource or block other actions, and it doesn't indicate that to be the case.
Transferred the issue to rclcpp
, as I opened it in rmw
by mistake.
This has been discussed in the past:
- https://github.com/ros2/rclcpp/issues/812
- https://github.com/ros2/rclcpp/issues/610
- https://github.com/ros2/ros2/issues/369#issuecomment-518779552
And this might be a duplicate?
- https://github.com/ros2/rclcpp/issues/1111
Thanks for pointing out the duplicates. I will try to consolidate all of them in a new issue --or edit an old one--, and close the others.
Then they wouldn't get stuck in a loop due the mistake (imo) of not checking rclcpp::ok(). I think this fail fast approach will result in the fewest unexplained hangs for users.
However, that means to avoid tracebacks and crashes they need to catch and handle these errors (like you do with KeyboardInterrupt in Python), and it's also hard to know which things will work and which will not. For example, publishing will work, and destroying a publisher will work, but making a new publisher will not. This was essentially a compromise to enable things like "during shutdown I might want to send a message (publish) to notify someone", for example.
The rules not being clear on this is something that Apex folks struggled with, and it's clearly been an issue for us and our users. So I'd be open to reevaluating this decision. We could instead say that shutdown changes nothing but a flag in the context, and everything continues as normal and it's up to the user to notice and react to it.
I think that both "exception like" and "notification like" behaviors are acceptable and have their pro/cons. But IMO, we're doing neither of both correctly.
If we prefer throwing an exception when the user tries "doing something" after shutdown, I think we should everywhere throw the same exception.
e.g.: In the case I pointed out, we're just throwing an RCLError
with a custom message, how do I distinguish that from an actual error?
I also think that we should add the try/catch
in our examples/demos.
@ivanpauno
I also think that we should add the try/catch in our examples/demos.
Just FYI, I think that I already did it with https://github.com/ros2/examples/pull/270.
@ivanpauno
I also think that we should add the try/catch in our examples/demos.
Just FYI, I think that I already did it with ros2/examples#270.
That's exactly what I don't like of how shutdown
works right now, you only can try/catch
a generic RCLError
. There should be a rclcpp::exceptions::Shutdown
error (or similar).
We already have in rcl
return values like RCL_RET_NODE_INVALID
, RCL_RET_PUBLISHER_INVALID
, etc.
After shutdown, you usually get one of those errors (e.g. https://github.com/ros2/rcl/blob/069d1f07290019b62a8297c5f95a80b58e63682a/rcl/src/rcl/publisher.c#L286).
Can we add a RCL_RET_IS_SHUTDOWN
or RCL_RET_CONTEXT_INVALID
in rcl
and return that when appropriate?
We could then add in rclcpp a ShutdownError
derived from RCLError and make sure we throw that one in throw_from_rcl_error. In rclpy, we should make sure to always raise KeybordInterrupt in those cases.
I think that with those changes, we will be able to handle an asynchronous shutdown gracefully.
This approach requires careful reviewing of all functions in rcl
layer, to make sure they are returning the appropriate error (and careful handling of those errors in the upper client layer).
Note: It's impossible to recognize an already shutdown context of one that was never init, that should be clear in error messages.
i think that is feasible. one question, what is the difference between RCL_RET_IS_SHUTDOWN and RCL_RET_CONTEXT_INVALID? i was thinking RCL_RET_CONTEXT_INVALID is equal to RCL_RET_IS_SHUTDOWN. we should add specific state that tells us if shutdown is called or not? could you enlighten me a little bit?
i think that is feasible. one question, what is the difference between RCL_RET_IS_SHUTDOWN and RCL_RET_CONTEXT_INVALID?
Sorry, I think my comment wasn't clear.
Those are two names suggestions for the same new return code.
RCL_RET_CONTEXT_INVALID
is named similarly to the other return codes.
RCL_RET_IS_SHUTDOWN
is a more clear name for this case IMO.
@ivanpauno
thanks for the clarification, we will take care of this.
@ivanpauno
thanks for the clarification, we will take care of this.
Sorry, but up to now I'm the only one in favor of that change, we need to settle in what we want first. I would like to hear other opinions before starting a change that may be rejected.
@wjwwood @hidmic @jacobperron Does a custom return code in rcl and a custom exception in rclcpp for shutdown sound like a good idea to solve the issue? Do you have any other alternative?
Introducing a new common rclcpp exception to make things consistent sounds like an incremental improvement to me +1
As a drastically different alternative, we could remove the exception and have a notification based approach by having spin
return when it encounters an error (it could return a bool or int code). I don't know how technically feasible this is, e.g.
while (rclcpp::spin_some()) {
// ...
}
And existing code written like the following should still work (but without exceptions):
while (rclcpp::ok()) {
rclcpp::spin_some(...);
}
Of course, as @wjwwood points out, users can still make the error of writing something like
while (true) {
rclcpp::spin_some(...);
}
But, we can warn them about not using the return code at compile time at least.
IMO, having spin
return silently (or with a return code) on a shutdown would match user expectation more than throwing an exception. Especially, as our examples are written using rclcpp::ok()
.
As a drastically different alternative, we could remove the exception and have a notification based approach by having spin return when it encounters an error (it could return a bool or int code). I don't know how technically feasible this is, e.g.
I think that the notification based approach is fine.
It could even be configurable in InitOptions
(I'm not sure if that's a good idea or not).
IMHO, changing spin*
signatures won't help. Our loops currently look like:
while (rclcpp::ok()) {
...
rclcpp::spin*(...);
...
}
That is perfectly suitable for a notification based approach, having a return value in spin*
functions won't add much value.
If spin*
functions get notified and stop handling new work, at some point you will reach rclcpp::ok
again.
You can even have extra rclcpp::ok
checks in the middle of the loop to break it if that's suitable to your application.
We could instead say that shutdown changes nothing but a flag in the context, and everything continues as normal and it's up to the user to notice and react to it.
Quoting myself, I think this might be the right path to take. Just make it so that shutdown does not affect the function of anything, it simply changes the state of rclcpp::Context::ok()
/rclcpp::ok()
.
This runs the risk of the user not noticing shutdown, but we can't keep them from all pitfalls.
Quoting myself, I think this might be the right path to take. Just make it so that shutdown does not affect the function of anything, it simply changes the state of rclcpp::Context::ok()/rclcpp::ok().
This runs the risk of the user not noticing shutdown, but we can't keep them from all pitfalls.
:+1: I like that option.
SGTM
i guess we got to the consensus.
i guess we got to the consensus.
Yeap, I will check in tomorrow's meeting if everybody agrees.
Doing this change will involve coordinated work in rcl
, rclpy
and rclcpp
to make sure that shutdown doesn't do more than triggering guard conditions once, invalidating the context, and calling shutdown callbacks. The only place where we should check if a context is valid or not is in rclcpp::ok()
, rclpy.ok()
(e.g. publish should continue working, etc).
Getting the refactor right might be a bit tricky, if you're planning to do I can help with the reviews.
I think we should consider and address the bigger picture: how should a user perform multiple init
/ shutdown
cycles cleanly? Atm the shutdown()
call isn't blocking and you can't call init()
right afterwards without being subject to a race or warnings like "logging already initialized".
shutdown() call isn't blocking and you can't call init() right afterwards without being subject to a race or warnings like "logging already initialized".
i think that we can, and shutdown()
call blocks with
https://github.com/ros2/rclcpp/blob/a8cd93623964a089ee132d65e8b5eda84c15740d/rclcpp/include/rclcpp/context.hpp#L345
maybe i am misunderstanding, could you elaborate?
I think we should consider and address the bigger picture: how should a user perform multiple init / shutdown cycles cleanly?
It should be safe to do it after rclcpp::ok()
returns false.
Atm the shutdown() call isn't blocking and you can't call init() right afterwards without being subject to a race or warnings like "logging already initialized".
We're currently finishing loggers at shutdown (see here), and I think that part of the idea is to stop doing that and doing it when the context is destructed.
If we do that change, init
should make sure to handle the problem correctly (it's possible with some extra lines here).
We could also provide a reinit
method that takes no arguments. It will make sure that the context is valid again, and it uses the same init options passed before (i.e. it only switches a flag).
examples:
rclcpp::init(argc, argv, init_options);
while (rclcpp::ok()) {
executor.spin_all(max_duration);
... handle more work here
}
rclcpp::reinit();
auto context = rclcpp::Context::make_shared();
context->init(argc, argv, init_options);
while (context->is_valid()) {
executor.spin_all(max_duration);
... handle more work here
}
context->init(other_argc, other_argv, other_init_options);
We should make sure both examples above work.
If the user is trying to call rclcpp::Context::init
from more than one thread, it will be a problem.
But I don't think we should provide such thread safety guarantee.
It should be safe to do it after rclcpp::ok() returns false.
It's not though... It's only safe to do that after the context is destructed right?
I haven't tried it recently, but you should be able to init/shutdown contexts in parallel as long as you're using different ones each time. Maybe the global logging setup doesn't make that work anymore.
It's not though... It's only safe to do that after the context is destructed right?
I think it's currently safe, as shutdown
is finishing the logger and init will be able to setup it again correctly (supposing only one context).
If we move logging finishing to context destruction, we will need to handle that case in init
.
I haven't tried it recently, but you should be able to init/shutdown contexts in parallel as long as you're using different ones each time. Maybe the global logging setup doesn't make that work anymore.
It should work if one context initializes the logging system and the other doesn't. I think it would be great to move the global context configuration to one per Context, but that doesn't seem to be exactly related (I will create an issue and add it to the Galactic project board).
I think we should add test for all these cases when doing the change, so it will be clear what works and what doesn't. Both cases can conceptually work correctly with the change in shutdown behavior.
Although not required, it would be nice if the solution to this issue (or version of it) could be backported to Foxy.
A few comments here and there:
If the context is shutdown in the middle of the call to rclcpp::ok and the construction of the executor in rclcpp::spin_some, then the error happens.
IMO, it's a behavior error, and not an user one (I would expect that if I do something after checking for rclcpp::ok, there should be no problem).
I see. It's a valid point. Perhaps spinning a entity associated with a shutdown context shouldn't throw an exception?
Quoting myself, I think this might be the right path to take. Just make it so that shutdown does not affect the function of anything, it simply changes the state of
rclcpp::Context::ok()
/rclcpp::ok()
.
I guess the question is, what's the point of having a third shutdown state besides the usual init-fini cycle? If we only intend it to be a generic notification with no side-effects of its own, why making it a separate state? Why restricting it to a single event? Why do we call it shutdown
?
As I see it, to shutdown means a context and all entities associated with it are soon to be finalized. That certainly bans entity creation and restricts the amount of work you can do (spinning is a good example of what should be allowed, or otherwise the added state is somewhat useless). And while we don't do anything meaningful about it right now, if anyone ever wants to:
- Hint the system that an entity is on its way out
- To avoid spurious service requests (way faster than a timeout).
- To exclude them from discovery entirely (in DDS, or whatever transport protocol
rmw
uses)
- Reschedule entity tasks to speed up termination.
- Free/redistribute resource pools early
we'll need the concept.
I think we should consider and address the bigger picture: how should a user perform multiple init / shutdown cycles cleanly?
It should be safe to do it after rclcpp::ok() returns false.
It's not though... It's only safe to do that after the context is destructed right?
I think it's currently safe, as shutdown is finishing the logger and init will be able to setup it again correctly (supposing only one context).
I will say that's because rclcpp::Context::init
silently destroys the underlying rcl
context. Below rclcpp
, it's init-shutdown-fini and only then a context can be reused.
Note: It's impossible to recognize an already shutdown context of one that was never init, that should be clear in error messages.
This is true because rcl
uses an atomic flag to invalidate contexts, instead of relying on rmw
which also has to track shutdown contexts (to comply with rmw_context_fini()
API). Finalized and shutdown context states could be perfectly distinguishable.
As I see it, to shutdown means a context and all entities associated with it are soon to be finalized.
That's what I imagined as well, but I think there's a need for another use case too, but whether or not that should be part of the same mechanism, I'm not sure. For example, we need a "you should stop, cleanup, and exit your node's main functions". That's what shutdown()
meant in ROS 1 for the most part.