rclpy
rclpy copied to clipboard
Exception never retrieved when using MultiThreadedExecutor and Timers
Bug report
Required Info:
- Operating System: Ubuntu 20.04.4
- Installation type: binaries
- Version or commit hash: foxy
- DDS implementation: default
- Client library (if applicable): rclpy
Steps to reproduce issue
I am encountering an issue with exceptions being thrown in timer callbacks using MultiThreadedExecutor
going silent.
The sample shows that when a an exception is generated in a timer callback using the SingleThreadExecutor
, implicitly I believe, the exception rises if uncaught, as expected.
When using the MultiThreadedExecutor
the exception gets lost and silently fails, sometimes causing head scratching.
#!/usr/bin/env python3
import rclpy
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node
class TestNode(Node):
def __init__(self):
super().__init__('zf9p')
self.tmr = self.create_timer(1, self.tmr_callback)
def tmr_callback(self):
1/0
print('Timer callback fired')
def single_executor():
test_node = TestNode()
while rclpy.ok() is True:
rclpy.spin(test_node)
test_node.destroy_timer(test_node.tmr)
def multi_executor():
test_node = TestNode()
# Moved to multi threaded executor so none of Actions block publishing of msgs
executor = MultiThreadedExecutor(num_threads=1)
executor.add_node(test_node)
while rclpy.ok() is True:
executor.spin()
test_node.destroy_timer(test_node.tmr)
if __name__ == '__main__':
rclpy.init()
# single_executor() # un comment to verify proper operation
multi_executor()
rclpy.shutdown()
Expected behavior
Unhandled exception to rise to the top
$ python3 timer_example
Traceback (most recent call last):
...
File "/opt/ros/foxy/lib/python3.8/site-packages/rclpy/executors.py", line 343, in _execute_timer
await await_or_execute(tmr.callback)
File "/opt/ros/foxy/lib/python3.8/site-packages/rclpy/executors.py", line 118, in await_or_execute
return callback(*args)
File "timer_example", line 13, in tmr_callback
1/0
ZeroDivisionError: division by zero
Actual behavior
Exception seems to silently fail not handled, but after 10 or 30 seconds the following message does show.
$ python3 timer_example
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
Additional information
This looks to be similar to https://github.com/ros2/rclpy/issues/296, but not for certain.
@mr337 thanks for the issue, i confirmed that this problem is not only for foxy but mainline source build with https://github.com/ros2/ros2/commit/5a35507c37f3b0449e31cb4073cf0352bfd52b0b.
- SingleThreadedExecutor
# ros2 run prover_rclpy rclpy_983
Traceback (most recent call last):
File "/root/ros2_ws/colcon_ws/install/prover_rclpy/lib/prover_rclpy/rclpy_983", line 33, in <module>
sys.exit(load_entry_point('prover-rclpy', 'console_scripts', 'rclpy_983')())
File "/root/ros2_ws/colcon_ws/build/prover_rclpy/src/rclpy_983.py", line 41, in main
single_executor() # un comment to verify proper operation
File "/root/ros2_ws/colcon_ws/build/prover_rclpy/src/rclpy_983.py", line 20, in single_executor
rclpy.spin(test_node)
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/__init__.py", line 222, in spin
executor.spin_once()
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 723, in spin_once
raise handler.exception()
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/task.py", line 239, in __call__
self._handler.send(None)
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 423, in handler
await call_coroutine(entity, *arg)
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 333, in _execute_timer
await await_or_execute(tmr.callback)
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 107, in await_or_execute
return callback(*args)
File "/root/ros2_ws/colcon_ws/build/prover_rclpy/src/rclpy_983.py", line 12, in tmr_callback
1/0
ZeroDivisionError: division by zero
[ros2run]: Process exited with failure 1
- MultiThreadedExecutor
root@tomoyafujita:~/ros2_ws/colcon_ws# ros2 run prover_rclpy rclpy_983
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
...<snip>
with https://github.com/ros2/rclpy/pull/984, MultiThreadedExecutor
can raise the user exception like SingleThreadedExecutor
.
# ros2 run prover_rclpy rclpy_983
Traceback (most recent call last):
File "/root/ros2_ws/colcon_ws/install/prover_rclpy/lib/prover_rclpy/rclpy_983", line 33, in <module>
sys.exit(load_entry_point('prover-rclpy', 'console_scripts', 'rclpy_983')())
File "/root/ros2_ws/colcon_ws/build/prover_rclpy/src/rclpy_983.py", line 42, in main
multi_executor()
File "/root/ros2_ws/colcon_ws/build/prover_rclpy/src/rclpy_983.py", line 33, in multi_executor
executor.spin()
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 279, in spin
self.spin_once()
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 772, in spin_once
self._spin_once_impl(timeout_sec)
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 769, in _spin_once_impl
self._executor.submit(handler_wrapper(handler))
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 768, in handler_wrapper
raise handler.exception()
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/task.py", line 239, in __call__
self._handler.send(None)
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 423, in handler
await call_coroutine(entity, *arg)
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 333, in _execute_timer
await await_or_execute(tmr.callback)
File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 107, in await_or_execute
return callback(*args)
File "/root/ros2_ws/colcon_ws/build/prover_rclpy/src/rclpy_983.py", line 12, in tmr_callback
1/0
ZeroDivisionError: division by zero
[ros2run]: Process exited with failure 1
@fujitatomoya Thanks for the response and PR. I'll comment on the PR as the reviewer has a few good questions.
With #984 now merged, I'm going to go ahead and close this. If you feel that this was in error, please feel free to reopen.
I need to reopen this because of https://github.com/ros2/rclpy/pull/1017
@fujitatomoya LMK if there is any way I can help!
@mr337 PR would be really appreciated.
Implementation Note:
It should not submit the execution, cz it will block the submit
method until the future is completed. This is because https://github.com/ros2/rclpy/issues/1016 happens.
Instead, it should submit handler, and save the future returned by executor, then check if they are completed successfully or thrown exception. these returned futures should be managed w/o any block, probably each spin loop we can check the previous results to see if there is an exception, and if there is, raise that exception from the loop. at the same time, it should clear the completed result.
I am not for sure I fully understand the problem :S But I can try to attempt a PR. :) I should have some available cycles next week to work on this.
@fujitatomoya or @mr337 want to check that this isn't lost as this is pretty critical functionality.
@Achllle I am not for sure how to handle this. After seeing the breaking changes this made maybe the design is already set and changing it could be a bigger deal. The https://github.com/ros2/rclpy/pull/984 had a good discussion on what the default behavior should and shouldn't be. @fujitatomoya you was involved with that discussion, do you think this issue is still valid?
I am happy to close if not and just make a better effort to handle issues within the executor.
do you think this issue is still valid?
yes, i think this is a bug. uncaught exception should be raised to user application from executor. as i mentioned https://github.com/ros2/rclpy/issues/983#issuecomment-1271782711, this is doable, but cannot allocate bandwidth for that.
this isn't lost as this is pretty critical functionality.
this is not. thanks for pinging, please consider the contribution if possible.
@Achllle thank you for the contribution, everything has been done including backport humble.
Hi, I am currently using foxy (installed from apt) and I recently encountered the exact problem mentioned by the original author of this ticket. Is there a reason why the multithreaded executor would still be giving me problems like this?
Hi, I am currently using foxy (installed from apt) and I recently encountered the exact problem mentioned by the original author of this ticket. Is there a reason why the multithreaded executor would still be giving me problems like this?
Not a deep investigation but it doesn't seem like the fix was backported to foxy, only humble
Aargh. I find it really annoying when fixes are not backpropagated to versions that are still highly in use. I'll just fork the repo and make the change there for the time being.
@fujitatomoya Since Foxy is still actively supported for another month, could we consider backporting that distribution too?
In the meantime, while not ideal, you could monkeypatch the function as follows in your code:
# Monkeypatch MultiThreadedExecutor._spin_once_impl
from typing import Callable
from rclpy.executors import ExternalShutdownException, ShutdownException, TimeoutException, ConditionReachedException
def _spin_once_impl(
self,
timeout_sec: float = None,
wait_condition: Callable[[], bool] = lambda: False
) -> None:
try:
self.futures
except AttributeError:
self._futures = []
try:
handler, entity, node = self.wait_for_ready_callbacks(
timeout_sec, None, wait_condition)
except ExternalShutdownException:
pass
except ShutdownException:
pass
except TimeoutException:
pass
except ConditionReachedException:
pass
else:
self._executor.submit(handler)
self._futures.append(handler)
for future in self._futures: # check for any exceptions
if future.done():
self._futures.remove(future)
future.result()
MultiThreadedExecutor._spin_once_impl = _spin_once_impl
I actually just forked the rclpy repo and made the change in the foxy branch, then had colcon allow overriding when building. But thanks for the other alternative. Anyway, if foxy is only supported for another month, that means you still support galactic for a while too. I would recommend updating that version too.
if foxy is only supported for another month, that means you still support galactic for a while too.
That's not the case given Galactic is not an LTS like Foxy.
@fujitatomoya Since Foxy is still actively supported for another month, could we consider backporting that distribution too?
:+1: so that it will be available for last patch.
That's not the case given Galactic is not an LTS like Foxy.
right, Galactic is already E.O.L.
@tsender @Achllle foxy backport, https://github.com/ros2/rclpy/pull/1116
if you would verify and review, that would be appreciated!
When I made the exact same changes to my own fork, I did not see any problems and it seemed to function as expected. And my mistake about galactic, I forgot only certain releases were LTS.
@tsender thanks for the test!
As of #1073 this is mostly fixed but there's still a problem, which is that the exception is only raised if another callback is ready to execute. In the case of a node that only provides services, this means an exception in a service won't be raised until someone makes another service call, which may not happen for a while.
This is mostly because wait_for_ready_callbacks
blocks indefinitely and so the code to check future statuses doesn't get to run. But also you can't get around it by calling spin_once
in a loop because the future checking doesn't run if a TimeoutException
gets raised either
@jdlangs good find. Do you have a suggestion for a solution that fixes this case?
@Achllle I think simply moving the future result check to before the wait should do it if I'm not mistaken. If you agree I can make a quick PR.
Edit: I guess you also need to let wait_for_ready_callbacks
timeout though. I don't see a way to make it work if it can block indefinitely.
@jdlangs thank you very much for your contribution. I would suggest that you can create PR against rolling 1st, then we can backport them to Iron and Humble.