rclpy icon indicating copy to clipboard operation
rclpy copied to clipboard

Exception never retrieved when using MultiThreadedExecutor and Timers

Open mr337 opened this issue 2 years ago • 8 comments

Bug report

Required Info:

  • Operating System: Ubuntu 20.04.4
  • Installation type: binaries
  • Version or commit hash: foxy
  • DDS implementation: default
  • Client library (if applicable): rclpy

Steps to reproduce issue

I am encountering an issue with exceptions being thrown in timer callbacks using MultiThreadedExecutor going silent.

The sample shows that when a an exception is generated in a timer callback using the SingleThreadExecutor, implicitly I believe, the exception rises if uncaught, as expected.

When using the MultiThreadedExecutor the exception gets lost and silently fails, sometimes causing head scratching.

#!/usr/bin/env python3
import rclpy
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node


class TestNode(Node):
    def __init__(self):
        super().__init__('zf9p')
        self.tmr = self.create_timer(1, self.tmr_callback)

    def tmr_callback(self):
        1/0
        print('Timer callback fired')


def single_executor():
    test_node = TestNode()

    while rclpy.ok() is True:
        rclpy.spin(test_node)

    test_node.destroy_timer(test_node.tmr)


def multi_executor():
    test_node = TestNode()

    # Moved to multi threaded executor so none of Actions block publishing of msgs
    executor = MultiThreadedExecutor(num_threads=1)
    executor.add_node(test_node)

    while rclpy.ok() is True:
        executor.spin()

    test_node.destroy_timer(test_node.tmr)


if __name__ == '__main__':
    rclpy.init()

    # single_executor()  # un comment to verify proper operation 
    multi_executor()

    rclpy.shutdown()

Expected behavior

Unhandled exception to rise to the top

$ python3 timer_example
Traceback (most recent call last):
  ...
  File "/opt/ros/foxy/lib/python3.8/site-packages/rclpy/executors.py", line 343, in _execute_timer
    await await_or_execute(tmr.callback)
  File "/opt/ros/foxy/lib/python3.8/site-packages/rclpy/executors.py", line 118, in await_or_execute
    return callback(*args)
  File "timer_example", line 13, in tmr_callback
    1/0
ZeroDivisionError: division by zero

Actual behavior

Exception seems to silently fail not handled, but after 10 or 30 seconds the following message does show.

$ python3 timer_example
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero

Additional information

This looks to be similar to https://github.com/ros2/rclpy/issues/296, but not for certain.

mr337 avatar Aug 03 '22 20:08 mr337

@mr337 thanks for the issue, i confirmed that this problem is not only for foxy but mainline source build with https://github.com/ros2/ros2/commit/5a35507c37f3b0449e31cb4073cf0352bfd52b0b.

  • SingleThreadedExecutor
# ros2 run prover_rclpy rclpy_983
Traceback (most recent call last):
  File "/root/ros2_ws/colcon_ws/install/prover_rclpy/lib/prover_rclpy/rclpy_983", line 33, in <module>
    sys.exit(load_entry_point('prover-rclpy', 'console_scripts', 'rclpy_983')())
  File "/root/ros2_ws/colcon_ws/build/prover_rclpy/src/rclpy_983.py", line 41, in main
    single_executor()  # un comment to verify proper operation
  File "/root/ros2_ws/colcon_ws/build/prover_rclpy/src/rclpy_983.py", line 20, in single_executor
    rclpy.spin(test_node)
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/__init__.py", line 222, in spin
    executor.spin_once()
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 723, in spin_once
    raise handler.exception()
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/task.py", line 239, in __call__
    self._handler.send(None)
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 423, in handler
    await call_coroutine(entity, *arg)
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 333, in _execute_timer
    await await_or_execute(tmr.callback)
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 107, in await_or_execute
    return callback(*args)
  File "/root/ros2_ws/colcon_ws/build/prover_rclpy/src/rclpy_983.py", line 12, in tmr_callback
    1/0
ZeroDivisionError: division by zero
[ros2run]: Process exited with failure 1
  • MultiThreadedExecutor
root@tomoyafujita:~/ros2_ws/colcon_ws# ros2 run prover_rclpy rclpy_983
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
The following exception was never retrieved: division by zero
...<snip>

fujitatomoya avatar Aug 04 '22 04:08 fujitatomoya

with https://github.com/ros2/rclpy/pull/984, MultiThreadedExecutor can raise the user exception like SingleThreadedExecutor.

# ros2 run prover_rclpy rclpy_983
Traceback (most recent call last):
  File "/root/ros2_ws/colcon_ws/install/prover_rclpy/lib/prover_rclpy/rclpy_983", line 33, in <module>
    sys.exit(load_entry_point('prover-rclpy', 'console_scripts', 'rclpy_983')())
  File "/root/ros2_ws/colcon_ws/build/prover_rclpy/src/rclpy_983.py", line 42, in main
    multi_executor()
  File "/root/ros2_ws/colcon_ws/build/prover_rclpy/src/rclpy_983.py", line 33, in multi_executor
    executor.spin()
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 279, in spin
    self.spin_once()
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 772, in spin_once
    self._spin_once_impl(timeout_sec)
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 769, in _spin_once_impl
    self._executor.submit(handler_wrapper(handler))
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 768, in handler_wrapper
    raise handler.exception()
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/task.py", line 239, in __call__
    self._handler.send(None)
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 423, in handler
    await call_coroutine(entity, *arg)
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 333, in _execute_timer
    await await_or_execute(tmr.callback)
  File "/root/ros2_ws/colcon_ws/install/rclpy/local/lib/python3.10/dist-packages/rclpy/executors.py", line 107, in await_or_execute
    return callback(*args)
  File "/root/ros2_ws/colcon_ws/build/prover_rclpy/src/rclpy_983.py", line 12, in tmr_callback
    1/0
ZeroDivisionError: division by zero
[ros2run]: Process exited with failure 1

fujitatomoya avatar Aug 04 '22 05:08 fujitatomoya

@fujitatomoya Thanks for the response and PR. I'll comment on the PR as the reviewer has a few good questions.

mr337 avatar Aug 04 '22 19:08 mr337

With #984 now merged, I'm going to go ahead and close this. If you feel that this was in error, please feel free to reopen.

clalancette avatar Aug 18 '22 12:08 clalancette

I need to reopen this because of https://github.com/ros2/rclpy/pull/1017

fujitatomoya avatar Oct 06 '22 19:10 fujitatomoya

@fujitatomoya LMK if there is any way I can help!

mr337 avatar Oct 07 '22 02:10 mr337

@mr337 PR would be really appreciated.

Implementation Note: It should not submit the execution, cz it will block the submit method until the future is completed. This is because https://github.com/ros2/rclpy/issues/1016 happens. Instead, it should submit handler, and save the future returned by executor, then check if they are completed successfully or thrown exception. these returned futures should be managed w/o any block, probably each spin loop we can check the previous results to see if there is an exception, and if there is, raise that exception from the loop. at the same time, it should clear the completed result.

fujitatomoya avatar Oct 07 '22 16:10 fujitatomoya

I am not for sure I fully understand the problem :S But I can try to attempt a PR. :) I should have some available cycles next week to work on this.

mr337 avatar Oct 07 '22 16:10 mr337

@fujitatomoya or @mr337 want to check that this isn't lost as this is pretty critical functionality.

Achllle avatar Jan 03 '23 11:01 Achllle

@Achllle I am not for sure how to handle this. After seeing the breaking changes this made maybe the design is already set and changing it could be a bigger deal. The https://github.com/ros2/rclpy/pull/984 had a good discussion on what the default behavior should and shouldn't be. @fujitatomoya you was involved with that discussion, do you think this issue is still valid?

I am happy to close if not and just make a better effort to handle issues within the executor.

mr337 avatar Jan 03 '23 17:01 mr337

do you think this issue is still valid?

yes, i think this is a bug. uncaught exception should be raised to user application from executor. as i mentioned https://github.com/ros2/rclpy/issues/983#issuecomment-1271782711, this is doable, but cannot allocate bandwidth for that.

this isn't lost as this is pretty critical functionality.

this is not. thanks for pinging, please consider the contribution if possible.

fujitatomoya avatar Jan 03 '23 21:01 fujitatomoya

@Achllle thank you for the contribution, everything has been done including backport humble.

fujitatomoya avatar Feb 25 '23 17:02 fujitatomoya

Hi, I am currently using foxy (installed from apt) and I recently encountered the exact problem mentioned by the original author of this ticket. Is there a reason why the multithreaded executor would still be giving me problems like this?

tsender avatar Apr 21 '23 14:04 tsender

Hi, I am currently using foxy (installed from apt) and I recently encountered the exact problem mentioned by the original author of this ticket. Is there a reason why the multithreaded executor would still be giving me problems like this?

Not a deep investigation but it doesn't seem like the fix was backported to foxy, only humble

tonynajjar avatar Apr 21 '23 15:04 tonynajjar

Aargh. I find it really annoying when fixes are not backpropagated to versions that are still highly in use. I'll just fork the repo and make the change there for the time being.

tsender avatar Apr 21 '23 15:04 tsender

@fujitatomoya Since Foxy is still actively supported for another month, could we consider backporting that distribution too?

In the meantime, while not ideal, you could monkeypatch the function as follows in your code:

# Monkeypatch MultiThreadedExecutor._spin_once_impl
from typing import Callable
from rclpy.executors import ExternalShutdownException, ShutdownException, TimeoutException, ConditionReachedException
def _spin_once_impl(
    self,
    timeout_sec: float = None,
    wait_condition: Callable[[], bool] = lambda: False
) -> None:
    try:
        self.futures
    except AttributeError:
        self._futures = []
    try:
        handler, entity, node = self.wait_for_ready_callbacks(
            timeout_sec, None, wait_condition)
    except ExternalShutdownException:
        pass
    except ShutdownException:
        pass
    except TimeoutException:
        pass
    except ConditionReachedException:
        pass
    else:
        self._executor.submit(handler)
        self._futures.append(handler)
        for future in self._futures:  # check for any exceptions
            if future.done():
                self._futures.remove(future)
                future.result()

MultiThreadedExecutor._spin_once_impl = _spin_once_impl

Achllle avatar Apr 21 '23 15:04 Achllle

I actually just forked the rclpy repo and made the change in the foxy branch, then had colcon allow overriding when building. But thanks for the other alternative. Anyway, if foxy is only supported for another month, that means you still support galactic for a while too. I would recommend updating that version too.

tsender avatar Apr 21 '23 15:04 tsender

if foxy is only supported for another month, that means you still support galactic for a while too.

That's not the case given Galactic is not an LTS like Foxy.

Achllle avatar Apr 21 '23 18:04 Achllle

@fujitatomoya Since Foxy is still actively supported for another month, could we consider backporting that distribution too?

:+1: so that it will be available for last patch.

That's not the case given Galactic is not an LTS like Foxy.

right, Galactic is already E.O.L.

fujitatomoya avatar Apr 21 '23 18:04 fujitatomoya

@tsender @Achllle foxy backport, https://github.com/ros2/rclpy/pull/1116

if you would verify and review, that would be appreciated!

fujitatomoya avatar Apr 21 '23 18:04 fujitatomoya

When I made the exact same changes to my own fork, I did not see any problems and it seemed to function as expected. And my mistake about galactic, I forgot only certain releases were LTS.

tsender avatar Apr 21 '23 20:04 tsender

@tsender thanks for the test!

fujitatomoya avatar Apr 21 '23 20:04 fujitatomoya

As of #1073 this is mostly fixed but there's still a problem, which is that the exception is only raised if another callback is ready to execute. In the case of a node that only provides services, this means an exception in a service won't be raised until someone makes another service call, which may not happen for a while.

This is mostly because wait_for_ready_callbacks blocks indefinitely and so the code to check future statuses doesn't get to run. But also you can't get around it by calling spin_once in a loop because the future checking doesn't run if a TimeoutException gets raised either

jdlangs avatar Jun 21 '23 19:06 jdlangs

@jdlangs good find. Do you have a suggestion for a solution that fixes this case?

Achllle avatar Jun 23 '23 23:06 Achllle

@Achllle I think simply moving the future result check to before the wait should do it if I'm not mistaken. If you agree I can make a quick PR.

Edit: I guess you also need to let wait_for_ready_callbacks timeout though. I don't see a way to make it work if it can block indefinitely.

jdlangs avatar Aug 15 '23 21:08 jdlangs

@jdlangs thank you very much for your contribution. I would suggest that you can create PR against rolling 1st, then we can backport them to Iron and Humble.

fujitatomoya avatar Aug 15 '23 22:08 fujitatomoya