rclpy

SystemError `PyCFunction with class but no METH_METHOD` during normal ROS operations

Open · apockill opened this issue 10 months ago · 2 comments

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04 running Docker (also 22.04)
  • Installation type:
    • Docker, with ROS installed from apt. Similar environment to https://github.com/UrbanMachine/create-ros-app
  • DDS implementation:
    • cyclonedds
  • Client library (if applicable):
    • rclpy

Steps to reproduce issue

WIP. I need some help isolating why this might be happening so I can create a reproducible example. This issue is meant to get the conversation started and to see whether others have seen something similar in the wild.


Expected behavior

No exception

Actual behavior

Occasional failures. Here are two tracebacks we've seen on our robots in the wild:

Example 1

Traceback (most recent call last):
  File "/robot/install/hardware/lib/hardware/big_bird_4_0_capcom", line 8, in <module>
    sys.exit(main())
  File "/robot/install/node_helpers/lib/python3.10/site-packages/node_helpers/spinning/initialization.py", line 68, in spin_fn
    rclpy.spin(node, executor)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/__init__.py", line 226, in spin
    executor.spin_once()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 794, in spin_once
    self._spin_once_impl(timeout_sec)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 775, in _spin_once_impl
    handler, entity, node = self.wait_for_ready_callbacks(
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 711, in wait_for_ready_callbacks
    return next(self._cb_iter)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 517, in _wait_for_ready_callbacks
    waitables.extend(filter(self.can_execute, node.waitables))
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 457, in can_execute
    return not entity._executor_event and entity.callback_group.can_execute(entity)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/callback_groups.py", line 102, in can_execute
    with self._lock:
SystemError: attempting to create PyCFunction with class but no METH_METHOD flag

Example 2: In this case, it happened when accessing the is_cancel_requested property on a ServerGoalHandle.

Traceback (most recent call last):
  File "/robot/install/node_helpers/lib/python3.10/site-packages/node_helpers/robust_rpc/_wrappers.py", line 72, in wrapper
    return callback(goal)
  File "/robot/install/node_helpers/lib/python3.10/site-packages/node_helpers/actions/server/base_handler.py", line 91, in _on_goal_with_report
    result = self.on_goal(goal_handle)
  File "/robot/install/node_helpers/lib/python3.10/site-packages/node_helpers/actions/server/fail_fast_handler.py", line 54, in on_goal
    return super().on_goal(goal_handle)
  File "/robot/install/node_helpers/lib/python3.10/site-packages/node_helpers/actions/server/base_handler.py", line 71, in on_goal
    return worker.execute_callback(self._request_timeout)
  File "/robot/install/node_helpers/lib/python3.10/site-packages/node_helpers/actions/server/worker.py", line 82, in execute_callback
    while timeout_obj and not self.goal_handle.is_cancel_requested:
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/action/server.py", line 98, in is_cancel_requested
    return GoalStatus.STATUS_CANCELING == self.status
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/action/server.py", line 102, in status
    with self._lock:
SystemError: attempting to create PyCFunction with class but no METH_METHOD flag

Additional information

This happens a few times a day on a single machine running ROS 2 with roughly 25 to 50 nodes.

apockill · Jan 15 '25 22:01

This one is going to be difficult to diagnose without more information or a reproduction.

I think in the near term, there are two options we can pursue.

  1. Try building Humble with AddressSanitizer enabled. This can have a performance impact, so it may not make sense if you are running it on a production system.

  2. Enable core dump collection, so that we can at least get a backtrace of what was going on around the point where the SystemError occurs. In a simple shell, you can enable core dumps with ulimit -c unlimited, but you may need a more detailed setup if you have limited permissions on your system.
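A SystemError is raised as an ordinary Python exception, not as a crash, so the process will not dump core on its own. The following is a minimal sketch that assumes Linux and Python's standard `resource` module. `enable_core_dumps` mirrors `ulimit -c` from inside the process. It raises the soft limit only as far as the existing hard limit, which needs no extra privileges. The hypothetical `abort_on_system_error` decorator turns a SystemError into SIGABRT via `os.abort()`, so a core file gets written. The native backtrace in that core will show the abort site rather than the original failure, but the heap state at that moment is preserved. Where the core file lands is controlled by the host's `/proc/sys/kernel/core_pattern`, which is worth checking when running in Docker.

```python
import functools
import os
import resource


def enable_core_dumps() -> tuple[int, int]:
    """In-process equivalent of `ulimit -c`, bounded by the hard limit.

    Raising the soft limit up to the current hard limit is always permitted;
    going beyond the hard limit would need elevated privileges.
    Returns the resulting (soft, hard) limits.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def abort_on_system_error(fn):
    """Hypothetical helper: convert a SystemError into SIGABRT.

    SIGABRT makes the kernel write a core dump when RLIMIT_CORE allows it.
    Any other exception propagates unchanged.
    """

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except SystemError:
            os.abort()  # never returns; the process dies with SIGABRT

    return wrapper
```

Applying the decorator to a node's `main()` would catch SystemErrors that propagate out of `rclpy.spin`, as in Example 1. Errors raised inside callbacks and handled elsewhere, as in Example 2, would need the decorator on the callback itself.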

mjcarroll · Jan 23 '25 19:01

Adding a bit of information here: I noticed this issue happens most frequently when is_cancel_requested is called on the ServerGoalHandle in a tight loop. After I added a sleep(0.001) to that loop, the problem stopped happening.
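Here is a minimal sketch of that workaround. It assumes only that the handle exposes the `is_cancel_requested` property used in Example 2. `wait_for_cancel`, `CancelRequestable`, and `poll_interval_sec` are hypothetical names, and the 1 ms default simply mirrors the `sleep(0.001)` above. The sleep reduces how often the goal handle's internal lock is acquired. It made the error go away in practice here, but it is not a confirmed fix for the root cause.

```python
import time
from typing import Protocol


class CancelRequestable(Protocol):
    """Anything exposing `is_cancel_requested`, e.g. an rclpy ServerGoalHandle."""

    @property
    def is_cancel_requested(self) -> bool: ...


def wait_for_cancel(
    goal_handle: CancelRequestable,
    timeout_sec: float,
    poll_interval_sec: float = 0.001,
) -> bool:
    """Poll for a cancel request until one arrives or `timeout_sec` elapses.

    Sleeping between polls rate-limits the property reads instead of
    spinning on them as fast as the interpreter allows.
    Returns True if a cancel was requested, False on timeout.
    """
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if goal_handle.is_cancel_requested:
            return True
        time.sleep(poll_interval_sec)
    # One last check so a cancel that arrived during the final sleep is not missed.
    return goal_handle.is_cancel_requested
```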

I'll keep adding details here as I get more signal on the root cause.

apockill · Feb 13 '25 18:02