Composable node runtime error (undefined symbol) but normal Node has no issue
Bug report
Required Info:
- Operating System:
- Ubuntu 22.04
- Installation type:
- binaries
- Version or commit hash:
- humble
- DDS implementation:
- Client library (if applicable):
- rclcpp
I am trying to call the python interpreter from a ComposableNode.
I have no issue doing a simple print(), but if I try to do import torch, the program crashes with an undefined symbol error.
terminate called after throwing an instance of 'pybind11::error_already_set'
what(): ImportError: /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so: undefined symbol: PyTuple_Type
At:
/usr/lib/python3.10/ctypes/__init__.py(8): <module>
<frozen importlib._bootstrap>(241): _call_with_frames_removed
<frozen importlib._bootstrap_external>(883): exec_module
<frozen importlib._bootstrap>(703): _load_unlocked
<frozen importlib._bootstrap>(1006): _find_and_load_unlocked
<frozen importlib._bootstrap>(1027): _find_and_load
/home/mclement/.local/lib/python3.10/site-packages/torch/__init__.py(17): <module>
<frozen importlib._bootstrap>(241): _call_with_frames_removed
<frozen importlib._bootstrap_external>(883): exec_module
<frozen importlib._bootstrap>(703): _load_unlocked
<frozen importlib._bootstrap>(1006): _find_and_load_unlocked
<frozen importlib._bootstrap>(1027): _find_and_load
There is no issue when doing this with a normal Node.
Steps to reproduce issue
I made a minimal example showcasing the issue in the following repository: https://github.com/maxime-clem/ros2_composable_node_bug
# requires python3-dev and pybind11-dev
git clone git@github.com:maxime-clem/ros2_composable_node_bug.git
cd ros2_composable_node_bug
colcon build
source install/setup.sh
ros2 run test_exe test_exe # No issue
ros2 run test_node test_node # No issue
ros2 run test_composable_node test_composable_node_exe # undefined symbol error
Expected behavior
A composable node should be able to use the Python library without issue, just like a normal Node.
Actual behavior
Composable node crashes with an undefined symbol: PyTuple_Type error.
Additional information
I have confirmed the issue with another user, so it does not appear to be an environment issue.
The only workaround found so far is to use dlopen("libpython3.10.so", RTLD_GLOBAL | RTLD_NOW) in the code of the ComposableNode.
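For reference, the workaround can be sketched roughly as follows. This is only a minimal sketch: `preload_library_globally` is a hypothetical helper name, not something from rclcpp or pybind11, and it assumes the library name passed in (for example `libpython3.10.so`) can be found by the dynamic loader.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Hypothetical helper (not part of rclcpp or pybind11): opens a shared
// library with RTLD_GLOBAL so that its symbols become visible to libraries
// loaded later, such as Python extension modules.
// Returns true if the dynamic loader could open the library.
bool preload_library_globally(const char * soname)
{
  void * handle = dlopen(soname, RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) {
    std::fprintf(stderr, "dlopen(%s) failed: %s\n", soname, dlerror());
    return false;
  }
  // The handle is intentionally never passed to dlclose(): the library has to
  // stay loaded for as long as the embedded interpreter is in use.
  return true;
}
```

In the ComposableNode this would be called once, e.g. `preload_library_globally("libpython3.10.so")`, before the first Python import is executed.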
Interesting, because as far as I know, the composable node containers should be purely C++ and shouldn't have any interactions with pybind11. Does this happen on the first pybind11 node that you load, or on subsequent ones?
The issue happens even without using a composable node container (it can be reproduced by directly running the node executable).
Since the issue can be solved by using dlopen, it seems to be a linker issue, but I do not see any reason why the link to the python library would be different between a Node and a ComposableNode.
I'm not 100% sure of this, but the situation seems similar to https://github.com/PyO3/pyo3/issues/2000#issuecomment-979479111, which leads to https://bugs.python.org/issue21536. There, they discuss some of the ins and outs of loading things dynamically with Python. In particular, I'll point to this comment where they say:
"IHMO it's a bad usage of dlopen(): libpython must always be loaded with RTLD_GLOBAL."
I then took a look at how we loaded libraries, and saw this: https://github.com/ros2/rcutils/blob/d3fed35f2d8e19dede7f6dfd5f3b862c40ac7809/src/shared_library.c#L97
Indeed, locally if I switch that to RTLD_LAZY | RTLD_GLOBAL, the example that @maxime-clem provided works.
So the question is: should we add in RTLD_GLOBAL? It fixes the issue, but I'm slightly concerned about other side-effects it might have. @mjcarroll thoughts?
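To make the difference between the two flag sets easier to check, here is a small sketch that loads a library with caller-chosen flags and then asks whether a symbol is resolvable from the global scope. `probe_symbol` and `Visibility` are hypothetical names made up for this illustration, and the sketch assumes Linux with glibc, where `RTLD_DEFAULT` is available and `dlopen` defaults to local visibility when `RTLD_GLOBAL` is not given.

```cpp
#include <dlfcn.h>

// Hypothetical result type for this illustration.
enum class Visibility { kLoadFailed, kLocalOnly, kGlobal };

// Hypothetical helper: loads `library` with `flags`, then checks whether
// `symbol` can be found through the global symbol scope. Libraries opened
// without RTLD_GLOBAL are not part of that scope, so their symbols cannot be
// used to resolve undefined references in libraries loaded afterwards.
Visibility probe_symbol(const char * library, const char * symbol, int flags)
{
  void * handle = dlopen(library, flags);
  if (handle == nullptr) {
    return Visibility::kLoadFailed;
  }
  // RTLD_DEFAULT searches the global scope only, not `handle` itself.
  void * address = dlsym(RTLD_DEFAULT, symbol);
  return address != nullptr ? Visibility::kGlobal : Visibility::kLocalOnly;
}
```

In a process that has not loaded libpython yet, I would expect `probe_symbol("libpython3.10.so", "PyTuple_Type", RTLD_LAZY)` to report `kLocalOnly` and the same call with `RTLD_LAZY | RTLD_GLOBAL` to report `kGlobal`, which would match the behavior seen here. If the library is already loaded globally in the process, the result is `kGlobal` regardless of the flags.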