opentelemetry-python
OtelBatchSpanProcessor threading KeyError
Environment:
ubuntu:22.04
python:3.10
libs installed:
...
opentelemetry-api==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-exporter-gcp-trace==1.6.0
...
app.py
from opentelemetry import trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
tracer_provider = TracerProvider()
trace.set_tracer_provider(tracer_provider)
gcp_exporter = CloudTraceSpanExporter()
span_processor = BatchSpanProcessor(gcp_exporter)
tracer_provider.add_span_processor(span_processor)
running with:
gunicorn --worker-class geventwebsocket.gunicorn.workers.GeventWebSocketWorker -w 1 --threads 20 app:app
I'm running my service on Google App Engine with this configuration, but many times when I create a new span I get the following exception and the service is restarted (I'm using supervisord):
Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
  File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.10/threading.py", line 1008, in _bootstrap_inner
    del _limbo[self]
KeyError: <Thread(OtelBatchSpanProcessor, stopped daemon 139781863898496)>
2024-04-26T18:20:34Z <Greenlet at 0x7f218057fd80: <bound method Thread._bootstrap of <Thread(OtelBatchSpanProcessor, stopped daemon 139781863898496)>>> failed with KeyError
[2024-04-26 18:20:35 +0000] [8] [ERROR] Worker (pid:12) was sent code 139!
2024-04-26 18:20:35,177 INFO reaped unknown pid 164 (exit status 0)
2024-04-26 18:20:35,179 INFO reaped unknown pid 147 (exit status 0)
2024-04-26 18:20:35,179 INFO reaped unknown pid 173 (exit status 0)
Do you know how I can address this issue? As you can see, I'm using gevent==23.9.1.
If you use gevent you should monkey-patch everything you use, and I don't see that in your code.
Sorry, I didn't include it. Yes, I'm patching everything at the very start of my script:
from gevent import monkey
monkey.patch_all()
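For reference, a minimal sketch of the ordering gevent expects (a sketch, assuming nothing imports threading or the OpenTelemetry SDK before the patch runs):

from gevent import monkey
monkey.patch_all()  # must run before anything that imports threading/socket/ssl

# Only after patching, pull in the SDK so its locks and the
# OtelBatchSpanProcessor worker thread are created from patched primitives.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor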
We observed a similar issue recently after switching to BatchSpanProcessor. We also do the monkey patching in our code.
Issue: Conflict Between BatchSpanProcessor and Gevent in Service
Environment:
- Gevent: 23.9.1
- Python: 3.9.19
- OS: Rocky Linux 8.10 (Green Obsidian)
- OpenTelemetry: 1.27.0
Service Overview:
The service is complex: it uses the _thread module to create new threads and the subprocess module to manage multiple subprocesses (especially for SSH clients). Additionally, the service calls gevent.monkey.patch_all() at the very beginning to enable cooperative multitasking. Note: the service constantly creates subprocesses with the subprocess module to run logic that is not Python- or gevent-related; a sketch of this setup follows below.
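For illustration, a minimal sketch of the shape of this setup (the worker function and the SSH command line below are hypothetical placeholders, not our actual code):

from gevent import monkey
monkey.patch_all()  # done at the very beginning of the service

import _thread
import subprocess
import time

def background_worker():
    # Stands in for the service's long-running threads created via _thread.
    while True:
        time.sleep(1)

_thread.start_new_thread(background_worker, ())

# SSH work is delegated to short-lived subprocesses, spawned continuously.
subprocess.run(["ssh", "-o", "BatchMode=yes", "example-host", "uptime"])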
Errors Encountered:
1. Error During BatchSpanProcessor Initialization:
When initializing BatchSpanProcessor, the following error is observed:
Traceback (most recent call last):
  File "src/gevent/_abstract_linkable.py", line 287, in gevent._gevent_c_abstract_linkable.AbstractLinkable._notify_links
  File "src/gevent/_abstract_linkable.py", line 333, in gevent._gevent_c_abstract_linkable.AbstractLinkable._notify_links
AssertionError: (None, <callback at 0x7f522d2748c0 args=([],)>)
2. Error After the Service Runs for Some Time (within 30 seconds):
After the service runs for a while, the following error appears multiple times with different daemon thread IDs:
Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
  File "/usr/lib64/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib64/python3.9/threading.py", line 972, in _bootstrap_inner
    del _limbo[self]
KeyError: <Thread(OtelBatchSpanProcessor, stopped daemon 140119617635328)>
2024-11-15T22:20:58Z <Greenlet at 0x7f7024090400: <bound method Thread._bootstrap of <Thread(OtelBatchSpanProcessor, stopped daemon 140119617635328)>>> failed with KeyError
These errors affect the other threads running in the service and eventually cause the service to enter an unhealthy state, with some hosts becoming unreachable.
3. Investigation Process:
As mentioned above, the problem shows up when we initialize BatchSpanProcessor in the service's main process. We suspect that os.register_at_fork(after_in_child=self._at_fork_reinit) may be the cause. To confirm, we added logging in a function, test_fork():
def test_fork(self):
    # Logs the child PID and the call stack that led to the fork.
    log.INFO("Otel in fork: PID=%s" % os.getpid())
    stack = inspect.stack()
    log.INFO("Call stack:\n" + "\n".join(
        "{%s}:{%s} in {%s}" % (frame.filename, frame.lineno, frame.function)
        for frame in stack))
We registered this function with os.register_at_fork(after_in_child=self.test_fork) to track forking behavior, and it shows that the BatchSpanProcessor gets reinitialized many times in the SSH subprocesses. We noticed the following log output:
2024-11-15 22:35:59,363Z INFO server.py:278 Otel in fork: PID=2402193
2024-11-15 22:35:59,641Z INFO server.py:280 Call stack:
{<HOME>/.venvs/bin/lib/python3.9/site-packages/acropolis/server/server.py}:{279} in {test_fork}
{<gevent-23.9.1>/gevent/os.py}:{426} in {fork_and_watch}
{<gevent-23.9.1>/gevent/subprocess.py}:{1624} in {_execute_child}
{<gevent-23.9.1>/gevent/subprocess.py}:{807} in {__init__}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/base/command.py}:{217} in {timed_command}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/ssh_client.py}:{227} in {__timed_command}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/ssh_client.py}:{484} in {execute}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/ssh_client.py}:{730} in {wrapper}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/hypervisor/kvm_ssh_client.py}:{111} in {execute}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/hypervisor/ssh_utils.py}:{42} in {ssh_execute}
{<HOME>/.venvs/bin/lib/python3.9/site-packages/acropolis/hypervisor/kvm/libvirt_connection.py}:{689} in {_get_uefi_nvram_size}
{<HOME>/.venvs/bin/lib/python3.9/site-packages/acropolis/hypervisor/kvm/libvirt_connection.py}:{1308} in {_connect}
{<HOME>/.venvs/bin/lib/python3.9/site-packages/util/hypervisor/connection.py}:{614} in {_reconnect_loop}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/misc/decorators.py}:{47} in {wrapper}
This means BatchSpanProcessor gets reinitialized many times in these SSH subprocesses. However, when we checked the active threads by dumping all greenlets in this service, only the BatchSpanProcessor running in the main process was visible; the others could not be seen. A minimal sketch of this fork-hook behavior follows below.
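For context, here is a minimal standalone sketch (the hook name is a hypothetical stand-in for BatchSpanProcessor._at_fork_reinit) showing why the after_in_child hook fires for plain subprocess calls once gevent is patched: gevent's patched subprocess forks via gevent.os.fork_and_watch, and the fork runs any registered after_in_child hooks in the child before it execs the command, exactly as the call stack above shows:

from gevent import monkey
monkey.patch_all()

import os
import subprocess

def log_fork():
    # Hypothetical stand-in for BatchSpanProcessor._at_fork_reinit.
    print("after_in_child hook fired in PID %s" % os.getpid(), flush=True)

os.register_at_fork(after_in_child=log_fork)

# With subprocess monkey-patched, Popen goes through gevent's fork_and_watch,
# so the hook above runs in every child created just to exec a command.
subprocess.run(["true"])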
We also tried adding a time.sleep(5) for test purposes before our OpenTelemetry BatchSpanProcessor initialization (our tracing initialization happens at the end of service startup); with that, the error logs no longer appear every time, only intermittently.
Any idea on how to solve this problem? Please help. Thanks!
More context: we are migrating from SimpleSpanProcessor to BatchSpanProcessor. SimpleSpanProcessor works fine, but it has performance issues that we have confirmed in testing; the sketch below shows the change involved.
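For reference, the migration in question is essentially this swap (a sketch using ConsoleSpanExporter as a placeholder exporter):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
    SimpleSpanProcessor,
)

exporter = ConsoleSpanExporter()  # placeholder; any exporter is wired the same way
provider = TracerProvider()
trace.set_tracer_provider(provider)

# Before: spans are exported synchronously in the calling thread as they end.
# provider.add_span_processor(SimpleSpanProcessor(exporter))

# After: spans are queued and exported from the background
# "OtelBatchSpanProcessor" daemon thread seen in the tracebacks above.
provider.add_span_processor(BatchSpanProcessor(exporter))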
Hi all, did you find any solution for this so far?
Hi, I'm facing the same issue. Is there any workaround or solution for this?