
OtelBatchSpanProcessor threading KeyError

Open frx08 opened this issue 1 year ago • 6 comments

Environment:

ubuntu:22.04
python:3.10

libs installed:

...
opentelemetry-api==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-exporter-gcp-trace==1.6.0
...

app.py

from opentelemetry import trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

tracer_provider = TracerProvider()
trace.set_tracer_provider(tracer_provider)
gcp_exporter = CloudTraceSpanExporter()
span_processor = BatchSpanProcessor(gcp_exporter)
tracer_provider.add_span_processor(span_processor)

running with: gunicorn --worker-class geventwebsocket.gunicorn.workers.GeventWebSocketWorker -w 1 --threads 20 app:app

I'm running my service on Google Cloud App Engine with this configuration, but many times when I create a new span I get the following exception and the service is restarted (I'm using supervisord):

Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
  File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.10/threading.py", line 1008, in _bootstrap_inner
    del _limbo[self]
KeyError: <Thread(OtelBatchSpanProcessor, stopped daemon 139781863898496)>
2024-04-26T18:20:34Z <Greenlet at 0x7f218057fd80: <bound method Thread._bootstrap of <Thread(OtelBatchSpanProcessor, stopped daemon 139781863898496)>>> failed with KeyError
[2024-04-26 18:20:35 +0000] [8] [ERROR] Worker (pid:12) was sent code 139!
2024-04-26 18:20:35,177 INFO reaped unknown pid 164 (exit status 0)
2024-04-26 18:20:35,179 INFO reaped unknown pid 147 (exit status 0)
2024-04-26 18:20:35,179 INFO reaped unknown pid 173 (exit status 0)

Do you know how I can address this issue? As you can see, I'm using gevent==23.9.1.

frx08 avatar Apr 26 '24 18:04 frx08

If you use gevent you should monkey-patch everything you use, and I don't see that in your code.

xrmx avatar Apr 29 '24 15:04 xrmx

Sorry, I didn't include it. Yes, I'm patching everything at the very start of my script:

from gevent import monkey
monkey.patch_all()
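
For completeness, the patching happens before anything else is imported, roughly like this; ConsoleSpanExporter stands in for the GCP exporter purely for illustration:

from gevent import monkey
monkey.patch_all()  # must run before threading/OpenTelemetry are imported anywhere

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# same setup as app.py above, just with a placeholder exporter
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(tracer_provider)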

frx08 avatar Apr 30 '24 06:04 frx08

Observed a similar issue recently after switching to BatchSpanProcessor.

yzhuang93 avatar Nov 07 '24 22:11 yzhuang93

We also monkey-patch in our code.

yzhuang93 avatar Nov 07 '24 22:11 yzhuang93

Issue: Conflict Between BatchSpanProcessor and Gevent in Service

Environment:

  • Gevent: 23.9.1
  • Python: 3.9.19
  • OS: Rocky Linux 8.10 (Green Obsidian)
  • OpenTelemetry: 1.27.0

Service Overview:

The service is complex: it uses the _thread module to create new threads and the subprocess module to manage multiple subprocesses (especially for SSH clients). It also calls gevent.monkey.patch_all() at the very beginning to enable cooperative multitasking. Note that the service constantly creates subprocesses with the subprocess module to run logic unrelated to Python or gevent.

Errors Encountered:

1. Error During BatchSpanProcessor Initialization:

When initializing BatchSpanProcessor, the following error is observed:

Traceback (most recent call last):
  File "src/gevent/_abstract_linkable.py", line 287, in gevent._gevent_c_abstract_linkable.AbstractLinkable._notify_links
  File "src/gevent/_abstract_linkable.py", line 333, in gevent._gevent_c_abstract_linkable.AbstractLinkable._notify_links
AssertionError: (None, <callback at 0x7f522d2748c0 args=([],)>)

2. Error After Service Runs for Some Time (within 30 secs):

After the service runs for a while, the following error appears multiple times with different daemon thread IDs:

Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
  File "/usr/lib64/python3.9/threading.py", line 937, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib64/python3.9/threading.py", line 972, in _bootstrap_inner
    del _limbo[self]
KeyError: <Thread(OtelBatchSpanProcessor, stopped daemon 140119617635328)>
2024-11-15T22:20:58Z <Greenlet at 0x7f7024090400: <bound method Thread._bootstrap of <Thread(OtelBatchSpanProcessor, stopped daemon 140119617635328)>>> failed with KeyError

These errors affect the other threads running in the service and eventually cause the service to enter an unhealthy state, with some hosts becoming unreachable.

3. Investigation Process:

As mentioned above, the problem appears when we initialize BatchSpanProcessor in the service's main process. We suspected that os.register_at_fork(after_in_child=self._at_fork_reinit) might be the cause, so we added the following logging helper, test_fork(), to track forking behavior:

def test_fork(self):
    # inspect and os are imported at module level; log is our logging wrapper
    log.INFO("Otel in fork: PID=%s" % os.getpid())
    stack = inspect.stack()
    log.INFO("Call stack:\n" + "\n".join(
        ["{%s}:{%s} in {%s}" % (frame.filename, frame.lineno, frame.function)
         for frame in stack]))

We registered this function with os.register_at_fork(after_in_child=self.test_fork) and discovered that the BatchSpanProcessor gets reinitialized many times in the SSH subprocesses. It produced the following log output:

2024-11-15 22:35:59,363Z INFO server.py:278 Otel in fork: PID=2402193
2024-11-15 22:35:59,641Z INFO server.py:280 Call stack:
{<HOME>/.venvs/bin/lib/python3.9/site-packages/acropolis/server/server.py}:{279} in {test_fork}
{<gevent-23.9.1>/gevent/os.py}:{426} in {fork_and_watch}
{<gevent-23.9.1>/gevent/subprocess.py}:{1624} in {_execute_child}
{<gevent-23.9.1>/gevent/subprocess.py}:{807} in {__init__}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/base/command.py}:{217} in {timed_command}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/ssh_client.py}:{227} in {__timed_command}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/ssh_client.py}:{484} in {execute}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/ssh_client.py}:{730} in {wrapper}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/net/hypervisor/kvm_ssh_client.py}:{111} in {execute}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/hypervisor/ssh_utils.py}:{42} in {ssh_execute}
{<HOME>/.venvs/bin/lib/python3.9/site-packages/acropolis/hypervisor/kvm/libvirt_connection.py}:{689} in {_get_uefi_nvram_size}
{<HOME>/.venvs/bin/lib/python3.9/site-packages/acropolis/hypervisor/kvm/libvirt_connection.py}:{1308} in {_connect}
{<HOME>/.venvs/bin/lib/python3.9/site-packages/util/hypervisor/connection.py}:{614} in {_reconnect_loop}
{<HOME>/.venvs/bin/lib64/python3.9/site-packages/util/misc/decorators.py}:{47} in {wrapper}

This means the BatchSpanProcessor gets reinitialized many times in these SSH subprocesses. However, when we checked the active threads by dumping all greenlets in the service, only the BatchSpanProcessor running in the main process was visible; the others were not.
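
For reference, here is a minimal standalone sketch (not our service code) of the mechanism as we understand it: handlers registered with os.register_at_fork(after_in_child=...) run in every child created by a Python-level fork, and the stack above shows gevent's fork_and_watch, which takes exactly that path for each SSH subprocess.

import os

def after_in_child():
    # runs in the child right after fork(), before any exec
    print("after_in_child hook fired in PID %s" % os.getpid())

os.register_at_fork(after_in_child=after_in_child)

pid = os.fork()
if pid == 0:
    # child: the hook above has already run by the time we get here
    os._exit(0)
os.waitpid(pid, 0)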

We also tried adding a time.sleep(5) for testing before our OpenTelemetry BatchSpanProcessor initialization (our tracing initialization runs at the end of service startup); with that delay the errors no longer appear every time, only intermittently.

Any ideas on how to solve this problem? Please help. Thanks!

More context: we are migrating from SimpleSpanProcessor to BatchSpanProcessor. SimpleSpanProcessor works well, but our tests show it has performance issues.
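
For reference, the swap itself is just which processor we attach to the provider; a minimal sketch with a placeholder exporter (the tuning values shown are, to our understanding, the SDK defaults):

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,  # placeholder exporter, for illustration only
    SimpleSpanProcessor,
)

provider = TracerProvider()
exporter = ConsoleSpanExporter()

# SimpleSpanProcessor exports synchronously on every span end: no worker
# thread, but every span pays the export latency.
# provider.add_span_processor(SimpleSpanProcessor(exporter))

# BatchSpanProcessor queues spans and flushes them from a daemon thread named
# "OtelBatchSpanProcessor" -- the thread that appears in the tracebacks above.
provider.add_span_processor(
    BatchSpanProcessor(
        exporter,
        max_queue_size=2048,
        schedule_delay_millis=5000,
        max_export_batch_size=512,
    )
)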

scxingnutanix avatar Nov 16 '24 01:11 scxingnutanix

Hi all, did you find any solution for this so far?

saichander17 avatar Jun 09 '25 14:06 saichander17

Hi, I'm facing the same issue. Is there any workaround or solution for this?

drajamal avatar Oct 29 '25 09:10 drajamal