opentelemetry-python
Deadlock in context init
Describe your environment We are using Python 3.8 on Amazon Linux 2 and OpenTelemetry version 1.18. We have a Django application and run it inside Docker.
When calling
docker-compose run --rm app bash -c "echo \"show tables; exit \" | ./manage.py dbshell -- -A | grep auth_user_groups"
with tracing enabled, the app freezes inside opentelemetry code while trying to acquire the runtime context lock, which is never released. This command runs an external process (mysql) based on the Django configuration and does not create extra threads (apart from the Jaeger exporter that we use):
Traceback (most recent call last):
File "/app/common/gc/metrics.py", line 65, in callback
self.trace = tracing.trace_op(OPERATION_NAME)
File "/opt/virtualenvs/py38/lib64/python3.8/site-packages/btspython/tracing/tracer.py", line 296, in trace_op
self.ensure_request_trace_id()
File "/opt/virtualenvs/py38/lib64/python3.8/site-packages/btspython/tracing/tracer.py", line 366, in ensure_request_trace_id
request_trace_id = baggage_api.get_baggage("request-trace-id") # type: ignore
File "/opt/virtualenvs/py38/lib64/python3.8/site-packages/opentelemetry/baggage/__init__.py", line 67, in get_baggage
return get_all(context=context).get(name)
File "/opt/virtualenvs/py38/lib64/python3.8/site-packages/opentelemetry/baggage/__init__.py", line 47, in get_all
baggage = get_value(_BAGGAGE_KEY, context=context)
File "/opt/virtualenvs/py38/lib64/python3.8/site-packages/opentelemetry/context/__init__.py", line 102, in get_value
return context.get(key) if context is not None else get_current().get(key)
File "/opt/virtualenvs/py38/lib64/python3.8/site-packages/opentelemetry/context/__init__.py", line 48, in wrapper
with _RUNTIME_CONTEXT_LOCK:
KeyboardInterrupt:
opentelemetry-api==1.18.0
opentelemetry-exporter-jaeger-thrift==1.18.0
opentelemetry-sdk==1.18.0
opentelemetry-semantic-conventions==0.39b0
Any idea how to track down this issue? Maybe some context implementation is unable to initialize? Where do I get a list?
Steps to reproduce It is not reliably reproducible (it works on the local machine).
What is the expected behavior? Not to lock.
What is the actual behavior? It deadlocks and never completes.
Ran into this bug today. It looks like the exception handler within _load_runtime_context attempts to log the exception while still holding the lock on _RUNTIME_CONTEXT_LOCK, resulting in the deadlock behaviour.
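For illustration, here is a simplified sketch of the shape of that code, not the actual library source; _get_runtime_context_sketch and _load_implementation are hypothetical names standing in for the lazy loading the module does:

import logging
import threading

logger = logging.getLogger(__name__)

_RUNTIME_CONTEXT = None
_RUNTIME_CONTEXT_LOCK = threading.Lock()  # non-reentrant


def _load_implementation():
    # Hypothetical stand-in for loading the configured context entry point.
    raise RuntimeError("no context implementation found")


def _get_runtime_context_sketch():
    global _RUNTIME_CONTEXT
    with _RUNTIME_CONTEXT_LOCK:
        if _RUNTIME_CONTEXT is None:
            try:
                _RUNTIME_CONTEXT = _load_implementation()
            except Exception:
                # This handler still holds the lock. If logging (or anything
                # it triggers, such as a GC callback that calls back into the
                # context API on the same thread) tries to take the lock
                # again, the thread blocks on itself forever.
                logger.exception("Failed to load context implementation")
    return _RUNTIME_CONTEXT

Narrowing the with block so that only the assignment happens under the lock, with the exception handling done outside it, would remove that window.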
This sounds like a bug for sure. @dhausauer-flx if you know a fix, would you mind submitting a PR? Otherwise if someone can provide a minimal repro (I understand it's hard to reproduce but just some code to run locally), we can go from there.
I don't know how to write a reliable repro for this, but does changing the locking behavior to _RUNTIME_CONTEXT_LOCK = threading.RLock() not solve this problem? If so, is there any way to quantify the performance impact generally, which I presume was the reason for using a plain Lock in the first place?
Using an RLock sounds like a good idea in general, but I'm not sure it will solve this problem (hard to say, because I don't think we fully understand it yet). My understanding is that an RLock only helps when the same thread acquires the lock more than once, but is that kind of re-entry actually happening here? (It looks to me like the same thread doesn't request the lock more than once.)
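To make the distinction concrete, a tiny self-contained demo of the behavior the RLock proposal relies on (the timeout is only there so the script terminates instead of hanging):

import threading

reentrant = threading.RLock()
plain = threading.Lock()

with reentrant:
    with reentrant:
        # An RLock tracks the owning thread and a recursion count, so the
        # same thread may acquire it again without blocking.
        print("RLock: same-thread re-entry is fine")

with plain:
    # A plain Lock has no notion of ownership; a second acquire from the
    # same thread simply blocks, which is the deadlock shape in this issue.
    print("Lock: re-acquired?", plain.acquire(timeout=0.5))  # prints False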
Ran into this bug today. It looks like the exception handler within _load_runtime_context attempts to log the exception while still holding the lock on _RUNTIME_CONTEXT_LOCK, resulting in the deadlock behaviour.
Interesting observation @dhausauer-flx. We could release the lock before handling the exception.
More generally, I'm wondering why we're loading an opentelemetry_context implementation at runtime. The default behavior uses Python's contextvars, which seems reasonable. If dynamic loading of a context loader is not part of the spec (and I'm not seeing it there), perhaps it's not needed anymore now that contextvars is available in all supported Python versions (it was added in 3.7).
I think that for our case it might solve the problem. Basically, what we did was trace GC with opentelemetry so we could see the impact of GC (as a new trace) on our REST requests.
In the end, what happened for us was that GC was triggered inside the _RUNTIME_CONTEXT_LOCK section, and the GC phase (our callback) tried to obtain the lock again, making it a deadlock.
RLock should effectively solve the problem. The question is whether there is any downside to using RLock vs Lock, like performance, or a case where another thread needs to release it (asyncio perhaps?).
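To sketch the mechanism (the exact timing is what makes it hard to reproduce in the real app), here is a self-contained approximation; the lock and functions are stand-ins rather than the library's, and the timeout only exists so the demo terminates instead of hanging:

import gc
import threading

_LOCK = threading.Lock()  # stand-in for _RUNTIME_CONTEXT_LOCK


def get_current_sketch():
    # Stand-in for the context API: every call goes through the global lock.
    # The real code does `with _RUNTIME_CONTEXT_LOCK:` and would block forever.
    if not _LOCK.acquire(timeout=1):
        print("deadlock: this thread already holds the lock")
        return {}
    try:
        return {}
    finally:
        _LOCK.release()


def gc_callback(phase, info):
    # Our GC-tracing hook runs synchronously on whichever thread triggered
    # the collection; if that thread is inside the locked section, the call
    # below cannot acquire the lock.
    if phase == "start":
        get_current_sketch()


gc.callbacks.append(gc_callback)
with _LOCK:
    gc.collect()  # collection fires the callback while the lock is held
gc.callbacks.remove(gc_callback)

With an RLock in place of the plain Lock, the nested acquire from the same thread would succeed, which matches why we expect RLock to fix our case.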
Would like to discuss whether we should continue to let folks bring their own opentelemetry_context implementation at all, or whether we should instead just instantiate a ContextVarsRuntimeContext, removing the complexity around loading a custom implementation lazily, synchronizing access, etc.
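For reference, a rough sketch of what a contextvars-only default could look like, simplified and not the SDK's actual ContextVarsRuntimeContext code; the point is that contextvars gives per-thread and per-task isolation without any module-level lock or lazy loading:

from contextvars import ContextVar, Token

# Simplified sketch; the real ContextVarsRuntimeContext is the authoritative version.
_CONTEXT = ContextVar("otel_context_sketch")


def attach(context: dict) -> Token:
    # Make `context` current and return a token that undoes this attach.
    return _CONTEXT.set(context)


def get_current() -> dict:
    # No lock required: ContextVar lookups are already isolated per thread
    # and per asyncio task.
    return _CONTEXT.get({})


def detach(token: Token) -> None:
    # Restore whatever context was current before the matching attach().
    _CONTEXT.reset(token)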
@ocelotl @aabmass @lzchen what do you think? (happy to move this discussion elsewhere, too)
@pmcollins
Might be good to bring this up in the SIG.