opentelemetry-python
Deadlock in context init
Describe your environment We are using Python 3.8 on Amazon Linux 2 and OpenTelemetry version 1.18. We have a Django application and run it inside Docker.
When calling
docker-compose run --rm app bash -c "echo \"show tables; exit \" | ./manage.py dbshell -- -A | grep auth_user_groups"
with tracing enabled, the app freezes inside opentelemetry code while trying to acquire the runtime context lock, which is never released. This command runs an external process (mysql) based on the Django configuration and does not create extra threads (apart from the Jaeger exporter that we use):
Traceback (most recent call last):
File "/app/common/gc/metrics.py", line 65, in callback
self.trace = tracing.trace_op(OPERATION_NAME)
File "/opt/virtualenvs/py38/lib64/python3.8/site-packages/btspython/tracing/tracer.py", line 296, in trace_op
self.ensure_request_trace_id()
File "/opt/virtualenvs/py38/lib64/python3.8/site-packages/btspython/tracing/tracer.py", line 366, in ensure_request_trace_id
request_trace_id = baggage_api.get_baggage("request-trace-id") # type: ignore
File "/opt/virtualenvs/py38/lib64/python3.8/site-packages/opentelemetry/baggage/__init__.py", line 67, in get_baggage
return get_all(context=context).get(name)
File "/opt/virtualenvs/py38/lib64/python3.8/site-packages/opentelemetry/baggage/__init__.py", line 47, in get_all
baggage = get_value(_BAGGAGE_KEY, context=context)
File "/opt/virtualenvs/py38/lib64/python3.8/site-packages/opentelemetry/context/__init__.py", line 102, in get_value
return context.get(key) if context is not None else get_current().get(key)
File "/opt/virtualenvs/py38/lib64/python3.8/site-packages/opentelemetry/context/__init__.py", line 48, in wrapper
with _RUNTIME_CONTEXT_LOCK:
KeyboardInterrupt:
opentelemetry-api==1.18.0
opentelemetry-exporter-jaeger-thrift==1.18.0
opentelemetry-sdk==1.18.0
opentelemetry-semantic-conventions==0.39b0
Any idea how to track down this issue? Maybe some context implementation is unable to initialize? Where do I get a list?
Steps to reproduce It is not reliably reproducible (it works on the local machine).
What is the expected behavior? Not to lock.
What is the actual behavior? It deadlocks and never completes.
Ran into this bug today. It looks like the exception handler within _load_runtime_context attempts to log the exception while still holding the lock on _RUNTIME_CONTEXT_LOCK, resulting in the deadlock behaviour.
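For illustration, here is a simplified sketch of the shape of that code, not the actual library source; _get_runtime_context_sketch and _load_implementation are hypothetical names standing in for the lazy loading the module does:

import logging
import threading

logger = logging.getLogger(__name__)

_RUNTIME_CONTEXT = None
_RUNTIME_CONTEXT_LOCK = threading.Lock()  # non-reentrant


def _load_implementation():
    # Hypothetical stand-in for loading the configured context entry point.
    raise RuntimeError("no context implementation found")


def _get_runtime_context_sketch():
    global _RUNTIME_CONTEXT
    with _RUNTIME_CONTEXT_LOCK:
        if _RUNTIME_CONTEXT is None:
            try:
                _RUNTIME_CONTEXT = _load_implementation()
            except Exception:
                # This handler still holds the lock. If logging (or anything
                # it triggers, such as a GC callback that calls back into the
                # context API on the same thread) tries to take the lock
                # again, the thread blocks on itself forever.
                logger.exception("Failed to load context implementation")
    return _RUNTIME_CONTEXT

Narrowing the with block so that only the assignment happens under the lock, with the exception handling done outside it, would remove that window.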
This sounds like a bug for sure. @dhausauer-flx if you know a fix, would you mind submitting a PR? Otherwise if someone can provide a minimal repro (I understand it's hard to reproduce but just some code to run locally), we can go from there.
I don't know how to write a reliable repro for this, but does changing the locking behavior to _RUNTIME_CONTEXT_LOCK = threading.RLock() not solve this problem? If so, is there any way to quantify the performance impact generally, which I presume was the reason for using a plain Lock in the first place?
Using an RLock sounds like a good idea in general, but I'm not sure it will solve this problem (hard to say, because I don't think we fully understand it yet). My understanding is that an RLock only helps when the same thread acquires the lock more than once, but is that kind of re-entry actually happening here? (It looks to me like the same thread doesn't request the lock more than once.)
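To make the distinction concrete, a tiny self-contained demo of the behavior the RLock proposal relies on (the timeout is only there so the script terminates instead of hanging):

import threading

reentrant = threading.RLock()
plain = threading.Lock()

with reentrant:
    with reentrant:
        # An RLock tracks the owning thread and a recursion count, so the
        # same thread may acquire it again without blocking.
        print("RLock: same-thread re-entry is fine")

with plain:
    # A plain Lock has no notion of ownership; a second acquire from the
    # same thread simply blocks, which is the deadlock shape in this issue.
    print("Lock: re-acquired?", plain.acquire(timeout=0.5))  # prints False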
Ran into this bug today. It looks like the exception handler within _load_runtime_context attempts to log the exception while still holding the lock on _RUNTIME_CONTEXT_LOCK, resulting in the deadlock behaviour.
Interesting observation @dhausauer-flx. We could release the lock before handling the exception.
More generally, I'm wondering why we're loading an opentelemetry_context implementation at runtime. The default behavior uses Python's contextvars, which seems reasonable. If dynamic loading of a context loader is not part of the spec (and I'm not seeing it there), perhaps it's not needed anymore now that contextvars is available in all supported Python versions (it was added in 3.7).
I think that for our case it might solve the problem. Basically, what we did was trace GC with opentelemetry so we could see the impact of GC (as a new trace) on our REST requests.
In the end, what happened for us was that GC was triggered inside the _RUNTIME_CONTEXT_LOCK section, and the GC phase (our callback) tried to obtain the lock again, making it a deadlock.
RLock should effectively solve the problem. The question is whether there is any downside to using RLock vs Lock, like performance, or a case where another thread needs to release it (asyncio perhaps?).
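To sketch the mechanism (the exact timing is what makes it hard to reproduce in the real app), here is a self-contained approximation; the lock and functions are stand-ins rather than the library's, and the timeout only exists so the demo terminates instead of hanging:

import gc
import threading

_LOCK = threading.Lock()  # stand-in for _RUNTIME_CONTEXT_LOCK


def get_current_sketch():
    # Stand-in for the context API: every call goes through the global lock.
    # The real code does `with _RUNTIME_CONTEXT_LOCK:` and would block forever.
    if not _LOCK.acquire(timeout=1):
        print("deadlock: this thread already holds the lock")
        return {}
    try:
        return {}
    finally:
        _LOCK.release()


def gc_callback(phase, info):
    # Our GC-tracing hook runs synchronously on whichever thread triggered
    # the collection; if that thread is inside the locked section, the call
    # below cannot acquire the lock.
    if phase == "start":
        get_current_sketch()


gc.callbacks.append(gc_callback)
with _LOCK:
    gc.collect()  # collection fires the callback while the lock is held
gc.callbacks.remove(gc_callback)

With an RLock in place of the plain Lock, the nested acquire from the same thread would succeed, which matches why we expect RLock to fix our case.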
Would like to discuss whether we should continue to let folks bring their own opentelemetry_context implementation at all, or whether we should instead just instantiate a ContextVarsRuntimeContext, removing the complexity around loading a custom implementation lazily, synchronizing access, etc.
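For reference, a rough sketch of what a contextvars-only default could look like, simplified and not the SDK's actual ContextVarsRuntimeContext code; the point is that contextvars gives per-thread and per-task isolation without any module-level lock or lazy loading:

from contextvars import ContextVar, Token

# Simplified sketch; the real ContextVarsRuntimeContext is the authoritative version.
_CONTEXT = ContextVar("otel_context_sketch")


def attach(context: dict) -> Token:
    # Make `context` current and return a token that undoes this attach.
    return _CONTEXT.set(context)


def get_current() -> dict:
    # No lock required: ContextVar lookups are already isolated per thread
    # and per asyncio task.
    return _CONTEXT.get({})


def detach(token: Token) -> None:
    # Restore whatever context was current before the matching attach().
    _CONTEXT.reset(token)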
@ocelotl @aabmass @lzchen what do you think? (happy to move this discussion elsewhere, too)
@pmcollins
Might be good to bring this up in the SIG.