sdk-python
sdk-python copied to clipboard
[Bug] GeneratorExit possibly causing issues on context detach in OTel finally
Describe the bug
Reports of:
Failed to detach context
Traceback (most recent call last):
File "/path/to/temporalio/contrib/opentelemetry.py", line 427, in _top_level_workflow_context
yield None
File "/path/to/temporalio/contrib/opentelemetry.py", line 340, in execute_workflow
return await super().execute_workflow(input)
GeneratorExit
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/path/to/opentelemetry/context/__init__.py", line 157, in detach
_RUNTIME_CONTEXT.detach(token) # type: ignore
File "/path/to/opentelemetry/context/contextvars_context.py", line 50, in detach
self._current_context.reset(token) # type: ignore
ValueError: <Token var=<ContextVar name='current_context' default={} at 0x79340011d590> at 0x7933a4b4c9c0> was created in a different Context"
There are outstanding issues like https://github.com/open-telemetry/opentelemetry-python/issues/2606 that may be related.
Is there any workaround for this? We have tons of these errors in our logs forcing us to disable tracing in our workers.
I am not aware of an obvious one (see linked upstream ticket). We may be able to make sure everywhere we detach
we aren't doing so in a GeneratorExit
.
However, we have found that GeneratorExit
has other problems, so we have made a fairly substantial change to workflow teardown in #499 that should prevent GeneratorExit
from occurring within common workflow use cases (i.e. we try to force tasks to complete before letting garbage collector take over and raise this). This will come out in next release, but if you'd like you can build main
now and test it. Your errors may go away.
I wouldn't be surprised if this is some OpenTelemetry users' issue too - GeneratorExit
can occur in a completely different thread than the async code was originally run within and may have lost all contextual information. It's a very dangerous thing Python does here (waking up tasks on GC in any thread).
@cretz As I commented on #2606 I was able to work around a similar issue. Feel free to contact me if you are interested in more details. https://github.com/open-telemetry/opentelemetry-python/issues/2606#issuecomment-1789515786
I suspect that if the otel teardown_request hook "opentelemetry-flask.activation_key" could somehow know that it was not in the correct context (e.g. on the wrong thread, or different async context) and therefore decide to do nothing, this might resolve the issue.
Hey @benliddicott I've been following this issue for python-otel, and put out a flask-specific fix awhile back: https://github.com/open-telemetry/opentelemetry-python-contrib/pull/1692. Since you're still recently having issues, I'm imagining that didn't resolve it for you (unless you're on an old version?)
Maybe the changes / PR desc could be helpful debugging your issue?
For the Temporal SDK here in this GH issue, we don't even use Flask. The GC can wake up the coroutine in any thread, including the same one it may have started in but is currently being used for something else by the thread pool. There's not much you can do but avoid letting coroutines/generators get GC'd (i.e. cancel and wait for graceful completion).
Ya sorry about the flask noise, probably better to post what I did back in https://github.com/open-telemetry/opentelemetry-python/issues/2606
I'm now on a new project where we're running into the same error, and it's moreso related to generators/coroutines that you are mentioning; that's why I'm also following this thread
No prob, yeah I flat out could not find a good solution besides not ever reaching GeneratorExit
(or coroutine GC). There's no way to disable this behavior, no way to intercept, and with threading (especially thread pools) no guarantee where it may wake up.
So that's how we solved this, just making sure all tasks complete (but leaving open until we release it)
@matthewgrossman I am not having issues, as I have a workaround, as described in https://github.com/open-telemetry/opentelemetry-python/issues/2606#issuecomment-1789515786 so there is no urgency from my side.
@cretz I haven't tried to find out if my workaround is still necessary in the latest release - all I can say is that the version we are using hasn't broken the workaround.
@benliddicott would you be available to share the code you used as a solution to this? I am running into the same issue as well.