
[BUG]: opentelemetry.context - ERROR - Failed to detach context

Open ryzhiy-kot opened this issue 10 months ago • 9 comments

Where do you use Phoenix

Local (self-hosted)

What version of Phoenix are you using?

10.10.0

What operating system are you seeing the problem on?

windows

What version of Python are you running Phoenix with?

3.12

What version of Python or Node are you using instrumentation with?

python 3.12

What instrumentation are you using?

arize-phoenix==10.10.0
arize-phoenix-client==1.10.0
arize-phoenix-evals==0.20.8
arize-phoenix-otel==0.10.3
opentelemetry-api 1.34.0
opentelemetry-exporter-gcp-trace 1.9.0
opentelemetry-exporter-otlp 1.34.0
opentelemetry-exporter-otlp-proto-common 1.34.0
opentelemetry-exporter-otlp-proto-grpc 1.34.0
opentelemetry-exporter-otlp-proto-http 1.34.0
opentelemetry-instrumentation 0.55b0
opentelemetry-proto 1.34.0
opentelemetry-resourcedetector-gcp 1.9.0a0
opentelemetry-sdk 1.34.0
opentelemetry-semantic-conventions 0.55b0

#GOOGLE ADK tools
google-adk==1.2.1
google-api-core==2.25.0
google-api-python-client==2.171.0
google-auth==2.40.3
google-auth-httplib2==0.2.0
google-cloud-aiplatform==1.96.0
google-cloud-appengine-logging==1.6.1
google-cloud-audit-log==0.3.2
google-cloud-bigquery==3.34.0
google-cloud-core==2.4.3
google-cloud-logging==3.12.1
google-cloud-resource-manager==1.14.2
google-cloud-secret-manager==2.24.0
google-cloud-speech==2.32.0
google-cloud-storage==2.19.0
google-cloud-trace==1.16.1
google-crc32c==1.7.1
google-genai==1.19.0
google-resumable-media==2.7.2
googleapis-common-protos==1.70.0

What happened?

2025-06-10 16:22:28 - opentelemetry.context - ERROR - Failed to detach context
Traceback (most recent call last):
  File "e:\temp.venv\Lib\site-packages\opentelemetry\trace\__init__.py", line 589, in use_span
    yield span
  File "e:\temp.venv\Lib\site-packages\openinference\instrumentation\_tracers.py", line 140, in start_as_current_span
    yield cast(OpenInferenceSpan, current_span)
  File "e:\temp.venv\Lib\site-packages\opentelemetry\trace\__init__.py", line 454, in start_as_current_span
    yield span
  File "e:\temp.venv\Lib\site-packages\google\adk\runners.py", line 200, in run_async
    yield event
asyncio.exceptions.CancelledError

What did you expect to happen?

No exception

How can we reproduce the bug?

I'm using GOOGLE ADK for agents development. The error occurs when a root agent finishes execution. Intermediary model invocations and responses are captured correctly, but the very last invocation causes this exception.

Additional information

No response

ryzhiy-kot avatar Jun 11 '25 03:06 ryzhiy-kot

Now I'm facing a different problem: ValueError: <Token var=<ContextVar name='current_context' default={} at 0x0000019F460692B0> at 0x0000019F59C8EBC0> was created in a different Context

ryzhiy-kot avatar Jun 11 '25 06:06 ryzhiy-kot

Thanks for the report. Do you have a code snippet that I can use to reproduce this issue?

RogerHYang avatar Jun 11 '25 15:06 RogerHYang

We just did some investigation into this and are looking into a workaround, but the error message itself is most likely harmless.

RogerHYang avatar Jun 11 '25 21:06 RogerHYang

This issue can be reproduced in Python outside of any instrumentation library as shown by the code snippet below. The root cause is a context mismatch during exception handling. When an exception occurs, Python calls the contextmanager to perform cleanup, and in this case the cleanup involves detaching the token. However, the cleanup actually runs in a different context than the one where the contextmanager initialized the token. This context mismatch causes the detach operation to fail.

import asyncio

from opentelemetry.sdk.trace import TracerProvider

tracer = TracerProvider().get_tracer("test")

async def g():
    yield 1
    
async def f():
    with tracer.start_as_current_span("test"):
        async for _ in g():
            yield _
            
async def main():
    async for _ in f():
        raise ValueError()

asyncio.run(main())

The snippet above combines start_as_current_span with a generator, and the same pattern is used inside the ADK source code, e.g. here. Since our instrumentation library (partially) relies on the ADK source code for span creation, it's not straightforward to suppress the error message in this scenario. However, the error message itself is most likely harmless given the root cause described above.

P.S. As additional illustrations, the code snippets below are slight variations that raise the exception in different contexts. Neither of these has a problem with detaching the token because the cleanup takes place in the same context as the one that initialized the token.

import asyncio

from opentelemetry.sdk.trace import TracerProvider

tracer = TracerProvider().get_tracer("test")

async def g():
    yield 1
    
async def f():
    with tracer.start_as_current_span("test"):
        async for _ in g():
            yield _
            raise ValueError()
            
async def main():
    async for _ in f():
        ...

asyncio.run(main())

import asyncio

from opentelemetry.sdk.trace import TracerProvider

tracer = TracerProvider().get_tracer("test")

async def g():
    yield 1
    raise ValueError()
    
async def f():
    with tracer.start_as_current_span("test"):
        async for _ in g():
            yield _
            
async def main():
    async for _ in f():
        ...

asyncio.run(main())
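
If the message really is harmless in a given application, one option on the application side is a standard logging filter. The sketch below is only an illustration: it assumes the message is emitted on the opentelemetry.context logger with the text "Failed to detach context" (as in the log line in the report), and the class name DropDetachFailure is hypothetical. Note that it hides every occurrence of that message, including ones that might point to a real problem.

```python
import logging

# Assumption: this is the exact message text seen in the report's log line.
DETACH_MESSAGE = "Failed to detach context"


class DropDetachFailure(logging.Filter):
    """Hypothetical filter that drops the detach-failure log records."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False discards the record; everything else passes through.
        return DETACH_MESSAGE not in record.getMessage()


# Assumption: the records are logged directly on this logger, as the
# "opentelemetry.context" name in the report suggests.
otel_context_logger = logging.getLogger("opentelemetry.context")
otel_context_logger.addFilter(DropDetachFailure())
```

Other messages from the same logger are left untouched, so unrelated context errors would still be reported.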

RogerHYang avatar Jun 12 '25 22:06 RogerHYang

Unfortunately, unwinding the ContextVar in this situation appears to be impossible. The code snippet below demonstrates that the ContextVar value persists after the failed attempt to reset. This appears to be a limitation of Python itself.

import asyncio
from contextvars import ContextVar

cv = ContextVar[float]("x", default=42)


async def f():
    # Set value and get token in Context A
    token = cv.set(3.14)
    print(f"Generator set cv to: {cv.get()}")

    try:
        yield 1
    finally:
        # We're now in Context B (due to exception from main)
        print(f"Finally block inside generator: {cv.get()}")

        try:
            cv.reset(token)  # This fails
        except ValueError as exc:
            print(f"✗ Token reset failed: {exc}")


async def main():
    print(f"Initial default: {cv.get()}")

    try:
        async for _ in f():
            raise RuntimeError()
    except RuntimeError:
        print("Handling RuntimeError")

    # Value persists because token reset failed
    print(f"After generator: {cv.get()}")


asyncio.run(main())
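
The same ValueError can be sketched without asyncio, which may make the rule easier to see: a token can only be reset in the Context where it was created. The names ctx_a, ctx_b, and try_reset below are made up for this illustration.

```python
from contextvars import ContextVar, copy_context

cv: ContextVar[float] = ContextVar("x", default=42.0)

# Two independent Context objects, standing in for "Context A" and "Context B".
ctx_a = copy_context()
ctx_b = copy_context()

# Set the value inside Context A; the returned token is tied to Context A.
token = ctx_a.run(cv.set, 3.14)


def try_reset() -> str:
    """Attempt the reset in whichever Context this function is run in."""
    try:
        cv.reset(token)
        return "ok"
    except ValueError as exc:
        return str(exc)


# Resetting from Context B fails: the token belongs to Context A.
result_b = ctx_b.run(try_reset)
# Resetting from Context A, where the token was created, succeeds.
result_a = ctx_a.run(try_reset)
# After the successful reset, Context A sees the default again.
value_in_a = ctx_a.run(cv.get)

print(result_b)
print(result_a, value_in_a)
```

In the generator case above, the finally block plays the role of the reset attempted from ctx_b.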

RogerHYang avatar Jun 13 '25 14:06 RogerHYang

See the official Python explanation here (it's not a bug).

RogerHYang avatar Jun 13 '25 15:06 RogerHYang

Existing discussion from OTEL here.

RogerHYang avatar Jun 13 '25 15:06 RogerHYang

Unfortunately, we cannot fix this, so we are marking it as blocked.

mikeldking avatar Jul 11 '25 16:07 mikeldking

I came across this issue and found out that it was caused by a return inside an async for loop iterating over the generator returned by ADK's Runner.run_async. Maybe this helps someone 🙂

More info: https://github.com/langfuse/langfuse/issues/8316#issuecomment-3564078695

pedroter7 avatar Nov 21 '25 18:11 pedroter7