dd-trace-py
Asyncio issues due to `DatadogContextMixin._update_active`
Summary of problem
Due to the way `DatadogContextMixin._update_active` is implemented and used, there is a clear issue with context propagation across asynchronously scheduled tasks. Suppose you have a span active and a coroutine wrapped with `@tracer.wrap()`, say `f`. Running `await f()` yields the expected results: `contextvars` correctly propagates the context, and parenting is correct. Now suppose you want to schedule a task for `f` asynchronously, for example with `asyncio.create_task(f())`. `contextvars` again propagates the current context correctly. However, if the span finishes before the task is actually scheduled and started, the context is already invalid by the time the task runs, in the sense that the original span is no longer active, so `DatadogContextMixin._update_active` walks up the parent tree until it hits `None` or an active span.
This can happen quite easily because of how `asyncio` works. If the caller of `f` does not yield control by awaiting something before the traced function or piece of code ends, the task for `f` never gets a chance to be scheduled. So when the newly created span for `f` checks `tracer.context_provider.active()`, it does not get the original span back.
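To illustrate the `contextvars` half of this, here is a toy sketch (a hypothetical stand-in for the tracer's active-span slot, not ddtrace code): the task's context is snapshotted at `create_task()` time, so the task still sees the value even after the caller has moved on; the real problem is that the span object it sees is already finished by then.

```python
import asyncio
import contextvars

# Hypothetical stand-in for the tracer's "active span" slot -- not ddtrace code.
active = contextvars.ContextVar("active", default=None)

async def child():
    # The task runs in a copy of the context taken at create_task() time,
    # so it still sees "parent-span" even though the caller reset it.
    return active.get()

async def main():
    token = active.set("parent-span")
    t = asyncio.create_task(child())  # context snapshot happens here
    active.reset(token)               # caller "finishes" its span
    return await t                    # child only starts running now

print(asyncio.run(main()))  # prints: parent-span
```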
Therefore, unless care is taken to correctly yield control during the rest of the traced code after the task is created, this results in weird parenting. I would expect the span created for `f` to inherit the original span as its parent, even if it has already ended. I am not sure that this is allowed in DD, but it is definitely something that I think should work in some sense. In my case this results in a lot of traces that look wrong: the spans should have a clear hierarchical structure, but it is all messed up and the parenting is all over the place.
A potential workaround for a client of `dd-trace-py` could be to ensure a yield by calling `await asyncio.sleep(0)` at the very minimum. I am not 100% certain that this will fix all the issues, however, as the asyncio loop internals are a bit arcane to me (and depend on the implementation used).
Which version of dd-trace-py are you using?
Latest (2.7.1).
How can we reproduce your problem?
```python
import asyncio

from ddtrace import tracer


@tracer.wrap()
async def task(i):
    print("task", i, tracer.context_provider.active())
    await asyncio.sleep(0.1)


@tracer.wrap()
async def runner(wait=False):
    print("runner", tracer.context_provider.active())
    await task(1)
    t = asyncio.create_task(task(2))
    if wait:
        await asyncio.sleep(0)
    return t


async def test():
    print("no yielding")
    await runner()
    print("yielding")
    await runner(True)


asyncio.run(test())
```
What is the result that you get?
```
no yielding
runner <Span(id=7628650944197831528,trace_id=135462679929056883042018941076757191426,parent_id=None,name=__main__.runner)>
task 1 <Span(id=8085636790360746386,trace_id=135462679929056883042018941076757191426,parent_id=7628650944197831528,name=__main__.task)>
yielding
runner <Span(id=10895955304451890772,trace_id=135462679929056883038997047737130036160,parent_id=None,name=__main__.runner)>
task 1 <Span(id=14103984175716321618,trace_id=135462679929056883038997047737130036160,parent_id=10895955304451890772,name=__main__.task)>
task 2 <Span(id=3001407150615082204,trace_id=135462679929056883030652120662978821080,parent_id=None,name=__main__.task)>
task 2 <Span(id=14852889101994205982,trace_id=135462679929056883038997047737130036160,parent_id=10895955304451890772,name=__main__.task)>
```
What is the result that you expected?
I expected `<Span(id=3001407150615082204,trace_id=135462679929056883030652120662978821080,parent_id=None,name=__main__.task)>` to have the correct parent.
@mklokocka Thanks for the report and the simple reproduction case. We'll take a look and see if there's a straightforward way to fix this issue in ddtrace.
Also worth noting that this issue can also happen when creating an asyncio task in a non-coroutine context. For example, I set up some data loading tasks in a tracer-wrapped `__init__` of a class by calling `asyncio.create_task`. A span for `__init__` is created, but I cannot use the yield trick to delay the end of the `__init__` execution and thus prolong the span until the tasks actually start.
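A minimal illustration of this synchronous case (sketch only; the `@tracer.wrap()` decoration is omitted so it runs without ddtrace): a task created inside a plain `__init__` cannot be yielded into before `__init__` returns, so a span wrapping `__init__` would end before the task ever starts.

```python
import asyncio

class Loader:
    def __init__(self):
        # In the real scenario this __init__ would be decorated with
        # @tracer.wrap(); being synchronous, it cannot await anything,
        # so it returns (and its span would end) before _load() starts.
        self.task = asyncio.create_task(self._load())

    async def _load(self):
        await asyncio.sleep(0)
        return "loaded"

async def main():
    loader = Loader()  # __init__ returns before _load() runs at all
    return await loader.task

print(asyncio.run(main()))  # prints: loaded
```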
As I said, I am not 100% certain there is a way to properly support this in DD. It depends on whether having spans parented by spans that finished before they start would break something.
Hi @mklokocka,
We are actually deprecating the ddtrace `asyncio` integration in the next release (v2.8.0): https://github.com/DataDog/dd-trace-py/pull/8612. The asyncio integration and async context provider were added to support using asyncio and contextvars with `python<3.7`. Since ddtrace v2.0 dropped support for `python<3.7`, there is no need to enable this integration.
If you are using `ddtrace>=2.0` or `python>=3.7`, you can disable the asyncio integration by setting the environment variable `DD_TRACE_ASYNCIO_ENABLED=False`.
Feel free to reopen this issue if this doesn't work for you 😄
Hi @mabdinur, I am sorry, but this is not an issue with the integration for `asyncio`, but rather with how context propagation with `contextvars` works. So removing the old integration shouldn't matter for what I have described at all.
I don't think I can reopen the issue.
Sorry for the confusion,
You're right, the issue lies in `DatadogContextMixin` and not the deprecated `AsyncioContextProvider`. Using `asyncio.sleep(0)` will suspend the execution of the active coroutine and give newly scheduled tasks an opportunity to run. When the newly scheduled tasks start, the `tracer.wrap(...)` instrumentation is executed. This is because the Python tracer parents async tasks based on when the task is executed, NOT when the task was created. I can see how this behavior can be confusing, and it is likely a bug.
Using `asyncio.sleep(0)` to begin the execution of scheduled tasks seems like a reasonable change; however, there might be edge cases and runtimes where this operation does not start ALL scheduled tasks. I think a better approach could be to capture the current active span when a task is scheduled and ensure it is used to parent downstream tasks. I will open a PR to propose this new behavior.
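The capture-at-schedule idea can be sketched with a toy span model (all names here — `Span`, `active`, `schedule` — are hypothetical stand-ins, not ddtrace internals): snapshot the active span when the task is scheduled, and use that snapshot as the parent even if it has finished by the time the task runs.

```python
import asyncio
import contextvars
from dataclasses import dataclass

# Toy model of the behavior discussed above -- NOT ddtrace code.
@dataclass
class Span:
    name: str
    parent: "Span | None" = None
    finished: bool = False

_current = contextvars.ContextVar("current", default=None)

def active():
    # Mimics the _update_active walk: skip finished spans going upwards.
    span = _current.get()
    while span is not None and span.finished:
        span = span.parent
    return span

def schedule(coro_fn):
    parent = active()  # snapshot the active span at *scheduling* time

    async def runner():
        # Parent the new span on the snapshot, even if it has since finished.
        child = Span("child", parent=parent)
        _current.set(child)
        try:
            return await coro_fn()
        finally:
            child.finished = True

    return asyncio.create_task(runner())

async def work():
    return active().parent  # which span ended up as our parent?

async def main():
    root = Span("root")
    _current.set(root)
    t = schedule(work)
    root.finished = True  # root "finishes" before the task runs
    return await t

print(asyncio.run(main()).name)  # prints: root
```

Without the snapshot, the task would consult `active()` only once it starts, find `root` finished, and fall back past it, which is exactly the mis-parenting reported above.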
Thanks again for bringing this issue to our attention
Also consider the synchronous situation: you can run `asyncio.create_task(...)` for a traced coroutine from a normal, sync function. If that function is traced, its span will finish before the task is scheduled (and before the span for the scheduled task starts).
The question is whether Datadog supports spans that start after their parent span has already ended.
Datadog supports starting a span after a parent has ended. This is a valid operation:

```python
from ddtrace import tracer

parent = tracer.trace("parent")
parent.finish()

child = tracer.start_span("child", child_of=parent)
child.finish()
```
However, this is not the default behavior. By default, the current active span becomes the parent span when a new span is created.
I looked at your PR. Hmm, I am not quite sure about it. First, there are also the lower-level `loop.create_task` and `asyncio.ensure_future` methods, which this does not cover. Second, I think you actually have to await the wrapped `coro` in `traced_coro`; otherwise you just return a reference.
What are the reasons to keep `DatadogContextMixin._update_active` as is?
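The point about awaiting can be shown in miniature (a generic sketch, not the PR's actual code; `traced_coro` here is just an illustrative wrapper):

```python
import asyncio

async def wrapped():
    return 42

# A wrapping coroutine must await the coroutine it wraps; returning `coro`
# without awaiting would hand callers a bare, un-run coroutine object
# instead of its result.
async def traced_coro(coro):
    return await coro

print(asyncio.run(traced_coro(wrapped())))  # prints: 42
```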
Sorry about that. The PR was in a rough state. It is finally reviewable and should address both of your concerns.
The duration of a parent span is often independent of its child spans. In async contexts the following traces are all valid:

```
Case 1: parent ends after all child spans
[ parent 1           ]
  [child 1] [child 2]

Case 2: parent ends before child spans end
[ parent 1 ]
  [child 1] [child 2]

Case 3: parent ends before child spans start
[ parent 1 ]
             [child 1] [child 2]
```
To make this asyncio fix in `DatadogContextMixin._update_active`, we would need to delay finishing a parent span until all spawned coroutines are started with the correct trace context. Python 3.12 introduced support for eagerly starting coroutines, providing an API for executing coroutines as they are scheduled. Leveraging this API could allow us to support this span parenting in `_update_active`, but the functionality does not exist in python<3.12 😞
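For reference, a minimal sketch of the Python 3.12 eager task factory mentioned above (the eager behavior requires Python 3.12+; on older versions the task starts lazily on the next loop iteration):

```python
import asyncio
import sys

async def child():
    # No await inside, so under an eager factory this runs to completion
    # synchronously, inside create_task() itself.
    return "ran"

async def main():
    if sys.version_info >= (3, 12):
        # Python 3.12+: newly created tasks start executing immediately,
        # up to their first await, instead of waiting for a loop iteration.
        asyncio.get_running_loop().set_task_factory(asyncio.eager_task_factory)
    t = asyncio.create_task(child())
    done_at_creation = t.done()  # True with the eager factory, False without
    await t
    return done_at_creation

print(asyncio.run(main()))
```

With eager execution, the task body runs while the caller's span is still active, which is exactly the window the parenting fix needs.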
tl;dr: Supporting this functionality on span finish and detecting when an async task starts executing could be tricky. I think it is best to provide this functionality on span start by patching the `BaseEventLoop`.
So all three of `asyncio.create_task`, `asyncio.ensure_future`, and `loop.create_task` are going to work?
How is this going to work with a different event loop implementation, e.g. `uvloop`?
@mabdinur Sorry to bother you even after closing, but shouldn't the MR patch `AbstractEventLoop` somehow instead, to ensure compatibility with third-party event loops?
@mklokocka Good point. Instrumenting `AbstractEventLoop.create_task` instead of `BaseEventLoop.create_task` would be ideal, but the ddtrace library does not support instrumenting abstract methods. ddtrace applies instrumentation to methods that are actually called; since `AbstractEventLoop.create_task` is unimplemented and therefore never called by subclasses, it can't be traced.
We would need to introduce a new set of instrumentation to support this functionality: for example, a helper that wraps the initialization of an object and ensures all instances of abstract classes have the expected instrumentation (however, this would add significant overhead to object creation). This is something we can look at as a separate feature request. I will create a card to track this. Thanks again for your insight 😄
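A rough sketch of that helper idea (hypothetical, not a ddtrace API): patch a concrete class's `__init__` so every new instance gets a wrapped `create_task`, regardless of which subclass implements it. The `ToyLoop` below is a stand-in, not a real event loop.

```python
import functools

def instrument_loop_class(cls):
    """Hypothetical helper: patch a concrete event-loop-like class so each
    new instance gets an instrumented create_task. Not a ddtrace API."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        original_create_task = self.create_task

        def traced_create_task(coro, **kw):
            # Here the tracer could capture the active span before scheduling.
            return original_create_task(coro, **kw)

        # Shadow the method on the instance with the wrapped version.
        self.create_task = traced_create_task

    cls.__init__ = patched_init
    return cls

# Toy demonstration with a stand-in "loop" class:
class ToyLoop:
    def __init__(self):
        self.scheduled = []

    def create_task(self, coro, **kw):
        self.scheduled.append(coro)
        return coro

instrument_loop_class(ToyLoop)
loop = ToyLoop()
loop.create_task("coro-1")
print(loop.scheduled)  # prints: ['coro-1']
```

This per-instance wrapping is where the overhead mentioned above comes from: every object construction pays for the rebinding, which is why it fits better as a separate opt-in feature.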