
Asyncio issues due to `DatadogContextMixin._update_active`

Open mklokocka opened this issue 11 months ago • 8 comments

Summary of problem

Due to the way `DatadogContextMixin._update_active` is implemented and used, there is a clear issue with context propagation across asynchronously scheduled tasks. Suppose you have an active span and a coroutine `f` wrapped with `@tracer.wrap()`. Running `await f()` yields the expected results: contextvars correctly propagate the context and parenting is correct. Now suppose you instead schedule `f` as a task, for example with `asyncio.create_task(f())`. contextvars again propagate the current context correctly. However, if the span finishes before the task actually starts running, the context is invalid by the time the task executes, in the sense that the original span is no longer active, so `DatadogContextMixin._update_active` walks up the parent tree until it hits `None` or an active span.

This can happen quite easily because of the way asyncio works. If the caller of `f` does not yield control by awaiting something before the traced function/piece of code ends, the task for `f` never gets a chance to start. When the newly created span for `f` then checks `tracer.context_provider.active()`, it does not get the original span back.

So unless care is taken to yield control somewhere in the traced code after the task is created, the result is broken parenting. I would expect the span created for `f` to inherit the original span as its parent, even if that span has already finished. I am not sure this is allowed in DD, but it is definitely something that I think should work in some sense. In my case this results in a lot of traces that look wrong: the spans should have a clear hierarchical structure, but the parenting is all over the place.

A potential workaround on the client side is to ensure a yield by calling `await asyncio.sleep(0)` at the very minimum. I am not 100% certain this fixes all cases, however, as the asyncio loop internals are a bit arcane to me (and depend on the implementation used).
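
Another possibility I can think of is to capture the active span when the task is created and re-activate it inside the task. A rough, untested sketch (`create_task_with_parent` is just an illustrative name, and the behavior around `_update_active` is my assumption):

import asyncio

from ddtrace import tracer


def create_task_with_parent(coro):
    # Hypothetical helper: snapshot the span that is active when the task
    # is scheduled, and re-activate its context once the task starts.
    parent = tracer.current_span()

    async def _runner():
        if parent is not None:
            # Activate the parent's Context rather than the Span object;
            # assumption: a plain Context is not subject to the
            # finished-span walk in DatadogContextMixin._update_active.
            tracer.context_provider.activate(parent.context)
        return await coro

    return asyncio.create_task(_runner())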

Which version of dd-trace-py are you using?

Latest (2.7.1).

How can we reproduce your problem?

import asyncio
from ddtrace import tracer

@tracer.wrap()
async def task(i):
    print("task", i, tracer.context_provider.active())
    await asyncio.sleep(0.1)

@tracer.wrap()
async def runner(wait=False):
    print("runner", tracer.context_provider.active())
    await task(1)
    # Schedule task(2) without awaiting it; it only starts running once
    # the current coroutine yields control back to the event loop.
    t = asyncio.create_task(task(2))
    if wait:
        # Yield so the scheduled task starts while runner's span is still active.
        await asyncio.sleep(0)
    return t

async def test():
    print("no yielding")
    await runner()
    print("yielding")
    await runner(True)

asyncio.run(test())

What is the result that you get?

no yielding
runner <Span(id=7628650944197831528,trace_id=135462679929056883042018941076757191426,parent_id=None,name=__main__.runner)>
task 1 <Span(id=8085636790360746386,trace_id=135462679929056883042018941076757191426,parent_id=7628650944197831528,name=__main__.task)>
yielding
runner <Span(id=10895955304451890772,trace_id=135462679929056883038997047737130036160,parent_id=None,name=__main__.runner)>
task 1 <Span(id=14103984175716321618,trace_id=135462679929056883038997047737130036160,parent_id=10895955304451890772,name=__main__.task)>
task 2 <Span(id=3001407150615082204,trace_id=135462679929056883030652120662978821080,parent_id=None,name=__main__.task)>
task 2 <Span(id=14852889101994205982,trace_id=135462679929056883038997047737130036160,parent_id=10895955304451890772,name=__main__.task)>

What is the result that you expected?

I expected <Span(id=3001407150615082204,trace_id=135462679929056883030652120662978821080,parent_id=None,name=__main__.task)> to have the correct parent.

mklokocka avatar Mar 07 '24 02:03 mklokocka

@mklokocka Thanks for the report and the simple reproduction case. We'll take a look and see if there's a straightforward way to fix this issue in ddtrace.

emmettbutler avatar Mar 07 '24 21:03 emmettbutler

Also worth noting: this issue can also happen when creating an asyncio task from a non-coroutine context. For example, I set up some data-loading tasks in a tracer-wrapped `__init__` of a class by calling `asyncio.create_task`. A span for `__init__` is created, but I cannot use the yield trick to delay the end of the `__init__` execution and thus prolong the span until the tasks actually start, since there is nothing to await in a sync method.
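
A minimal sketch of that scenario (the class and its names are illustrative):

import asyncio

from ddtrace import tracer


class Loader:
    @tracer.wrap()
    def __init__(self):
        # The __init__ span finishes as soon as __init__ returns, but the
        # task only starts at the next await point in the surrounding code.
        # A sync method cannot await, so the sleep(0) trick does not apply.
        self._task = asyncio.create_task(self._load())

    @tracer.wrap()
    async def _load(self):
        await asyncio.sleep(0.1)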

As I said, I am not 100% certain there is a way to properly support this in DD. It depends on whether parenting spans under spans that finished before they started would break something.

mklokocka avatar Mar 07 '24 22:03 mklokocka

Hi @mklokocka,

We are actually deprecating the ddtrace asyncio integration in the next release (v2.8.0): https://github.com/DataDog/dd-trace-py/pull/8612. The asyncio integration and async context provider were added to support using asyncio and contextvars on Python < 3.7. Since ddtrace v2.0 dropped support for Python < 3.7, there is no need to enable this integration.

If you are using ddtrace >= 2.0 or Python >= 3.7, you can disable the asyncio integration by setting the environment variable `DD_TRACE_ASYNCIO_ENABLED=False`.

Feel free to reopen this issue if this doesn't work for you 😄

mabdinur avatar Mar 11 '24 16:03 mabdinur

Hi @mabdinur, I am sorry, but this is not an issue with the asyncio integration, but rather with how context propagation with contextvars works. Removing the old integration shouldn't matter at all for what I have described.

I don't think I can reopen the issue.

mklokocka avatar Mar 11 '24 17:03 mklokocka

Sorry for the confusion,

You're right, the issue lies in `DatadogContextMixin` and not the deprecated `AsyncioContextProvider`. Calling `asyncio.sleep(0)` suspends the active coroutine and gives newly scheduled tasks an opportunity to run; the `tracer.wrap(..)` instrumentation only executes when a scheduled task actually starts. In other words, the Python tracer parents async tasks based on when the task is executed, NOT when it is created. I can see how this behavior is confusing, and it is likely a bug.

Using `asyncio.sleep(0)` to kick off scheduled tasks seems like a reasonable change; however, there might be edge cases and runtimes where this does not start ALL scheduled tasks. I think a better approach is to capture the current active span when a task is scheduled and ensure it is used to parent downstream tasks. I will open a PR to propose this new behavior.
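
Conceptually, the capture-at-schedule-time idea could look something like the sketch below (an illustration only, not the actual PR; `traced_task_factory` is a hypothetical name):

import asyncio

from ddtrace import tracer


def traced_task_factory(loop, coro, **kwargs):
    # Snapshot the span that is active when the task is *scheduled*.
    parent = tracer.current_span()

    async def _wrapped():
        if parent is not None:
            # Re-activate the captured context when the task *starts*.
            tracer.context_provider.activate(parent.context)
        return await coro

    # **kwargs absorbs the extra arguments (e.g. context=) that newer
    # Python versions pass to task factories.
    return asyncio.Task(_wrapped(), loop=loop, **kwargs)


# Installed per event loop:
# asyncio.get_running_loop().set_task_factory(traced_task_factory)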

Thanks again for bringing this issue to our attention

mabdinur avatar Mar 11 '24 20:03 mabdinur

Also consider the synchronous situation: you can call `asyncio.create_task(...)` for a traced coroutine from a normal, sync function. If that function is traced, its span will finish before the task is scheduled to run (and before the span for the scheduled task starts).

The question is whether Datadog supports spans that start after their parent span has already ended.

mklokocka avatar Mar 11 '24 21:03 mklokocka

Datadog supports starting a span after a parent has ended. This is a valid operation:

from ddtrace import tracer

# Start and immediately finish a parent span.
parent = tracer.trace("parent")
parent.finish()

# Explicitly parent a new span under the already-finished span.
child = tracer.start_span("child", child_of=parent)
child.finish()
[Screenshot from 2024-03-12: the resulting parent/child trace rendered in Datadog]

However, this is not the default behavior. By default, the currently active span becomes the parent when a new span is created.

mabdinur avatar Mar 12 '24 15:03 mabdinur

I looked at your PR. Hmm, I am not quite sure about it. First, there are also the lower-level `loop.create_task` and `asyncio.ensure_future` methods, which it does not cover. Second, I think you actually have to await the wrapped coro in `traced_coro`; otherwise you just return a reference.
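
Schematically, the missing-await concern looks like this (placeholder names, not the PR's actual code):

async def broken_wrapper(coro):
    return coro        # BUG: hands back the un-awaited coroutine object

async def fixed_wrapper(coro):
    return await coro  # correct: actually runs the wrapped coroutine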

What are the reasons to keep DatadogContextMixin._update_active as is?

mklokocka avatar Mar 12 '24 16:03 mklokocka

Sorry about that. The PR was in a rough state. It is finally reviewable and should address both of your concerns.

The duration of a parent span is often independent of its child spans. In async contexts all of the following traces are valid:

Case 1: Parent span ends after all child spans

[ parent 1 .......................... ]
   [ child 1 ]        [ child 2 ]

Case 2: Parent span ends before the child spans end

[ parent 1 ....... ]
   [ child 1 ]        [ child 2 ]

Case 3: Parent span ends before the child spans start

[ parent 1 ]
               [ child 1 ]   [ child 2 ]

To make this fix in `DatadogContextMixin._update_active` we would need to delay finishing a parent span until all spawned coroutines have started with the correct trace context. Python 3.12 introduced support for eagerly starting tasks, which provides an API for executing coroutines as soon as they're scheduled. Leveraging this API could let us support this parenting in `_update_active`, but the functionality does not exist in Python < 3.12 😞

tl;dr: Supporting this at span finish, and detecting when an async task starts executing, could be tricky. I think it is best to provide this functionality at span start by patching `BaseEventLoop`.
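
For reference, the Python 3.12 API in question is `asyncio.eager_task_factory`: with it installed, a task starts executing synchronously inside `create_task()`, while the scheduling span is still active, and only yields back at its first suspension point:

import asyncio


async def child():
    print("child runs before create_task() returns")
    await asyncio.sleep(0)


async def main():
    # Python 3.12+ only: eagerly start tasks on creation.
    asyncio.get_running_loop().set_task_factory(asyncio.eager_task_factory)
    t = asyncio.create_task(child())
    print("back in main")
    await t


asyncio.run(main())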

mabdinur avatar Mar 18 '24 14:03 mabdinur

So all three of asyncio.create_task, asyncio.ensure_future, and loop.create_task are gonna work?

How is this going to work with using a different event loop implementation, e.g. uvloop?

mklokocka avatar Mar 18 '24 17:03 mklokocka

@mabdinur Sorry to bother you even after closing, but shouldn't the PR be patching AbstractEventLoop somehow instead, to ensure compatibility with third-party event loops?

mklokocka avatar Mar 19 '24 02:03 mklokocka

@mklokocka Good point. Instrumenting `AbstractEventLoop.create_task` instead of `BaseEventLoop.create_task` would be ideal, but the ddtrace library does not support instrumenting abstract methods. ddtrace applies instrumentation to methods that actually get called; since `AbstractEventLoop.create_task` is unimplemented, and subclasses override it rather than call it, it can't be traced.
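
This is easy to see directly; attribute lookup on a real loop never resolves to the abstract method:

import asyncio

loop = asyncio.new_event_loop()
try:
    # Resolves to BaseEventLoop.create_task (or a third-party override),
    # never to the unimplemented AbstractEventLoop.create_task.
    print(type(loop).create_task is asyncio.AbstractEventLoop.create_task)  # False
finally:
    loop.close()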

We would need to introduce a new set of instrumentation to support this, for example a helper that wraps object initialization and ensures all instances of abstract classes get the expected instrumentation (however, this would add significant overhead to object creation). This is something we can look at as a separate feature request. I will create a card to track this. Thanks again for your insight 😄

mabdinur avatar Mar 19 '24 15:03 mabdinur