thread count keeps growing using ddtrace==2.14.2
Hello everyone,
We are using ddtrace in production on our Celery workers, and the workers seem to be creating extra threads.
We first hit the issue when we upgraded all of our unpinned dependencies, so we rolled back and tested them one by one. The Celery version is 5.3.6, we also use celery-beat, and our worker pool should be the multiprocessing one.
Our Python version is 3.11, by the way.
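For reference, the worker setup is roughly shaped like the sketch below (the app name, broker URL, and task are placeholders, not our real code), started under ddtrace-run:

```python
# tasks.py -- illustrative sketch of the setup, not our actual code.
# Workers are started with something like:
#   ddtrace-run celery -A tasks worker --pool=prefork --concurrency=4
#   ddtrace-run celery -A tasks beat
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # placeholder broker

@app.task
def example_task(x: int, y: int) -> int:
    # Placeholder task; real tasks do ordinary I/O and CPU work.
    return x + y
```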
Let me know if I can give additional details or if you can't reproduce.
Regards
👋 thank you for the report!
Can you confirm the ddtrace version which is causing issues?
Also, I can't quite see the top of the dashboard, but it looks like this is a view of Docker Threads--is that correct? Would you be willing to share whether there is also an increase in the total number of processes (PIDs) in the container? This would help us narrow down which category of behavior to look at.
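If it helps, something along these lines run inside the container would capture both counts (just an illustrative /proc-reading snippet, not a Datadog tool):

```python
# Illustrative only: count processes (PIDs) and OS-level threads inside the
# container, to compare against the thread metric on the dashboard.
import os

def container_counts() -> tuple[int, int]:
    pids = [d for d in os.listdir("/proc") if d.isdigit()]
    threads = 0
    for pid in pids:
        try:
            # On Linux, /proc/<pid>/task has one entry per OS thread.
            threads += len(os.listdir(f"/proc/{pid}/task"))
        except FileNotFoundError:
            # Process exited between the listing and the read.
            continue
    return len(pids), threads

if __name__ == "__main__":
    n_pids, n_threads = container_counts()
    print(f"processes: {n_pids}, os threads: {n_threads}")
```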
Not sure if it's related, but after upgrading to Python 3.12 and ddtrace==2.14.2 with profiling enabled, I started to observe increased CPU utilization, and the profiler has been showing thousands of exceptions being thrown.
After I disabled profiling, the CPU consumption decreased:
Thrown exceptions during the same time frame:
I'm also observing the same CPU utilization and thrown-exception symptoms. The thrown-exceptions symptom I'm seeing in my setup is identical--it's similarly attributed to _worker at thread.py:L#89 and the _queue.Empty exception type.
I've tried various ddtrace library versions (2.3.0, 2.8.0, 2.14.0, and 2.17.0) and none of them seem to have fixed it. I've also tried setting DD_PROFILING_STACK_V2_ENABLED=false per other recommendations, and experimented with DD_PROFILING_EXPORT_LIBDD_ENABLED=false as well.
Looking for any recommendation or resolution, thanks!
@micahjz thank you for the report. I received another report about this today; investigating.
Just to double-check, when you wrote
DD_PROFILING_STACK_V2_ENABLED=false
Did you mean
DD_PROFILING_STACK_V2_ENABLED=true?
Generally, for overhead-related issues our recommendation has been to enable the "stack v2" feature. That said, stack v2 does not yet support exception profiles, so part of the symptom will be obscured by this feature.
Hi @sanchda.
Ah, perhaps I misread the recommendation in the ticket here (https://github.com/DataDog/dd-trace-py/issues/9205) for addressing segmentation faults.
I've re-enabled it with DD_PROFILING_STACK_V2_ENABLED=true, and the thrown exceptions are gone, as you indicated. The degradation isn't severe, but I still see some performance regressions after switching to ddtrace==2.17.0 with DD_PROFILING_STACK_V2_ENABLED=true.
The performance degradation symptoms are:
This is using Python 3.12.7 in a testing environment with no traffic. At the :53 minute mark, we switched to ddtrace==2.17.0 with DD_PROFILING_STACK_V2_ENABLED=true.
From the image, top to bottom:
- higher user and system CPU utilization. It's hard to see, but the system CPU utilization changes from <10ms to ~50ms, an almost 5x increase.
- increase in overall CPU utilization percentage from 2% to 3-4% (despite no traffic for this service in the testing environment)
- increase in context switches, especially involuntary
- increase in p90 latency, which probably reflects various spans, health checks, etc.
@micahjz thank you for checking it out and for the comprehensive report. I think there might be something actionable here, but I need to run some of my own experiments to see what I can confirm.
Just to verify--do you know whether you're observing the issue where the total thread count is increasing when profiling is enabled?
I don't believe I'm seeing the total thread count symptom. Thanks for the assistance here @sanchda!
@micahjz are there any details you can share about the structure of your application? Is it using threads? Tasks? One short-lived thread per request? One short-lived ConcurrentFuture worker per request? A static worker pool? A single process? Any insight will help us investigate.
To be clear, generally speaking we're pretty comfortable having a small overhead--like the kind you observe--for CPU metrics. Even an increase in involuntary context switches is something we can likely explain and accept.
I'm very confused by the 2-3ms increase in p90 latency for an idle process. I can't really explain that. I doubt you're running a configuration where you have a CPU quota tuned very carefully to the baseline of an idle process, so it's not like the extra CPU time is time-slicing (e.g., at the level of the Linux scheduler) with the request handler. The v2 stack sampler primarily runs in a separate native thread, so a lot of this overhead does not serialize with the cpython runtime (unlike the "v1" stack sampler, which needed to take the GIL in order to perform a sample).
The application is a FastAPI / uvicorn web server within a single process. It uses asyncio as the event loop. Each endpoint is an async function, and every request should be handled by a short-lived coroutine. We generally don't use any multithreading. Infrequently (and not in an idle server) we use the asyncio run_in_executor functionality, which will run something in a separate thread within a ThreadPoolExecutor. Otherwise, everything should run within a single process and single thread.
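For reference, the shape of the service is roughly the following sketch (endpoint names and the blocking function are placeholders, not our actual code), served by uvicorn:

```python
# main.py -- illustrative sketch of the app shape described above.
# Served with something like: uvicorn main:app
import asyncio
from concurrent.futures import ThreadPoolExecutor

from fastapi import FastAPI

app = FastAPI()
_executor = ThreadPoolExecutor(max_workers=4)  # placeholder pool size

def blocking_work() -> str:
    # Occasional blocking work that we push off the event loop.
    return "done"

@app.get("/health")
async def health() -> dict:
    # Typical endpoint: a short-lived coroutine per request.
    return {"status": "ok"}

@app.get("/heavy")
async def heavy() -> dict:
    # Infrequent path: run blocking work in the ThreadPoolExecutor.
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(_executor, blocking_work)
    return {"result": result}
```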
That said, the Datadog service page for this service in our testing environment indicates a total of ~400 threads across the 3 running pods.
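If it's useful, this is the kind of snippet I can run inside a pod to see what those threads actually are (purely illustrative):

```python
# Illustrative: list the live Python threads in this process by name, to help
# tell ddtrace/profiler-owned threads apart from application ones.
import threading

for t in sorted(threading.enumerate(), key=lambda t: t.name):
    print(f"{t.name:<40} daemon={t.daemon} alive={t.is_alive()}")
```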
@sanchda I wanted to add this clarification just in case I wasn't clear. In the details and screenshots above, we were already running with profiling enabled. At the :53 minute mark, we changed from ddtrace==1.8.0 to ddtrace==2.17.0 with DD_PROFILING_STACK_V2_ENABLED=true. When you say some overhead is expected, I want to make sure it's clear that we were already running with profiling enabled; we were just upgrading the version of ddtrace. I'm not surprised that there's overhead from running profiling--I'm surprised by this additional degree of overhead when upgrading from ddtrace==1.8.0 to ddtrace==2.17.0. Is this additional overhead penalty expected?
I want to make sure it's clear that we were already running with profiling enabled
Appreciate the clarification--thank you--I misunderstood the original message to mean that the overhead was relative to baseline.
No, this is very much not our expectation for the stack v2 feature.
This issue has been automatically closed after a period of inactivity. If it's a feature request, it has been added to the maintainers' internal backlog and will be included in an upcoming round of feature prioritization. Please comment or reopen if you think this issue was closed in error.