dd-trace-dotnet
Increased CPU usage after upgrading to 2.11.0
Describe the bug
After upgrading Datadog from version 2.4.2 to version 2.11.0, I noticed over the following 3 weeks a significant increase in CPU usage by my service. After downgrading to version 2.4.2 the problem disappeared. I also ran an experiment with version 2.9.0 and the behaviour was the same as 2.4.2, so the problem was likely introduced in 2.10.0 or 2.11.0.
Screenshots: [CPU usage graphs]
Runtime environment (please complete the following information):
- Docker
- Tracer version: 2.11.0
- OS: Linux (Debian 11)
- .NET 6
- ~~NuGet~~
Maybe the increase in CPU usage is related to the fact that you added the CPU and Exception profilers in version 2.10.0.
Hello @lukashovancik, thanks for reporting the issue. We'll look on our end to see if we can reproduce it easily.
Regarding your setup, when you specify NuGet, do you mean you use custom instrumentation (i.e. `Datadog.Trace`), or automatic instrumentation using the beta NuGet package `Datadog.Monitoring.Distribution`? In the first case, would you happen to have installed the tracer on the server as well for automatic instrumentation, and if so, which version are you using?
Also, since you mentioned the CPU and Exception profilers, have you enabled them by setting `DD_PROFILING_ENABLED`?
Thanks
Hello @pierotibou,
actually it's not a NuGet package but the Datadog .NET tracer installed on Debian. Yes, we do have profiling enabled.
ENV DD_PROFILING_ENABLED=true
ENV DD_RUNTIME_METRICS_ENABLED=true
ENV CORECLR_ENABLE_PROFILING=1
ENV CORECLR_PROFILER={846F5F1C-F9AE-4B07-969E-05C26BC060D8}
ENV CORECLR_PROFILER_PATH=/opt/datadog/Datadog.Trace.ClrProfiler.Native.so
ENV DD_INTEGRATIONS=/opt/datadog/integrations.json
ENV DD_DOTNET_TRACER_HOME=/opt/datadog
Thank you.
`DD_PROFILING_ENABLED` was also enabled for 2.4.2.
The continuous profiler for Linux was first made available with 2.10.0, so this setting would have no effect on previous versions.
It seems very likely that the continuous profiler is the cause of the overhead, but we have never observed more than a ~5% CPU increase in our internal test environments. There's probably something specific about your setup that we're not testing.
Would it be possible for you to test two more scenarios?
- Set `DD_PROFILING_ENABLED` to `false`
- Set `DD_PROFILING_ENABLED` to `true` and `DD_PROFILING_CODEHOTSPOTS_ENABLED` to `false` (both configurations are sketched below)
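For convenience, here is a minimal sketch of the two test configurations, written as Dockerfile ENV lines in the same style as the setup shared above (one scenario per deployment):

```
# Scenario 1: continuous profiler fully disabled
ENV DD_PROFILING_ENABLED=false

# Scenario 2: profiler enabled, but Code Hotspots disabled
ENV DD_PROFILING_ENABLED=true
ENV DD_PROFILING_CODEHOTSPOTS_ENABLED=false
```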
Meanwhile we're trying to repro the issue in our environments. Any information that you're willing to share about your app could help (number of CPU cores, memory usage, number of threads...).
Also, could you open a zendesk support ticket? It will make things easier if we need to share private information.
@kevingosse I can do both scenarios and open a support ticket with more information about our infrastructure.
Just to circle back on this, what is the size of your containers? The profiler currently has a fixed overhead (about 200ms of CPU time per second on Linux, or 0.2 CPU). This is fine in most cases, but it can become very significant in small containers. For instance in a container with a limit of 1 CPU, that would be 20%.
We are working on making that overhead adaptive in the coming versions, but for now that's something to consider when enabling the profiler.
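As a rough illustration of that math (the container sizes below are hypothetical examples, not the reporter's actual limits):

```
# Fixed profiler overhead on Linux: ~200 ms of CPU time per second, i.e. ~0.2 CPU
# Relative overhead = 0.2 / CPU limit
#   limit 4.0 CPUs -> ~5%
#   limit 1.0 CPU  -> ~20%
#   limit 0.5 CPU  -> ~40%
docker run --cpus="1.0" my-service   # hypothetical 1-CPU container: the fixed 0.2 CPU is ~20% of the budget
```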
Here are some results (version 2.11.0):
- `ENV DD_PROFILING_ENABLED=false`
- `ENV DD_PROFILING_ENABLED=true` and `ENV DD_PROFILING_CODEHOTSPOTS_ENABLED=false`
@kevingosse Sorry for the wait, I didn't have time to get back to you sooner.
Indeed, our CPU limit in the containers is relatively low, and I see your point. Anyway, it looks like with the profiler disabled this is not an issue anymore.
There is one more thing. I noticed in production that the application's thread count would sometimes go up drastically. After downgrading Datadog to a lower version it never occurred again. The reason you see only one spike on the screenshot below is that we managed to prevent other spikes by restarting the container after the monitor triggered an alert about the high thread count.
[thread count screenshot]
@lukashovancik for the thread spike issue, could it be that you didn't set the `LD_PRELOAD` environment variable?
LD_PRELOAD=/opt/datadog/continuousprofiler/Datadog.Linux.ApiWrapper.x64.so
If it's omitted then it can cause some deadlocks, which would explain the thread spike. Starting with 2.13.0 we automatically disable the profiler if the variable is not properly set, to prevent this kind of issue.
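For anyone hitting the same thing, a sketch of what the ENV block quoted earlier in this thread would look like with the `LD_PRELOAD` line added (paths assume the same /opt/datadog install location used above):

```
ENV LD_PRELOAD=/opt/datadog/continuousprofiler/Datadog.Linux.ApiWrapper.x64.so
ENV DD_PROFILING_ENABLED=true
ENV DD_RUNTIME_METRICS_ENABLED=true
ENV CORECLR_ENABLE_PROFILING=1
ENV CORECLR_PROFILER={846F5F1C-F9AE-4B07-969E-05C26BC060D8}
ENV CORECLR_PROFILER_PATH=/opt/datadog/Datadog.Trace.ClrProfiler.Native.so
ENV DD_INTEGRATIONS=/opt/datadog/integrations.json
ENV DD_DOTNET_TRACER_HOME=/opt/datadog
```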
Experiencing the same thing. The profiler is adding significant CPU usage to containers, in the range of 40-60%.
@kevingosse yes, I didn't set the LD_PRELOAD variable. So maybe that's the reason.
@lukashovancik To circle back on this matter, how is it going currently? Which version are you using?
I'm going to close this one for now, feel free to reopen if you're still seeing issues with recent versions!