
Increased CPU usage after upgrading to 2.11.0

Open lukashovancik opened this issue 2 years ago • 12 comments

Describe the bug

After upgrading the Datadog tracer from version 2.4.2 to 2.11.0, I noticed after 3 weeks a significant increase in CPU usage by my service. After downgrading to version 2.4.2, the problem disappeared. I also ran an experiment with version 2.9.0, and its behaviour was the same as 2.4.2, so the problem was likely introduced in 2.10.0 or 2.11.0.

Screenshots

[Screenshots: CPU usage graphs, 2022-07-20]

Runtime environment (please complete the following information):

  • Docker
  • Tracer version: 2.11.0
  • OS: Linux (Debian 11)
  • .NET 6
  • ~~Nuget~~

lukashovancik avatar Jul 20 '22 14:07 lukashovancik

Maybe the CPU increase is related to the CPU and Exception profilers that were added in version 2.10.0.

lukashovancik avatar Jul 20 '22 15:07 lukashovancik

Hello @lukashovancik, thanks for reporting the issue. We'll look on our end to see if we can reproduce it easily.

Regarding your setup, when you specify NuGet, do you mean you use custom instrumentation (i.e. the Datadog.Trace package), or automatic instrumentation using the beta NuGet package Datadog.Monitoring.Distribution? In the first case, do you happen to have the tracer installed on the server as well for automatic instrumentation, and if so, which version are you using?
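(For clarity, "custom instrumentation" means calling the Datadog.Trace API directly from application code, roughly like the minimal sketch below; the operation and tag names are made up for illustration.)

using Datadog.Trace;

// Custom instrumentation: manually create a span around an operation.
using (var scope = Tracer.Instance.StartActive("orders.process"))
{
    scope.Span.SetTag("example.tag", "value"); // hypothetical tag
    // ... application work ...
}

Automatic instrumentation, by contrast, relies on the native CLR profiler and environment variables, with no code changes.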

Also, since you mentioned the CPU and Exception profilers: have you enabled them by setting DD_PROFILING_ENABLED?

Thanks

pierotibou avatar Jul 21 '22 07:07 pierotibou

Hello @pierotibou ,

Actually, it's not a NuGet package but the Datadog .NET tracer installed on Debian. Yes, we do have profiling enabled.

# Enable the continuous profiler and runtime metrics
ENV DD_PROFILING_ENABLED=true
ENV DD_RUNTIME_METRICS_ENABLED=true

# Attach the Datadog CLR profiler for automatic instrumentation
ENV CORECLR_ENABLE_PROFILING=1
ENV CORECLR_PROFILER={846F5F1C-F9AE-4B07-969E-05C26BC060D8}
ENV CORECLR_PROFILER_PATH=/opt/datadog/Datadog.Trace.ClrProfiler.Native.so

# Tracer home and integration definitions
ENV DD_INTEGRATIONS=/opt/datadog/integrations.json
ENV DD_DOTNET_TRACER_HOME=/opt/datadog

Thank you.

lukashovancik avatar Jul 21 '22 09:07 lukashovancik

DD_PROFILING_ENABLED was also enabled for 2.4.2.

lukashovancik avatar Jul 21 '22 09:07 lukashovancik

DD_PROFILING_ENABLED was also enabled for 2.4.2.

The continuous profiler for Linux was first made available with 2.10.0, so this setting would have no effect on previous versions.

It seems very likely that the continuous profiler is the cause of the overhead, but we have never observed more than a ~5% CPU increase in our internal test environments. There's probably something specific about your setup that we're not testing.

Would it be possible for you to test two more scenarios? (The corresponding configurations are sketched after the list.)

  • Set DD_PROFILING_ENABLED to false
  • Set DD_PROFILING_ENABLED to true and DD_PROFILING_CODEHOTSPOTS_ENABLED to false
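In the Dockerfile shown earlier, the two scenarios would look roughly like this (a sketch; everything else stays unchanged):

# Scenario 1: profiler fully disabled
ENV DD_PROFILING_ENABLED=false

# Scenario 2: profiler enabled, Code Hotspots disabled
ENV DD_PROFILING_ENABLED=true
ENV DD_PROFILING_CODEHOTSPOTS_ENABLED=false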

Meanwhile we're trying to repro the issue in our environments. Any information that you're willing to share about your app could help (number of CPU cores, memory usage, number of threads...).

Also, could you open a Zendesk support ticket? It will make things easier if we need to share private information.

kevingosse avatar Jul 21 '22 11:07 kevingosse

@kevingosse I can do both scenarios and open a support ticket with more information about our infrastructure.

lukashovancik avatar Jul 21 '22 13:07 lukashovancik

Just to circle back on this, what is the size of your containers? The profiler currently has a fixed overhead (about 200 ms of CPU time per second on Linux, i.e. 0.2 CPU). This is fine in most cases, but it can become very significant in small containers. For instance, in a container with a limit of 1 CPU, that would be a 20% overhead.

We are working on making that overhead adaptive in the coming versions, but for now that's something to consider when enabling the profiler.
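As a back-of-the-envelope illustration of how that fixed ~0.2 CPU cost scales with container size (the container limits here are hypothetical):

using System;

// Relative overhead = fixed profiler cost / container CPU limit.
const double fixedOverheadCpu = 0.2; // ~200 ms of CPU time per second

foreach (var cpuLimit in new[] { 0.5, 1.0, 2.0, 4.0 })
{
    Console.WriteLine($"{cpuLimit} CPU limit -> ~{fixedOverheadCpu / cpuLimit * 100:F0}% overhead");
}

A 0.5-CPU container would already sit around 40%, consistent with the 40-60% range reported later in this thread.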

kevingosse avatar Aug 02 '22 10:08 kevingosse

Here are some results (version 2.11.0):

  • ENV DD_PROFILING_ENABLED=false
    [Screenshot: CPU usage with profiling disabled, 2022-08-03]
  • ENV DD_PROFILING_ENABLED=true and ENV DD_PROFILING_CODEHOTSPOTS_ENABLED=false
    [Screenshot: CPU usage with profiling enabled and Code Hotspots disabled, 2022-08-03]

@kevingosse Sorry for the wait, I didn't have time to get back to you sooner.

lukashovancik avatar Aug 03 '22 19:08 lukashovancik

Indeed, our containers' CPU limits are relatively low, so I see your point. Anyway, it looks like this is no longer an issue with the profiler disabled.

There is one more thing. I noticed in production that the application's thread count sometimes went up drastically. After downgrading Datadog to a lower version, it never occurred again. The reason you see only one spike in the screenshot below is that we managed to prevent the other spikes by restarting the container whenever the monitor triggered an alert about the high thread count.

[Screenshot: thread count spike, 2022-08-03]

lukashovancik avatar Aug 03 '22 19:08 lukashovancik

@lukashovancik for the thread spike issue, could it be that you didn't set the LD_PRELOAD environment variable?

LD_PRELOAD=/opt/datadog/continuousprofiler/Datadog.Linux.ApiWrapper.x64.so

If it's omitted then it can cause some deadlocks, which would explain the thread spike. Starting with 2.13.0 we automatically disable the profiler if the variable is not properly set, to prevent this kind of issue.
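For a Dockerfile-based setup like the one shown earlier in this thread, that means adding the line below next to the other ENV entries (assuming the same /opt/datadog layout):

# Required by the continuous profiler; omitting it can cause deadlocks (see above)
ENV LD_PRELOAD=/opt/datadog/continuousprofiler/Datadog.Linux.ApiWrapper.x64.so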

kevingosse avatar Aug 22 '22 10:08 kevingosse

Experiencing the same thing. The profiler is adding significant CPU usage to containers, in the range of 40-60%.

seanamos avatar Sep 20 '22 20:09 seanamos

@kevingosse Yes, I didn't set the LD_PRELOAD variable, so maybe that's the reason.

lukashovancik avatar Sep 21 '22 10:09 lukashovancik

@lukashovancik To circle back on this, how is it going currently? Which version are you using?

gleocadie avatar Feb 20 '23 14:02 gleocadie

I'm going to close this one for now, feel free to reopen if you're still seeing issues with recent versions!

andrewlock avatar Aug 31 '23 11:08 andrewlock