
[BUG] APM data including Transactions, Spans, CPU and Memory metrics is not sent when CPU utilization is over 60%

Open kbalys opened this issue 3 years ago • 15 comments

I've recently updated the APM .NET Core agent from version 1.7.1 to 1.11.1, and after the update I see gaps in the metrics, as in the picture below. Other APM data (e.g. transactions) seems to be sent correctly. image

APM Agent version

1.11.1+ca11381c8840632c65925393de57797871b359c5

Environment

Docker, Elastic Cloud, APM 7.14

.NET Framework/Core name and version NET Core 3.1.21 Application Target Framework(s) netcoreapp3.1

Describe the bug

Metrics are not sent to Elasticsearch. However, I don't see rejections in the APM service, so my guess is that they are never sent from the agent.

To Reproduce

I don't have steps to reproduce yet; I will work on a sample. It looks like when CPU utilization is high, the queue of metric events doesn't flush as frequently as it should, but this is only my guess. I have set "MetricsInterval": "10s" in the config.

Expected behavior

The metric charts are continuous lines.

Actual behavior

Metrics are not reported when CPU utilization is above 60%.

kbalys avatar Nov 16 '21 13:11 kbalys

Hi @kbalys,

experimenting with the MetricsInterval is definitely a good idea - let us know if that changes anything, or you see any pattern with that.

Another thing that comes to my mind is that in 1.11.0 we introduced the "Time spent by span type" metrics - so that's an additional metric which the agent collects. You can disable it by adding span.self_time to the DisableMetrics config - could you maybe check if that makes any difference?
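For reference, a minimal appsettings.json sketch of the configuration being discussed. This assumes the standard ElasticApm configuration section; the exact DisableMetrics value and wildcard syntax should be checked against the agent documentation:

```json
{
  "ElasticApm": {
    "MetricsInterval": "10s",
    "DisableMetrics": "span.self_time"
  }
}
```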

As a last thing, it'd also be useful to look at the agent logs - first on warning level, then set it to debug and finally to trace (debug and trace will likely generate even more overhead, so that could influence the test... nevertheless those detailed logs could be very useful).

gregkalapos avatar Nov 16 '21 14:11 gregkalapos

Hi @gregkalapos,

I will investigate this more; however, I cannot reproduce it locally yet. As for MetricsInterval, I have it set to 10s in production.

The "Time spent by span type" metric is working correctly, the same as transaction data. See below for the same timeframe - the data seems to be consistent. image image

kbalys avatar Nov 16 '21 15:11 kbalys

Hi, after some investigation we managed to create a sample app that reproduces the bug: MissingMetrics.zip (our credentials were redacted from appsettings.json)

We have determined that the reason for the missing metrics is thread starvation. In the sample app we use this code to execute a background task:

ThreadPool.QueueUserWorkItem<object>(
    callBack: s => ConsumeSingleCore(percentage, cancellationToken),
    state: null,
    preferLocal: false);

With the preferLocal parameter set to false (which is the default in some of the QueueUserWorkItem overloads), the metrics are missing when CPU usage is high. However, when the parameter is set to true, we don't observe this behavior: image

To simulate high CPU usage we created a CpuBurner class (present in the sample app) and burned the CPU with 100 threads at 95% usage for 60 seconds. We tried setting the APM log level to Trace, but as soon as we start burning the CPU, all logging stops.
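The actual CpuBurner class is in the sample app zip; a hypothetical sketch of that kind of duty-cycle burner (illustration only, not the original code - the class name, method, and parameters here are assumptions) might look like:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class CpuBurner
{
    // Starts `threadCount` background threads, each of which busy-loops for
    // `percentage` ms out of every 100 ms window, for the given duration.
    public static void Burn(int threadCount, int percentage, TimeSpan duration)
    {
        for (var i = 0; i < threadCount; i++)
        {
            new Thread(() =>
            {
                var total = Stopwatch.StartNew();
                while (total.Elapsed < duration)
                {
                    var window = Stopwatch.StartNew();
                    // Busy part of the 100 ms window.
                    while (window.ElapsedMilliseconds < percentage) { }
                    // Idle part of the window.
                    Thread.Sleep(100 - percentage);
                }
            }) { IsBackground = true }.Start();
        }
    }
}
```

For the scenario described above, this would be invoked as `CpuBurner.Burn(100, 95, TimeSpan.FromSeconds(60));`.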

Disabling span.self_time makes no difference. LMK if I can provide more information to help resolve this issue.

skuc-fp avatar Dec 02 '21 11:12 skuc-fp

Hi @gregkalapos, have you managed to reproduce the issue? Is any quick workaround possible?

Please let me know if you need more details.

kbalys avatar Dec 15 '21 22:12 kbalys

HI, @gregkalapos, any news on this issue?

skuc-fp avatar Jan 17 '22 09:01 skuc-fp

Hi All, @gregkalapos,

I think this issue is more serious than I first thought. Not only metrics, but also other data such as transactions and spans, are not delivered. Please compare the following two charts. AWS load balancer statistics: image

Whereas what I see in APM is: image

The throughput shape for transactions in APM seems to be cut off when CPU usage reaches its peaks, whereas the AWS diagram correctly depicts the anticipated shape.

This is a very serious problem, as it looks like the APM transaction data sent by the agent is not reliable.

kbalys avatar Jan 20 '22 10:01 kbalys

I have created another project that reproduces the issue, this time with a minimal amount of code extracted from the agent - mainly BackendCommComponentBase and SingleThreadTaskScheduler. You can observe that when we start burning CPU, all logging almost completely stops. ElasticIsFantastic.zip

skuc-fp avatar Jan 27 '22 13:01 skuc-fp

Hi @gregkalapos, have you managed to reproduce the issue? Do you need anything more from us?

After our investigation we think the problem is thread pool starvation: the thread which sends APM data to the server executes some async task which cannot run or complete when the system is very busy and there are no free threads in the thread pool. If we set ThreadPool.SetMinThreads(Int32, Int32) to a high value, or use preferLocal for ThreadPool.QueueUserWorkItem, the metrics are sent correctly. However, in a real application which uses a lot of libraries we usually cannot change this.
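A minimal sketch of the SetMinThreads workaround mentioned above, to be called once at application startup. This is an illustration of the mitigation, not a general recommendation - the value 100 and the class/method names are assumptions:

```csharp
using System;
using System.Threading;

public static class ThreadPoolWorkaround
{
    // Raises the thread pool's minimum worker thread count so that queued
    // work items (such as the agent's sending logic) can still obtain a
    // thread under heavy load. Call early, e.g. at the top of Program.Main.
    public static void RaiseThreadPoolFloor()
    {
        ThreadPool.GetMinThreads(out var minWorker, out var minIo);
        ThreadPool.SetMinThreads(Math.Max(minWorker, 100), minIo);
    }
}
```

Note that raising the minimum only affects how quickly the pool injects new threads; it does not remove the underlying starvation, which is why a dedicated thread in the agent is the proper fix.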

Ideally, the APM agent should use a dedicated thread to execute its sending logic, so that we can get metrics and transactions even under very high load.

kbalys avatar Feb 01 '22 10:02 kbalys

Thanks for the reproducer - super useful, I see the problem locally as well.

Ideally, the apm should use dedicated thread to execute its sending logic so we can get metrics and transaction also under a very high load.

That was the original goal, which clearly isn't the case anymore. I'll try to rewrite this part to make sure we use one dedicated thread.

gregkalapos avatar Feb 16 '22 22:02 gregkalapos

Hi @gregkalapos,

do you have any update regarding this issue? We desperately need a fix for it. Is it possible to prioritize it?

kbalys avatar Mar 04 '22 14:03 kbalys

https://github.com/elastic/apm-agent-dotnet/pull/1670 made a change that should fix the threading issue - with that PR, sending data will happen on a dedicated thread (as it should have been anyway).

That fix will be in the next release - I don't know exactly when the release will happen, but it should be soon.

I'll keep this open to track whether it fixes the issue.

gregkalapos avatar Mar 30 '22 09:03 gregkalapos

Hi @gregkalapos,

do you possibly know when the next release is planned?

kbalys avatar May 11 '22 11:05 kbalys

We just released a version which includes a PR that addresses this issue: https://github.com/elastic/apm-agent-dotnet/releases/tag/v1.15.0

Could someone maybe give it a try and let me know if the situation improved?

gregkalapos avatar May 12 '22 12:05 gregkalapos

Hi @gregkalapos,

thank you for the release. Unfortunately, I have bad news: the 1.15 release doesn't solve the issue. Please see my diagram below - version 1.11.1 on the left and 1.15 on the right - the same problem is still there. Tested with agent version: 1.15.0+03d5f13cfbcb431ef0ee7cec6c88734dbf6d7540

image

kbalys avatar May 19 '22 12:05 kbalys

https://github.com/elastic/apm-agent-dotnet/pull/1753 made some more changes around this, which were released in 1.17.0.

The main change is that the whole working loop that processes and sends events is now synchronous, and the agent avoids creating tasks and threads in the working loop where possible.

One thing to keep in mind is that if the CPU is at (or close to) 100% utilization, then the single thread will likely not be able to keep up with the work. If the problem still exists, I'd suggest trying to tweak these settings:

  • MaxQueueEventCount: This is the most relevant setting here. The default is 1000, which means the agent will buffer 1k events; once the queue is full (e.g. because the agent is unable to send quickly enough - which I assume is the case here with the high CPU utilization), the agent will start dropping events. If you increase it, the agent will buffer more events before it starts dropping them. This results in higher peak memory usage, but potentially fewer dropped events.
  • MaxBatchEventCount: The default is 10; each HTTP request to APM Server will contain this number of events. Under high load, when lots of events are created, it's worth increasing this. The higher the number, the bigger each HTTP request to APM Server will be - on the upside, under high load this results in fewer HTTP calls and potentially less CPU utilization.
  • TransactionSampleRate: Decrease it in order to capture fewer events.
  • FlushInterval: Probably less relevant, but maybe it's worth trying to change this as well.
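A sketch of how these settings might look together in appsettings.json, assuming the standard ElasticApm configuration section - the values below are illustrative starting points, not recommendations:

```json
{
  "ElasticApm": {
    "MaxQueueEventCount": 5000,
    "MaxBatchEventCount": 50,
    "TransactionSampleRate": "0.2",
    "FlushInterval": "10s"
  }
}
```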

Could someone please give it a try and report back if the changes in 1.17.0, plus the settings above helped?

gregkalapos avatar Sep 19 '22 09:09 gregkalapos

Closing now - feel free to reopen.

gregkalapos avatar Sep 28 '22 15:09 gregkalapos

Unfortunately, it's still not working on 1.17.

image

image

kbalys avatar Sep 29 '22 21:09 kbalys

@gregkalapos I don't see an option to reopen this one, should I create a new issue?

kbalys avatar Sep 29 '22 21:09 kbalys