apm-agent-dotnet
[BUG] APM data including Transactions, Spans, CPU and Memory metrics are not sent when CPU utilization over 60%
I've recently updated the APM .NET Core agent from version 1.7.1 to 1.11.1, and after the update I see metrics like the picture below. Other APM data (e.g. transactions) seems to be sent correctly.
APM Agent version
1.11.1+ca11381c8840632c65925393de57797871b359c5
Environment
Docker, Elastic Cloud, APM 7.14
.NET Framework/Core name and version: .NET Core 3.1.21
Application Target Framework(s): netcoreapp3.1
Describe the bug
Metrics are not sent to Elasticsearch. However, I don't see rejections in the APM service, so my guess is that they are never sent by the agent.
To Reproduce
I don't have reproduction steps yet; I will work on a sample. It looks like when CPU utilization is high, the queue of metric events doesn't flush as frequently as it should, but this is only my guess. I have set "MetricsInterval": "10s" in the config.
Expected behavior
The metric charts are continuous lines.
Actual behavior
Metrics are not reported when CPU utilization is above 60%.
Hi @kbalys,
experimenting with the `MetricsInterval` is definitely a good idea - let us know if that changes anything, or if you see any pattern with it.
Another thing that comes to mind: in 1.11.0 we introduced the "Time spent by span type" metric, so that's an additional metric the agent collects. You can disable it by adding `span.self_time` to the `DisableMetrics` config - could you check whether that makes any difference?
Lastly, it'd also be useful to look at the agent logs - first on warning level, then set it to debug, and finally to trace (debug and trace will likely generate even more overhead, so that could influence the test; nevertheless, those detailed logs could be very useful).
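For reference, such an entry might look like this in appsettings.json (a sketch only - the section name follows the agent's standard `ElasticApm` configuration convention, and the exact pattern accepted by `DisableMetrics` should be checked against the configuration docs):

```json
{
  "ElasticApm": {
    "MetricsInterval": "10s",
    "DisableMetrics": "span.self_time*"
  }
}
```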
Hi @gregkalapos,
I will investigate this further, but I cannot reproduce it locally yet. As for `MetricsInterval`, it is already set to 10s in production.
The "Time spent by span type" metric works correctly, the same as the transaction data. See below: for the same timeframe, the data seems consistent.
Hi, after some investigation we managed to create a sample app that reproduces the bug: MissingMetrics.zip (our credentials were redacted from appsettings.json)
We have determined that the reason for the missing metrics is thread pool starvation. In the sample app we use this code to execute a background task:

```csharp
ThreadPool.QueueUserWorkItem<object>(
    callBack: s => ConsumeSingleCore(percentage, cancellationToken),
    state: null,
    preferLocal: false);
```

With the `preferLocal` parameter set to `false` (which is the default behavior in some of the `QueueUserWorkItem` overloads), the metrics are missing when CPU usage is high. However, when the parameter is set to `true`, we don't observe this behavior:
To simulate high CPU usage we created a CpuBurner class (included in the sample app) and burned the CPU with 100 threads at 95% usage for 60 seconds. We tried setting the APM log level to Trace, but as soon as we start burning the CPU, all logging stops.
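The CpuBurner class itself is only in the zip, not in the thread; a minimal sketch of what such a burner might look like, assuming a simple busy-spin/sleep duty cycle per worker thread (names and approach are illustrative, not the sample's actual code):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class CpuBurner
{
    // Burns roughly `utilization` (0.0-1.0) of each of `threads` cores for `duration`:
    // each worker busy-spins for the active part of a 100 ms period, then sleeps.
    public static void Burn(int threads, double utilization, TimeSpan duration)
    {
        const int periodMs = 100;
        int busyMs = (int)(periodMs * utilization);
        using var stop = new CancellationTokenSource(duration);

        var workers = new Thread[threads];
        for (int i = 0; i < threads; i++)
        {
            workers[i] = new Thread(() =>
            {
                var sw = new Stopwatch();
                while (!stop.IsCancellationRequested)
                {
                    sw.Restart();
                    while (sw.ElapsedMilliseconds < busyMs) { } // busy part: spin
                    Thread.Sleep(periodMs - busyMs);            // idle part: yield the core
                }
            })
            { IsBackground = true };
            workers[i].Start();
        }
        foreach (var w in workers) w.Join(); // returns once the duration elapses
    }
}
```

Something like `CpuBurner.Burn(100, 0.95, TimeSpan.FromSeconds(60))` would then approximate the load described above.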
Disabling `span.self_time` makes no difference.
Let me know if I can provide more information to help resolve this issue.
Hi @gregkalapos, have you managed to reproduce the issue? Is any quick workaround possible?
Please let me know if you need more details.
Hi @gregkalapos, any news on this issue?
Hi All, @gregkalapos,
I think this issue is more serious than I first thought. Not only are metrics not delivered, but other data such as transactions and spans are missing as well. Please compare the following two charts:
AWS load balancer statistics:
with
Whereas what I see in APM is:
The transaction throughput shape in APM appears to be cut off when CPU usage reaches its peaks, whereas the AWS chart correctly depicts the anticipated shape.
This is a very serious problem, as it means the APM transaction data sent by the agent is not reliable.
I have created another reproduction project, this time with a minimal amount of code extracted from the agent, mainly `BackendCommComponentBase` and `SingleThreadTaskScheduler`. You can observe that when we start burning the CPU, logging almost completely stops. ElasticIsFantastic.zip
Hi @gregkalapos, have you managed to reproduce the issue? Do you need anything more from us?
After our investigation, we think the problem is thread pool starvation: the thread that sends APM data to the server executes async tasks, which cannot run or complete when the system is very busy and there are no free threads in the thread pool. If we set `ThreadPool.SetMinThreads(Int32, Int32)` to a high value, or use `preferLocal` with `ThreadPool.QueueUserWorkItem`, the metrics are sent correctly. However, in a real application that uses a lot of libraries, we usually cannot change this.
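The `SetMinThreads` workaround mentioned above would be applied once at application startup; a minimal sketch, where the value 64 is purely illustrative (a high floor has its own costs, so it should be tuned per workload):

```csharp
using System.Threading;

public static class ThreadPoolTuning
{
    // Raises the worker-thread minimum so that bursts of work do not starve
    // background senders while the pool slowly injects new threads.
    public static void RaiseMinWorkerThreads(int minWorkers)
    {
        ThreadPool.GetMinThreads(out int workerMin, out int ioMin);
        if (workerMin < minWorkers)
            ThreadPool.SetMinThreads(minWorkers, ioMin); // keep the I/O floor unchanged
    }
}
```

Calling `ThreadPoolTuning.RaiseMinWorkerThreads(64);` early in `Main` (or `Startup`) would then approximate the mitigation described.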
Ideally, the agent should use a dedicated thread for its sending logic, so that we get metrics and transactions even under very high load.
Thanks for the reproducer - super useful, I see the problem locally as well.
> Ideally, the agent should use a dedicated thread for its sending logic, so that we get metrics and transactions even under very high load.
That was the original goal, but clearly that isn't the case anymore. I'll try to rewrite this part to make sure we use one dedicated thread.
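For illustration, a sender decoupled from the thread pool could be sketched like this - `DedicatedSender`, `TryEnqueue` and `SendBatch` are hypothetical names, not the agent's actual API; the point is that a single non-pool `Thread` drains a bounded queue synchronously, so pool starvation cannot stall it:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class DedicatedSender : IDisposable
{
    // Bounded queue: when full, TryAdd returns false and the event is dropped
    // instead of blocking the producer (mirrors a MaxQueueEventCount-style cap).
    private readonly BlockingCollection<object> _queue = new BlockingCollection<object>(1000);
    private readonly Thread _thread;

    public DedicatedSender()
    {
        // A dedicated, named, background thread - not a ThreadPool thread.
        _thread = new Thread(Loop) { IsBackground = true, Name = "ApmSender" };
        _thread.Start();
    }

    public bool TryEnqueue(object evt) => _queue.TryAdd(evt);

    private void Loop()
    {
        // Blocks until events arrive; ends when CompleteAdding is called.
        foreach (var evt in _queue.GetConsumingEnumerable())
            SendBatch(evt); // fully synchronous send: no Tasks scheduled here
    }

    private void SendBatch(object evt)
    {
        // Placeholder: serialize the event and POST it to the APM Server.
    }

    public void Dispose()
    {
        _queue.CompleteAdding();
        _thread.Join();
    }
}
```

The design choice is that even at 100% CPU, the OS scheduler still time-slices this thread, whereas a starved thread pool may not run queued work items or async continuations at all.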
Hi @gregkalapos,
do you have any update regarding this issue? We desperately need a fix for it. Is it possible to prioritize it?
https://github.com/elastic/apm-agent-dotnet/pull/1670 made a change that should fix the threading issue - with that PR, sending data happens on a dedicated thread (as it should have all along).
That fix will be in the next release - I don't know exactly when the release will happen, but it should be soon.
I'll keep this issue open to track whether the fix resolves it.
Hi @gregkalapos,
do you possibly know when the next release is planned?
We just released a version which includes a PR that addresses this issue: https://github.com/elastic/apm-agent-dotnet/releases/tag/v1.15.0
Could someone maybe give it a try and let me know if the situation improved?
Hi @gregkalapos,
thank you for the release. Unfortunately, I have bad news: the 1.15 release doesn't solve the issue. Please see my chart below - version 1.11.1 on the left and 1.15 on the right; the same problem is still there. Tested with agent.version 1.15.0+03d5f13cfbcb431ef0ee7cec6c88734dbf6d7540
https://github.com/elastic/apm-agent-dotnet/pull/1753 made some more changes around this, which were released in 1.17.0.
The main change is that the whole working loop that processes and sends events is now synchronous, and the agent avoids creating tasks and threads in the working loop where possible.
One thing to keep in mind: if the CPU is 100% (or close to 100%) utilized, then the single thread will likely not be able to keep up with the work. If the problem still exists, I'd suggest trying to tweak these settings:
- `MaxQueueEventCount`: This is the most relevant setting here. The default is 1000, which means the agent buffers 1k events; once the queue is full (e.g. because the agent can't send quickly enough - which I assume is the case here, given the high CPU utilization), the agent starts dropping events. If you increase it, the agent buffers more events before it starts dropping any. This results in higher peak memory usage, but potentially fewer dropped events.
- `MaxBatchEventCount`: The default is 10, and each HTTP request to APM Server contains the number of events set here. Under high load, when lots of events are created, it's worth increasing this. The higher the number, the bigger each HTTP request to APM Server will be - on the upside, under high load this results in fewer HTTP calls, and potentially lower CPU utilization.
- `TransactionSampleRate`: Decrease it in order to capture fewer events.
- `FlushInterval`: Probably less relevant, but maybe worth changing as well.
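To illustrate, those settings could be combined in appsettings.json like this (the values are examples for experimentation, not recommendations):

```json
{
  "ElasticApm": {
    "MaxQueueEventCount": 10000,
    "MaxBatchEventCount": 50,
    "TransactionSampleRate": 0.5,
    "FlushInterval": "10s"
  }
}
```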
Could someone please give it a try and report back whether the changes in 1.17.0, plus the settings above, helped?
Closing now - feel free to reopen.
Unfortunately, still not working on 1.17.
@gregkalapos I don't see an option to reopen this one, should I create a new issue?