newrelic-dotnet-agent
Unexpected thread usage increase
Description
Something is causing web applications instrumented by the agent to periodically use roughly 500 threads instead of around 50-60 threads.
Expected Behavior
We expect thread usage to increase when the agent is running because of the native threads required for certain .NET profiling API calls, metric sampling (CPU, Memory, Garbage Collection, Thread Info), sending data to New Relic, and continuation-based async timing, but the agent should not increase the thread count by a large amount.
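To put numbers on that expectation, a lightweight sampler like the sketch below can be run in a test application to timestamp when the jump from ~50-60 to ~500 threads happens. This is an illustrative, hypothetical snippet, not part of the agent; `Process.Threads` counts OS-level threads, so it includes the agent's native threads as well as managed ones.

```csharp
// Illustrative sampler (not agent code): log the OS-level thread count of the current
// process once a minute so the baseline vs. spike can be correlated with load and GC.
using System;
using System.Diagnostics;
using System.Threading;

class ThreadCountLogger
{
    static void Main()
    {
        using var timer = new Timer(_ =>
        {
            using var proc = Process.GetCurrentProcess();
            // Process.Threads includes native threads (profiler, GC, etc.), not just managed ones.
            Console.WriteLine($"{DateTime.UtcNow:O} threads={proc.Threads.Count}");
        }, null, TimeSpan.Zero, TimeSpan.FromMinutes(1));

        Console.ReadLine(); // keep the sampler alive while the load test runs
    }
}
```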
Troubleshooting or NR Diag results
See this internal link for some of the troubleshooting that has been done previously to try to discover what is causing the spike in thread usage. The link details which agent features and instrumentation were disabled, as well as the agent settings that were changed, to try to minimize the increase in thread usage.
Steps to Reproduce
N/A
Your Environment
N/A
Additional context
I've seen this thread explosion from time to time with some test applications during performance testing. However, I do not yet understand what exactly triggers it. Is it just a combination of circumstances such as:
1. A certain amount of load within the application
2. A certain amount of load on the system running the application
3. Samplers, harvests, completing transactions, and async continuations all needing to execute around the same time
4. A backlog of work that was delayed by a blocking GC? (See the thread-pool injection sketch after this list.)
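As a way to probe hypotheses 3 and 4, the sketch below shows generic .NET thread-pool behavior rather than anything agent-specific: when queued work items block instead of completing, the pool's starvation-avoidance logic injects additional worker threads, which is one plausible path to a sudden jump in thread count after a GC pause or a burst of simultaneous work. `ThreadPool.ThreadCount` requires .NET Core 3.0 or later.

```csharp
// Generic .NET ThreadPool behavior (not agent code): queueing many blocking work items
// starves the pool, and the runtime responds by injecting extra worker threads.
using System;
using System.Threading;
using System.Threading.Tasks;

class ThreadInjectionDemo
{
    static void Main()
    {
        Console.WriteLine($"pool threads before: {ThreadPool.ThreadCount}");

        // Far more blocking work items than cores; none of them yield back to the pool.
        var tasks = new Task[200];
        for (int i = 0; i < tasks.Length; i++)
            tasks[i] = Task.Run(() => Thread.Sleep(5000));

        Thread.Sleep(10000); // give the pool time to react to the starvation
        Console.WriteLine($"pool threads after:  {ThreadPool.ThreadCount}");

        Task.WaitAll(tasks);
    }
}
```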
12/10/2021 - Let's do a sanity check first on this problem.
Based on my testing during Firefly, I was not able to see an increase as described.
I have left a test environment running with the infrastructure agent as described in this ticket and have a meeting scheduled with @nrcventura to review the results tomorrow before closing.
Here is the application under load: https://staging-one.newrelic.com/nr1-core/apm-nerdlets/overview/MjczMDcwfEFQTXxBUFBMSUNBVElPTnwxODk4NjI0Nw?account=273070&duration=10800000&filters=%28domain%20%3D%20%27APM%27%20AND%20type%20%3D%20%27APPLICATION%27%29&state=eea5a570-092c-9dfb-1d55-6a86f2c51aed
Here is the infrastructure query to show thread count for the process being instrumented/stressed: https://staging-one.newrelic.com/infra?account=273070&duration=10800000&state=33915c44-c65d-902c-69f3-19d44b8a6c4d
It is processing about 8k requests per minute, so any problematic behavior from the agent should be visible.
Summary of Findings
My general observation is that the .NET runtime will create as many threads as the OS can spare in high throughput applications. This increase does occur more quickly with the .NET Agent attached to an application, but I did not observe radically different maximum thread counts with/without the agent attached over time.
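One way to tell whether that growth comes from the managed worker pool rather than the agent's native threads would be to cap the pool and rerun the load test; if the process still climbs toward ~500 threads, the pool is not the source. The sketch below is a hedged outline of that experiment (it was not run for these findings), and the cap value of 100 is illustrative.

```csharp
// Sketch of an experiment (not run for these findings): cap the managed worker pool,
// then compare the process thread count under the same load with and without the cap.
using System;
using System.Threading;

class PoolCapExperiment
{
    static void Main()
    {
        ThreadPool.GetMaxThreads(out int workerMax, out int iocpMax);
        Console.WriteLine($"default max: worker={workerMax}, iocp={iocpMax}");

        // Bound worker threads well below the observed ~500-thread spike.
        // SetMaxThreads fails if the value is below the processor count or the current minimum.
        if (!ThreadPool.SetMaxThreads(100, iocpMax))
            Console.WriteLine("SetMaxThreads was rejected; choose a larger cap");

        // ...start the load test here and watch Process.GetCurrentProcess().Threads.Count.
    }
}
```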
After reviewing multiple performance counters related to overall application performance and CPU usage, my assertion is that the root cause of this issue is lock contention introduced by the agent. Without the agent attached, I saw a maximum of <100 contended locks per second in a high throughput application. With the agent attached, lock contention peaked at >30,000 per second. Microsoft has also flagged this metric as out of spec for high performance applications when customers have engaged them to troubleshoot application performance.
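Those contention rates can be sampled in-process on .NET Core 3.0+ via `Monitor.LockContentionCount`, the same count surfaced by the System.Runtime counters in dotnet-counters. The sketch below shows how such a per-second comparison can be made; it is not necessarily the tooling used for these findings.

```csharp
// Minimal sketch (not necessarily the tooling used for these findings): sample the
// cumulative Monitor.LockContentionCount once a second and print the per-second delta.
// Run under the same load with and without the agent attached to compare the two rates.
using System;
using System.Threading;

class ContentionSampler
{
    static void Main()
    {
        long previous = Monitor.LockContentionCount;
        while (true)
        {
            Thread.Sleep(1000);
            long current = Monitor.LockContentionCount;
            Console.WriteLine($"lock contentions/sec: {current - previous}");
            previous = current;
        }
    }
}
```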
The agent makes heavy use of interlocked counters and explicit locks to guard collections. If the lock contention could be reduced, I believe there would be less performance overhead.
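As a purely illustrative example of the kind of change that comment points at (this is not the agent's actual code), the first pattern below funnels every writer through a single monitor, which is exactly what the contention counter measures, while the second records the same data with a lock-free queue and an interlocked counter.

```csharp
// Hypothetical illustration only (not the agent's code): a lock-guarded list serializes
// every writer on one monitor, which shows up directly in the lock contention counter.
// A ConcurrentQueue plus an Interlocked counter keeps the hot path contention-free.
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

class MetricBuffer
{
    // Contended pattern: every Add from every request thread takes the same lock.
    private readonly object _lock = new object();
    private readonly List<double> _guarded = new List<double>();

    public void AddGuarded(double value)
    {
        lock (_lock) { _guarded.Add(value); }
    }

    // Lower-contention pattern: lock-free enqueue plus an interlocked counter;
    // a harvest thread can drain the queue without blocking request threads.
    private readonly ConcurrentQueue<double> _queue = new ConcurrentQueue<double>();
    private long _count;

    public void AddLockFree(double value)
    {
        _queue.Enqueue(value);
        Interlocked.Increment(ref _count);
    }
}
```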
This Issue has been marked stale after 90 days with no activity. It will be closed in 30 days if there is no activity.
https://issues.newrelic.com/browse/NEWRELIC-5587
Jira CommentId: 118742 Commented by chynes:
We can use this as a spike for general thread contention/related performance issues
Closed as not planned based on the findings from Josh on Feb 9.