
Performance problems with calls to Assembly::LoadFrom() from instrumented methods

Open mdomashchenko opened this issue 3 years ago • 2 comments

This call gets injected into the beginning of every instrumented method: https://github.com/newrelic/newrelic-dotnet-agent/blob/813b8210523c2c50942df3da119f4008a86a25dd/src/Agent/NewRelic/Profiler/MethodRewriter/FunctionManipulator.h#L269

In our case, the affected methods are SqlCommand.ExecuteReaderAsync and HttpClient.SendAsync, which are executed constantly.

Unfortunately, that quickly hits this lock statement in AssemblyLoadContext https://github.com/dotnet/runtime/blob/70d20f1445127d6473283047d0a551ccf85f897f/src/libraries/System.Private.CoreLib/src/System/Runtime/Loader/AssemblyLoadContext.cs#L342

From time to time, this appears to cause a lock convoy (https://devblogs.microsoft.com/oldnewthing/20061212-00/?p=28753).

I have a dump file showing 77 threads waiting on that _unloadLock object, with no owner.
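For illustration only, the contention pattern described above can be sketched as many threads funneling through one process-wide lock. This is a minimal stand-in, not the agent's or the runtime's actual code; the names are invented:

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

std::mutex g_loadLock;           // stands in for AssemblyLoadContext's _unloadLock
std::atomic<long> g_loads{0};

void InstrumentedCall() {
    // Every instrumented method funnels through the same process-wide lock,
    // serializing threads that are otherwise doing independent work.
    std::lock_guard<std::mutex> guard(g_loadLock);
    ++g_loads;                   // stands in for the assembly-load work
}

long RunWorkers(int threads, int callsPerThread) {
    g_loads = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([callsPerThread] {
            for (int i = 0; i < callsPerThread; ++i) InstrumentedCall();
        });
    for (auto& th : pool) th.join();
    return g_loads.load();
}
```

With many threads and short critical sections, threads pile up behind the lock faster than it is released, which is the convoy behavior visible in the dump.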

Could you please cache those types instead of loading them on every execution of every instrumented method?

mdomashchenko avatar Dec 02 '21 21:12 mdomashchenko

This has a similar root cause to issue #533, which is currently marked closed, I believe pending feedback from the customer who was having the problem in that case.

Regarding caching the types, we do in fact do this for .NET Framework applications, but it just moves the problem somewhere else.

Ideally we should find a solution to this problem that works for both the framework and core versions of the agent.

nr-ahemsath avatar Dec 03 '21 18:12 nr-ahemsath

https://issues.newrelic.com/browse/NEWRELIC-3670

Jira CommentId: 127624 Commented by chynes:

This is a difficult problem to solve. With the current architecture design, when a call is made to the Agent API or an instrumented function is called, our Profiler (the native component) needs to dynamically load our Core DLL in order to find the right methods to inject into the target application. Our code supports two different ways of doing this:

  1. Cache the method info in AppDomain. (Note that this solution is currently only implemented for .NET Framework.) This reduces the number of assembly loads, which are expensive. However, in applications with lots of concurrent threads calling into the Agent, the threads all need to fight over the same cache lock, resulting in slowdown.
  2. Load our Core DLL to do the method lookup every time it's needed. This allows each thread to operate independently, without synchronization. However, in applications with few threads that call the same instrumented functions at high volume, all the assembly loads add up, resulting in slowdown.
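The trade-off between the two strategies can be sketched as follows. This is a simplified illustration with invented names, not the agent's actual implementation; the `std::map` stands in for the AppDomain-based cache and `MethodInfo{key}` stands in for the expensive dynamic load:

```cpp
#include <map>
#include <mutex>
#include <string>

struct MethodInfo { std::string name; };

// Strategy 1: process-wide cache. The expensive load happens once per key,
// but every thread must contend on g_cacheLock for every call.
std::map<std::string, MethodInfo> g_cache;
std::mutex g_cacheLock;

MethodInfo CachedLookup(const std::string& key) {
    std::lock_guard<std::mutex> guard(g_cacheLock);
    auto it = g_cache.find(key);
    if (it == g_cache.end())
        it = g_cache.emplace(key, MethodInfo{key}).first;  // expensive, once
    return it->second;
}

// Strategy 2: no shared state. Threads never block each other, but the
// full load cost (Assembly::LoadFrom plus reflection) is paid on every call.
MethodInfo LoadEveryTime(const std::string& key) {
    return MethodInfo{key};  // expensive, every time
}
```

Which strategy wins depends on the workload: many contending threads favor strategy 2, while few threads making high-volume calls favor strategy 1, matching the two scenarios described above.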

Different customer environments may exhibit one scenario or the other, making a general solution difficult.

What I hoped would work was to use thread-local storage for the cache. This would let applications with many concurrent threads avoid sharing resources, while applications that call the same function at high volume would still benefit from a cache. While I was successful in getting this to work in a test app, there are limitations to what the Profiler can inject into the target application that I have so far been unable to work around.
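The thread-local-storage idea can be sketched like this (a hypothetical illustration, not the prototype mentioned above): each thread keeps its own cache, so the hot path never touches a shared lock, while repeated calls on the same thread still hit the cache.

```cpp
#include <string>
#include <unordered_map>

struct MethodInfo { std::string name; };

MethodInfo ExpensiveLoad(const std::string& key) {
    return MethodInfo{key};  // stands in for the dynamic assembly load
}

MethodInfo TlsLookup(const std::string& key) {
    // thread_local: one cache instance per thread, so no synchronization
    // is needed and threads never contend with each other.
    thread_local std::unordered_map<std::string, MethodInfo> cache;
    auto it = cache.find(key);
    if (it == cache.end())
        it = cache.emplace(key, ExpensiveLoad(key)).first;  // per-thread, once
    return it->second;
}
```

The cost is memory (one cache per thread) and a cold first call on each thread, but neither of the slowdown scenarios above applies. The hard part, per the comment, is not the cache itself but getting the Profiler to inject this kind of construct into the target application.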

My backup plan was to use an environment variable to choose between the two solutions. However, because the caching solution is not implemented for .NET Core, there is a nontrivial amount of work to do: it would require changing how the Profiler works on .NET Core, which adds risk.
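The backup plan amounts to a simple runtime switch, roughly like the sketch below. The variable name is hypothetical; no such setting exists in the agent:

```cpp
#include <cstdlib>
#include <string>

// Returns true if the (hypothetical) opt-in variable selects the
// cached-lookup strategy; otherwise fall back to load-every-time.
bool UseCachedLookup() {
    const char* value = std::getenv("NEW_RELIC_METHOD_CACHE");  // invented name
    return value != nullptr && std::string(value) == "1";
}
```

This would let each customer pick the strategy that suits their workload, at the cost of maintaining both code paths on both runtimes.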

Jira CommentId: 128214 Commented by chynes:

I have not been able to make progress on this. The mechanism we use to inject code into applications is different for .NET Framework and .NET Core applications, and my attempts to duplicate the caching code used for .NET Framework applications have been unsuccessful. Due to the complexity of the Profiler, I am not certain if there is a technical limitation that makes this impossible, or if my approach is incorrect.

There may be a path forward I'm unable to see, but it's unlikely we will be able to solve this without significant additional research.

@mdomashchenko Thank you again for bringing this issue to our attention. We have spiked on the problem space; however, we will need more time to home in on a solution. An additional spike was created so the devs can continue to make forward progress. We are closing this ticket as won't-fix until that spike is resolved. The spike is planned for this quarter.

angelatan2 avatar Jan 24 '23 19:01 angelatan2

Hello Team,

Is there any further progress on this issue? Of late, we have been observing issues with our apps too.

Thank you for checking in. We have done additional research and believe we have an approach that would address the issue in .NET Framework. There is still some additional work needed to clean up the prototype and get it to production quality. As for .NET Core/.NET, the approach we have in mind might also work, but since the two runtimes are very different, we would need to dig deeper with an additional spike.

We don't currently have the .NET Core/.NET spike or the .NET Framework prototype work scheduled, though it is on our radar.

jaffinito avatar Jun 13 '23 22:06 jaffinito