
Experiencing a Memory Leak

sluebbert opened this issue on Feb 26, 2020 · 10 comments

Describe the bug
Several, but not all, tested .NET Framework services have what appears to be a memory leak.

To Reproduce
Steps to reproduce the behavior:

  1. Grab an existing .NET Framework Web API service.
  2. Install the NuGet package Elastic.Apm.AspNetFullFramework 1.3.0.
  3. Add to the web.config modules section: <add name="ElasticApmModule" type="Elastic.Apm.AspNetFullFramework.ElasticApmModule, Elastic.Apm.AspNetFullFramework" />
  4. Add to the web.config app settings section: <add key="ElasticApm:ServerUrls" value="http://apmhost:8200" /> <add key="ElasticApm:ServiceName" value="ServiceName" /> <add key="ElasticApm:ServiceVersion" value="2020.02.25.01" /> <add key="ElasticApm:Environment" value="Production" /> <add key="ElasticApm:TransactionSampleRate" value="0.25" />
  5. In controller methods that take parameters in the path, on the first line of the method body, replace the parameter values in the transaction name with hard-coded placeholders so that transactions group better (see the sketch after this list).
  6. Deploy the service to IIS 10 on Server 2016.
  7. Send 10-20 requests per minute to the service.
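
Step 5 looks roughly like the snippet below at the top of each affected action method (a sketch based on the code shared later in this thread; accountNumber stands in for whatever route parameter the action receives):

// Rewrite the current transaction's name so that requests with different
// route parameter values group under a single transaction name.
var transaction = Elastic.Apm.Agent.Tracer.CurrentTransaction;
if (transaction != null)
    transaction.Name = transaction.Name.Replace(accountNumber, "{accountNumber}");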

Expected behavior
Memory usage should stay unchanged or rise only a bit, not grow without bound.

I have tried to recreate this issue in our dev environment, even sending thousands of requests a minute to speed things up, but have not been able to reproduce it. We are experiencing this on 2 of the 3 .NET Framework services we have in production; 2 .NET Core services in production don't seem to have the issue. We have since pulled back 1 of the 2 from production and just today tried the 2nd one in production to confirm the results.

The one deployed today only had the steps above as the single changeset.

No lag in traffic to the APM server is observed. All transactions seem to be flowing to the APM server quickly, and the transaction rates showing up there are as expected, so nothing suggests that anything is getting clogged. I thought of #387, but wouldn't that only be applicable if events were getting clogged?

As of typing this issue out, the remaining prod service has a memory footprint of over 2 GB. IIS worker pool snapshot (screenshot: iisMemoryUsage): the highlighted row represents the service we deployed today; the yellow highlight a few rows below represents the other .NET Framework service that has been out there for about a week now with no issues. The service does not share its app pool. Restarting the service or recycling the app pool immediately resets the memory usage.

Kibana server memory usage (screenshot: memoryTrend): the drawn line marks when we promoted the problem service to production.

We first encountered this issue last Friday and rolled back the service to investigate. We were tipped off by high memory usage on the server, CPU spikes, and timeouts for clients trying to talk to that service. APM logs from that service at that time showed some spans of 36 seconds and higher, even though other external data sources indicated those spans should have taken much less time (under a second). Rolling back resolved the issue.

I'll continue trying stuff out in our dev environment to try to recreate it there.

Please advise

sluebbert avatar Feb 26 '20 22:02 sluebbert

It is interesting that even commenting out the module in the web.config (<add name="ElasticApmModule" type="Elastic.Apm.AspNetFullFramework.ElasticApmModule, Elastic.Apm.AspNetFullFramework" />) still results in a memory leak. As expected, no transactions or metrics get sent to the APM server when this happens.

I'm guessing that the APM agent and diagnostic DLLs loaded in memory (because we ask for the current transaction in code) continue to cause this? Is that even possible? Is there static initialization happening, for example?

I'll keep poking around.

sluebbert avatar Feb 27 '20 21:02 sluebbert

We now have a few other working services to compare with. The services that are not hogging memory are those that merely enable the APM agent and nothing else.

Each service that grabs a ton of memory over time is one that accesses the public transaction or span API. For example:

// Rename the current transaction so requests group by a single template name.
var tran = Agent.Tracer.CurrentTransaction;
if (tran != null)
	tran.Name = tran.Name.Replace(accountNumber, "{accountNumber}");

That code is called even if the module is removed from the web.config, so I wonder if that narrows it down.

sluebbert avatar Feb 27 '20 22:02 sluebbert

Thanks for all the info @sluebbert.

That code is called even if the module is removed from the web.config, so I wonder if that narrows it down.

Yes, it narrows it down, or at least it gives me some ideas. That code gets called, which means the agent is indeed initialized: for example, it will collect metrics, plus of course all the transactions and spans you capture manually.

I'll dig deeper into this.

gregkalapos avatar Feb 28 '20 09:02 gregkalapos

Thanks for checking it out.

I seem to have recreated it by creating a new .NET Framework Web API service within this repo's source code. I just:

  • added a new service and added references to the required projects
  • did not add the module reference to the web.config
  • added the above code snippet to ValuesController.Get()
  • sent 10 requests a second to the controller for several hours (I'm not sure the request rate affects anything)

After a couple of hours, the private bytes reported by Visual Studio's diagnostic tools show 230 MB in use. I think I remember that starting at about ~110 MB; I can't see that far back and don't want to restart the test yet.

(Screenshot "diag": Visual Studio diagnostic tools memory graph.)

I'll leave it running for now to see if it keeps growing. I'm not experienced with these tools and rarely watch memory usage this closely, so I don't know how much growth is normal in debug mode.

[EDIT] As of now, private bytes is up to 356 MB, almost 8 hours in.

sluebbert avatar Feb 28 '20 17:02 sluebbert

Hi @sluebbert,

I'm quite sure this is related to GC metrics collection.

Workaround

e.g.: set ELASTIC_APM_DISABLE_METRICS=clr.gc.*

For now, I'd say the easiest fix is to disable GC metrics collection for the Windows .NET Framework apps where this happens. To do that, use the DisableMetrics setting; the easiest way is to set the ELASTIC_APM_DISABLE_METRICS environment variable. Here is the list of all metrics: if you disable everything related to GC (which is what clr.gc.* matches), the agent won't start the ETW session. After some testing, that seems to fix the problem.
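
If environment variables are inconvenient, the same option can presumably also be set in the web.config app settings, assuming the ElasticApm:<OptionName> key convention used for the other settings earlier in this thread:

<!-- Disable all GC-related metrics so the agent does not start the ETW session. -->
<add key="ElasticApm:DisableMetrics" value="clr.gc.*" />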

Side-effect of the workaround

With this you won't have GC metrics. We don't have a built-in chart for GC metrics yet, so unless you created a chart for them yourself, you should not see any difference, except of course that the memory leak should go away.

Some more details

This also explains why you only see this on Full Framework and not on .NET Core: on Full Framework we collect GC metrics with ETW, on .NET Core we collect them with EventSource.
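
For context on what the ETW path involves, here is a simplified, illustrative sketch of collecting GC metrics over ETW with the Microsoft.Diagnostics.Tracing.TraceEvent library (this is not the agent's actual code, and the session name is made up); the relevant detail is that the real-time session and its event source keep buffers and handlers alive until they are stopped and disposed:

using System;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Parsers;
using Microsoft.Diagnostics.Tracing.Session;

class GcEtwSketch
{
    static void Main()
    {
        // Requires elevation; real-time ETW sessions are machine-wide kernel objects.
        using (var session = new TraceEventSession("IllustrativeGcMetricsSession"))
        {
            // Ask only for CLR GC events.
            session.EnableProvider(
                ClrTraceEventParser.ProviderGuid,
                TraceEventLevel.Informational,
                (ulong)ClrTraceEventParser.Keywords.GC);

            // Handle GC heap statistics as they arrive.
            session.Source.Clr.GCHeapStats += data =>
                Console.WriteLine($"Gen0 size: {data.GenerationSize0} bytes");

            // Blocks and pumps events until the session is stopped/disposed.
            session.Source.Process();
        }
    }
}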

I'll work on a fix; that will of course be the long-term solution for this issue.

Sorry about the trouble caused by this.

gregkalapos avatar Mar 04 '20 18:03 gregkalapos

@gregkalapos,

Thanks for the quick responses. I added that setting to the web config for one of our impacted services and it seems to have done the trick. I haven't seen any abnormal growth for several hours now.

I'll keep an eye on it over the next few days to verify.

sluebbert avatar Mar 04 '20 20:03 sluebbert

#750 with the fix is already on master. I'll keep this open for the next few days to make sure all is good; if so, I'll close it, and the fix will be in the next release.

gregkalapos avatar Mar 05 '20 17:03 gregkalapos

A few days later and still no issues with memory.

sluebbert avatar Mar 06 '20 13:03 sluebbert

Hi @sluebbert,

A new agent version (1.4.0) just went out, and it contains the fix for this issue. With it you can remove the ELASTIC_APM_DISABLE_METRICS setting, and everything should be fine with the default configs.

If you have some time, could you maybe verify that this is indeed the case with your app?

I'm closing this now; feel free to just respond here. If something is not OK, we can reopen this issue anytime.

gregkalapos avatar Mar 20 '20 11:03 gregkalapos

Hi,

I am experiencing the same issue using Elastic.Apm.AspNetFullFramework 1.12.1 in two .NET Framework applications.

These are my settings:

Application 1:

<add key="ElasticApm:ServiceName" value="my-app-1" />
<add key="ElasticApm:ServerUrl" value="https://my-apm-server" />
<add key="ElasticApm:SecretToken" value="my-secret-token" />
<add key="ElasticApm:TransactionSampleRate" value="0.5" />

<modules runAllManagedModulesForAllRequests="true">
  <add name="ElasticApmModule" type="Elastic.Apm.AspNetFullFramework.ElasticApmModule, Elastic.Apm.AspNetFullFramework" />
</modules>

Application 2:

<add key="ElasticApm:ServiceName" value="my-app-2" />
<add key="ElasticApm:ServerUrl" value="https://my-apm-server" />
<add key="ElasticApm:SecretToken" value="my-secret-token" />
<add key="ElasticApm:TransactionSampleRate" value="0.5" />

The cause seems to be the GC metrics. I can tell by running a memory profiler and seeing ETWTraceEventSource retaining a lot of memory. The memory profiler also reports an event handler leak for ElasticApmModule, but doesn't provide more details.

I will try the temporary fix described above, but this is still an active issue that needs to be reinvestigated.

Here you can see how the memory increased over time (screenshot: Memory_usage). In the first segment we ran the applications with a 1.0 transaction sample rate, and then reduced it to 0.5 (seen in the second segment), thinking it might help (this was before analyzing a memory dump). As you can see, reducing the transaction sample rate helped a bit, but there is still a memory leak.

Please let me know if I can help with more details in the investigation.

Arclight3 avatar Feb 25 '22 18:02 Arclight3

I believe this is addressed in https://github.com/elastic/apm-agent-dotnet/pull/750 and it's a fairly old issue.

I'm closing this now. If this happens again, please make sure you are on a fairly up-to-date agent version.

gregkalapos avatar Sep 28 '22 15:09 gregkalapos

I believe this is addressed in #750 and it's a fairly old issue.

I'm closing this now. If this happens again, please make sure you are on a fairly up-to-date agent version.

Hello @gregkalapos, Elastic APM is causing a memory leak in high-throughput data-processing microservices on .NET 6, using version 1.22.0 of Elastic.Apm.NetCoreAll and Elastic.Apm.StackExchange.Redis, and is consuming a lot of the server's RAM.

Thanks in advance

I tried to optimize the APM performance settings per the documentation (https://www.elastic.co/guide/en/apm/agent/dotnet/current/performance-tuning.html#performance-tuning-stack-traces):

"TransactionSampleRate": 0.2,
"StackTraceLimit": 0,
"SpanStackTraceMinDuration": "10ms",
"CaptureHeaders": false,
"MetricsInterval": "1s",
"Enabled": true
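
For reference, these options typically live under the ElasticApm section of appsettings.json when the agent reads configuration through ASP.NET Core's IConfiguration; a minimal sketch, assuming that layout:

{
  "ElasticApm": {
    "TransactionSampleRate": 0.2,
    "StackTraceLimit": 0,
    "SpanStackTraceMinDuration": "10ms",
    "CaptureHeaders": false,
    "MetricsInterval": "1s",
    "Enabled": true
  }
}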

But there was no significant improvement

eduvpp avatar Dec 20 '23 02:12 eduvpp