Large unmanaged memory growth (leak?) when upgrading from .NET 6 to 8

Open SamWilliamsGS opened this issue 1 year ago • 129 comments

Description

We have a few different services hosted on Kubernetes running on .NET. When we try to upgrade from .NET 6 to .NET 8, we see a steep but constant increase in memory usage, almost all of it in unmanaged memory. It seems to level off at around four times the memory usage of .NET 6, ignoring imposed memory limits, then continues to creep up more slowly depending on workload. So far we haven't seen an upper bound on the amount of unmanaged memory being leaked(?) here. We haven't been able to reproduce the problem in a minimal way so far, but we do have lots of data gathered about it. 🙂

Configuration

.NET 8, from the docker image mcr.microsoft.com/dotnet/aspnet:8.0, running on x86-64 machines on AWS EC2.

Regression?

Yes, see data below. This issue does not occur on .NET 6, only on 8. We think it might be part of the GC changes from .NET 6 to 7. Give us a shout and we can try to narrow this down by running it on .NET 7.

Data

Initially we switched from .NET 6 to .NET 8 and monitored memory usage using Prometheus metrics. This is what the memory usage graphs look like. Both pods actually reached the 512MB limit we'd imposed and were restarted; after that we reverted to .NET 6 and things went back to normal. On .NET 6, memory usage remained consistently around ~160MB, but as soon as we deployed the upgrade to .NET 8 the memory increased without limit and the pods were restarted once at 15:30 after hitting 512MB; once we returned to .NET 6, things went back to normal. image

We then tried increasing the available memory from 512MB to 1GB and re-deployed .NET 8. It increased rapidly as before, then levelled off at about 650MB and stayed that way until midnight. Service load increases drastically around that time, and the memory grew again to about 950MB, where it stayed relatively level again until the service was unwittingly redeployed by a coworker. At that point we reverted back to .NET 6, where it went back to the lower memory level. I think it would have passed the 1GB memory limit after another midnight workload, but we haven't tested that again (yet). image

After trying and failing to reproduce the issue using local containers, we re-deployed .NET 8 and attached the JetBrains dotMemory profiler to work out what was happening. This is the profile we collected, showing the unmanaged memory increases. Interestingly, the amount of managed memory actually goes down over time, with GCs becoming more frequent; presumably .NET knows the available memory is running low as the total approaches 1GB. There also seem to be some circumstances where .NET will not allocate from unmanaged memory, since the spikes near the left-hand side mirror each other for managed and unmanaged. We had to stop the profile before reaching the memory limit, since Kubernetes would have restarted the pod and the profile would have been lost. image

And the Prometheus memory usage graph, for completeness (one pod is higher than the other because it was running the dotMemory profiler this time, and drops because of detaching the profiler): image

Analysis

The only issue we could find that looked similar was this one, which also affects ASP.NET services running in Kubernetes after moving to .NET 7: https://github.com/dotnet/runtime/issues/92490. As it's memory-related, we suspect this might be related to the GC changes going from .NET 6 to 7. We haven't been able to get a clean repro (or any repro outside our hosted environments) yet, but please let us know if there's anything we can do to help narrow this down. 🙂

SamWilliamsGS avatar Dec 12 '23 16:12 SamWilliamsGS

Tagging subscribers to this area: @dotnet/gc. See info in area-owners.md if you want to be subscribed.

Issue Details

Author: SamWilliamsGS
Assignees: -
Labels:

tenet-performance, area-GC-coreclr, untriaged

Milestone: -

ghost avatar Dec 12 '23 17:12 ghost

We think it might be part of the GC changes from .NET 6 to 7

Does setting export DOTNET_GCName=libclrgc.so (this reverts to the old GC) fix this issue?
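
For reference, a minimal sketch of trying this in a container (the deployment name is hypothetical; libclrgc.so is assumed to be shipped next to libcoreclr.so in the aspnet:8.0 image):

# set the variable on an existing Kubernetes deployment
kubectl set env deployment/my-service DOTNET_GCName=libclrgc.so

# or locally, before launching the app
export DOTNET_GCName=libclrgc.so
dotnet MyService.dll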

MichalPetryka avatar Dec 12 '23 18:12 MichalPetryka

It could also be related to this issue if it's continuous memory growth: https://github.com/dotnet/runtime/pull/95362. Are you able to collect some GCCollectOnly traces so we can diagnose further?
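
For reference, a minimal sketch of collecting such a trace on Linux (flag names as documented for perfcollect and dotnet-trace; double-check against the tools' help output for your versions):

# option 1: perfcollect, GC-collect-only events, runs until Ctrl+C
curl -OL https://aka.ms/perfcollect && chmod +x perfcollect
sudo ./perfcollect install          # one-time setup of perf/LTTng prerequisites
sudo ./perfcollect collect gctrace -gccollectonly

# option 2: dotnet-trace attached to the process by PID
dotnet-trace collect -p <PID> --profile gc-collect -o gctrace.nettrace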

mangod9 avatar Dec 12 '23 22:12 mangod9

We are experiencing the same issue and are running on 7.0.4. From the graph below you can see that the native memory goes up while the GC heap size stays flat: image Then the pods recycle, either from a release or from manual intervention. I will attempt rolling back to the old GC, but it is peak season so management may not allow it. We are also in a security-hardened environment, so running any profilers and/or diagnostic tools would likely take an act of congress; I am eager to see how your testing goes since it appears you are able to do more profiling.

taylorjonl avatar Dec 13 '23 14:12 taylorjonl

Sorry for the slow response here. Good to hear we're not the only ones seeing this @taylorjonl!

@MichalPetryka we tried the old GC setting but unfortunately no dice, the memory graphs look the same as before 😒. We'll try out more of the suggestions here after new year but I'm on holiday until then, so this issue might go quiet for a bit. Thanks everyone for the help so far! image

SamWilliamsGS avatar Dec 15 '23 11:12 SamWilliamsGS

Maybe it's W^X or #95362 like mentioned before. Can you try export DOTNET_EnableWriteXorExecute=0?

MichalPetryka avatar Dec 15 '23 17:12 MichalPetryka

W^X should not cause unbounded native memory growth. There are other sources of native growth I have seen while debugging similar issues for customers in the past few months:

  • Using an XmlSerializer created with the constructor that takes an XmlAttributeOverrides argument. See https://learn.microsoft.com/en-us/dotnet/api/system.xml.serialization.xmlserializer?view=net-8.0#dynamically-generated-assemblies for the reasons.
  • A leak from OpenSSL-related code - this is currently being investigated, as it occurs in some rare cases and it is not yet clear what the culprit is.

It is also possible that the native memory leak is caused by a tiny GC memory leak - a case where a tiny managed object holds a large block of native memory alive. You would not see such a leak on the GC memory graph. That is also a possible cause related to OpenSSL, where the runtime uses SafeHandle-derived types to reference possibly large data structures - like client certificates - allocated by OpenSSL. I've seen cases where a certificate chain was up to 1GB large.

To try to figure out the culprit, it would be helpful to take a dump of the running process at a point when it has already consumed a large amount of memory and then investigate it using a debugger with SOS plugin or the dotnet-dump analyze command. I can provide more details on that.
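
For anyone following along, a rough sketch of that workflow with dotnet-dump (the PID is a placeholder and the exact set of SOS commands depends on the tool version):

# capture a full dump of the leaking process inside the pod
dotnet-dump collect -p <PID> --type Full -o /tmp/leak.dmp

# open the dump; inside the interactive session, useful commands include:
dotnet-dump analyze /tmp/leak.dmp
#   eeheap -gc        GC heap sizes per generation
#   dumpheap -stat    histogram of managed objects (watch for SafeHandle-derived types)
#   finalizequeue     objects waiting on the finalizer thread
#   maddress          summary of native memory regions (recent SOS builds)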

Also, if you'd be able to create a repro that you'd be able to share with me, I'd be happy to look into it myself.

janvorli avatar Dec 19 '23 10:12 janvorli

It could also be related to this issue if it's continuous memory growth: #95362. Are you able to collect some GCCollectOnly traces so we can diagnose further?

@mangod9 just to double check, since this is happening on Linux, would perfcollect with the -gccollectonly flag as described here work the same way? 🙂

SamWilliamsGS avatar Jan 02 '24 14:01 SamWilliamsGS

Maybe it's W^X or #95362 like mentioned before. Can you try export DOTNET_EnableWriteXorExecute=0?

@MichalPetryka we gave disabling W^X a go today and unfortunately no difference. Is there a nightly docker image we could use to try out the TLS fix? 🙂 (I tried looking at the docs but got a bit mixed up with how backporting works in this repo, sorry!)

SamWilliamsGS avatar Jan 02 '24 16:01 SamWilliamsGS

Is there a nightly docker image we could use to try out the TLS fix?

For .NET 9 daily build testing, the install script can be used as follows:

VERSION=9
DEST="$HOME/.dotnet$VERSION"

# recreate destination directory
rm -rf "$DEST"
mkdir "$DEST"

# download and install
curl -sSL https://dot.net/v1/dotnet-install.sh | bash /dev/stdin --quality daily --channel "$VERSION.0" --install-dir "$DEST"

# add nuget feed
cat > "$HOME/.nuget/NuGet/NuGet.Config" <<EOF
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <add key="nuget.org" value="https://api.nuget.org/v3/index.json" protocolVersion="3" />
    <add key="dotnet$VERSION" value="https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet$VERSION/nuget/v3/index.json" />
  </packageSources>
</configuration>
EOF

PATH="$DEST":$PATH
DOTNET_ROOT="$DEST"
export PATH DOTNET_ROOT
# dotnet --info
# dotnet publish ..
# etc.

After changing net8.0 to net9.0 in *.csproj files, rebuild your docker image. After testing, revert these changes and rebuild the docker image again (to bring back net8.0).
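
A minimal sketch of that retarget-and-rebuild loop (project path and image tags are hypothetical):

# point the project at net9.0 and rebuild the image
sed -i 's/net8\.0/net9.0/g' src/MyService/MyService.csproj
docker build -t myservice:net9-daily .
# ... run the memory test ...
# revert to net8.0 and rebuild again
git checkout -- src/MyService/MyService.csproj
docker build -t myservice:net8 .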

am11 avatar Jan 02 '24 18:01 am11

We're also seeing the exact same issue

@janvorli do you still need any memory dumps? Happy to share privately

jtsalva avatar Jan 03 '24 13:01 jtsalva

@jtsalva that would be great! My email address is my github username at microsoft.com

janvorli avatar Jan 03 '24 13:01 janvorli

@am11 I wonder whether that docker file might be useful to have in the docs in this repo.

danmoseley avatar Jan 03 '24 14:01 danmoseley

[snip]

To try to figure out the culprit, it would be helpful to take a dump of the running process at a point when it has already consumed a large amount of memory and then investigate it using a debugger with SOS plugin or the dotnet-dump analyze command. I can provide more details on that.

Also, if you'd be able to create a repro that you'd be able to share with me, I'd be happy to look into it myself.

Thank you very much for the detailed response @janvorli, it's really appreciated 🙂 We have a profile from JetBrains' dotMemory and a few snapshots taken with dotTrace. Would those be helpful to you in lieu of a dotnet-dump? (We can probably also get the latter if that's better to have; we'll need to get clearance to send it to you regardless.) I've had difficulty getting a clean repro for this since it only seems to happen when it's hosted on Kubernetes, but I will keep working on it 😄

SamWilliamsGS avatar Jan 03 '24 15:01 SamWilliamsGS

@SamWilliamsGS I need a full dump that contains the whole process memory. I am not sure if the dotMemory profile contains that or not. In case you have sensitive data in the dump, I can just explain how to look at the interesting stuff and you can do it yourself.

janvorli avatar Jan 03 '24 16:01 janvorli

For .NET 9 daily build testing, the install script can be used as follows:

Thanks @am11. Looking at https://github.com/dotnet/runtime/pull/95362 it looks like this fix was backported to .NET 8, is that correct? And does that mean we'd expect to see the fix already in the standard .NET 8 docker images at this point? 🙂

SamWilliamsGS avatar Jan 04 '24 10:01 SamWilliamsGS

The backport PR (#95439) has a milestone of 8.0.2, so I don't think you'll see it in non-daily Docker images for .NET 8 until February's servicing release.

martincostello avatar Jan 04 '24 11:01 martincostello

@martincostello I'm curious, is there any place where information about the next planned servicing release date and fixes is published?

So far I have only found this page, which refers to "Patch Tuesday" every month: https://dotnet.microsoft.com/en-us/platform/support/policy/dotnet-core#servicing The GitHub milestone doesn't have a planned release date either: https://github.com/dotnet/runtime/milestone/133

We've updated some of our services to .NET 8 and faced similar unmanaged memory leaks.

dlxeon avatar Jan 04 '24 17:01 dlxeon

I'm afraid I don't know the definitive answer (I'm not a member of the .NET team); I just know from prior experience that there's typically a release on the second Tuesday of every month to coincide with Patch Tuesday. What makes it into those releases is often just detective work from looking at what's going on in GitHub PRs/issues, as releases with security fixes don't get worked on in public.

martincostello avatar Jan 04 '24 18:01 martincostello

Has anyone gained further insight into this? I'm also experiencing very similar issues and am working on getting some dumps and traces now to analyze further. It seems to be impacting some of our Kubernetes (k3s) services deployed to edge locations with tight memory constraints. Previously these services were on .NET 6 and would be fairly stable with limited memory (in the range of 128 MiB - 256 MiB). Since uplifting them to .NET 8 we are seeing higher base memory usage plus quite frequent OOMKilling, as memory seems to consistently grow over time with just k8s probes/health-checks running.

Enabling DATAS and GCConserve = 9 does seem to greatly improve things, but I still have tests that fail that used to pass on .NET 6. The tests in question all do some batch operations that require more memory than normal load, and with the higher usage in .NET 8 they just cause the pod to get OOMKilled.

It's hard to narrow down the exact changes, as every .NET uplift also brings countless dependency uplifts. Most of these services also went from EF Core 6 -> EF Core 8. I have tried some of the nightly docker images for aspnet 8 but am still seeing these tests fail when doing larger batch operations, so there must be some additional regression causing a higher memory footprint.
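
For anyone wanting to try the same combination, a sketch of the environment variables involved (assuming "GCConserve = 9" refers to the conserve-memory knob; the variable names are the documented runtime settings, the values are illustrative):

# opt in to DATAS (dynamic adaptation to application size); opt-in on .NET 8
export DOTNET_GCDynamicAdaptationMode=1
# trade throughput for a smaller heap: 0 (default) .. 9 (most aggressive)
export DOTNET_GCConserveMemory=9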

nhart12 avatar Jan 05 '24 20:01 nhart12

as memory seems to consistently grow over time

can you quantify the amount of growth? Could be related to https://github.com/dotnet/runtime/pull/95439 as suggested above.

mangod9 avatar Jan 05 '24 20:01 mangod9

With DATAS enabled it doesn't seem to grow (or possibly it just has more aggressive GC to account for the unmanaged leak?), but memory usage is simply higher. I'll try to collect some metrics next week. I'll likely have to revert some pods to .NET 6 to get some baselines to compare against, as we weren't watching it as closely until the OOMs. Without DATAS enabled, one of our pods (which normally averages ~100 MiB) would slowly, over a few hours, trickle more and more memory until getting killed at 128 MiB.

Wouldn't the nightly aspnet images have the TLS leak fix in them?

nhart12 avatar Jan 05 '24 20:01 nhart12

I have spent the last couple of months on and off investigating various cases of native memory growth / possible leaks. One of the cases I've found was just fixed by #96561. Once it is merged, we should backport it to .NET 8.

Another known issue is using dynamic assemblies created with AssemblyBuilderAccess.Run mode. These are never freed by design (AssemblyBuilderAccess.RunAndCollect mode enables the resulting assemblies to be collected). There are components using that, like the XmlSerializer, which with some of its constructors creates a dynamic assembly for each serialization over and over.

The last case I've seen was the TLS leak. There also might be a problem with GC not running as often as needed due to tiny GC heap objects holding large native memory allocations alive (like the SafeHandle derivatives used by the crypto / SSL stuff). I have no real proof of this really happening though.

In general, it is important to differentiate between memory growth that doesn't lead to OOM and one that does. I have investigated a repro that was seemingly growing infinitely, but when running in a container with limited memory, it reached a certain memory level and then stopped growing, since the GC started to kick in more often and kept the memory in check.

janvorli avatar Jan 05 '24 21:01 janvorli

There also might be a problem with GC not running as often as needed due to tiny GC heap objects holding large native memory allocations alive (like the SafeHandle derivatives used by the crypto / SSL stuff). I have no real proof of this really happening though.

This theory honestly tracks with what I've been seeing, and perhaps explains why with DATAS and more aggressive GC (GCConserve=9) we get much less frequent OOMs.

In general, it is important to differentiate between memory growth that doesn't lead to OOM and one that does. I have investigated a repro that was seemingly growing infinitely, but when running in a container with limited memory, it reached a certain memory level and then stopped growing, since the GC started to kick in more often and kept the memory in check.

I'll keep this in mind as I investigate next week. Just so I'm clear: if I use the nightly aspnet-8 images, would those contain the runtime built from the release/8.0-staging branch, or just things that have been backported into the release/8.0 branch?

nhart12 avatar Jan 05 '24 21:01 nhart12

We have 50+ different small microservices running .NET 6/7/8 in Kubernetes. Infrequently used services have a 200-300MB memory limit. Pods that were working fine on .NET 6 with such limits started to fail after the .NET 8 update because of OOM kills. We had to increase memory limits for all pods and still observe unmanaged leaks.

A few observations from the .NET 6 => .NET 8 migration:

  1. Looks like the libraries themselves became a bit bigger. Same app before/after migration, 6.x vs 8.x libraries, two memory dumps: the "lm" command in dotnet dump analyze reports ~15MB more for the .NET 8 dump. This is 105MB => 120MB for our typical small service (ASP.NET Core, OpenTelemetry, Microsoft.Data.SqlClient + Dapper).

  2. .NET 7 and .NET 8 projects allocate a bit more memory compared to similar .NET 6 projects (~30MB for small services that are mostly idle except for regular telemetry reading through OpenTelemetry 1.6/1.7).

  3. .NET 8 does less frequent garbage collections, so memory utilization might be a bit higher. As .NET allocates 75% by default for the managed part and 25% for the unmanaged part, my theory is that with a 200MB pod limit only 50MB is "allocated" for the unmanaged part, but in reality .NET requires more than 50MB unmanaged. I guess it assumes there is plenty of managed memory available and sees no need to do aggressive GC. It should trigger GC when reaching 90% of physical memory, but I think in case of a short allocation spike Kubernetes kills the pod faster than .NET decides to do a GC. We added DOTNET_GCHeapHardLimit and that improves the situation (see the sketch after this list). However, because of the unmanaged leaks (?) the application is still getting killed by k8s.

  4. Memory dumps show thousands of objects in the freachable queue. One service "leaks" some gRPC-related data, another "leaks" SqlConnections. The common part in those dumps is Microsoft.Win32.SafeHandles.SafeX509Handle, but maybe it is not a real leak. The theory is that .NET doesn't think it is a good time to run the finalizer thread to clean up, which causes that "leak" to grow. I tried to reproduce it locally: if I force .NET to do aggressive GC and then take a memory dump, I don't see as many objects there; if I take a memory dump using task manager, I see lots of objects in the freachable queue, just as we see in k8s production.

  5. Most likely there are different leak causes. One of our services "leaks" at a steady, linear pace. Another service quickly grows in size, then after some point starts growing slower, then slower still. Service1 (.Net8): Leak pattern 1

Service2 (.Net6 => .Net8): Leak pattern 2
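
As referenced in point 3, a sketch of the hard-limit configuration (the variable names are the documented GC settings; the values here are illustrative only, and both are specified in hex):

# absolute managed-heap limit in hex bytes, e.g. 150 MiB for a 200 MiB pod
export DOTNET_GCHeapHardLimit=0x9600000
# or as a percentage of the container memory limit (0x4B = 75%)
export DOTNET_GCHeapHardLimitPercent=0x4B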

dlxeon avatar Jan 08 '24 08:01 dlxeon

Hi all, we have identified and fixed one leak related to OCSP stapling. It would be great to get verification from any of you on whether the fix resolves the memory leak you are observing. @SamWilliamsGS, @jtsalva, @nhart12, @dlxeon.

ocsp-fix.zip

To test the fix, publish your app as self-contained and overwrite the libSystem.Security.Cryptography.Native.OpenSsl.so library in the publish output.
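
For reference, a sketch of those steps (the RID, project name, and zip layout are assumptions - adjust to your setup):

# publish self-contained for linux-x64, then drop the patched OpenSSL shim on top
dotnet publish -c Release -r linux-x64 --self-contained true -o out MyService.csproj
unzip ocsp-fix.zip -d ocsp-fix
cp ocsp-fix/libSystem.Security.Cryptography.Native.OpenSsl.so out/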

If you'd rather go to the trouble of compiling the lib yourself, it comes from these branches:

  • https://github.com/dotnet/runtime/tree/backport/pr-96561-to-release/7.0-staging
  • https://github.com/dotnet/runtime/tree/backport/pr-96561-to-release/8.0-staging

To build them, running ./build.sh -s Libs.Native -lc release should be enough.

rzikm avatar Jan 08 '24 11:01 rzikm

We have 50+ different small microservices running .Net6/7/8 in Kubernetes. Infrequently used services have 200-300Mb memory limit. Pods that were working fine on .Net6 with such limits started to fail after .Net8 update because of OOM kills. We had to increase memory limits for all pods and still observe unmanaged leaks.

@dlxeon thank you for the analysis. Your symptoms seem nearly identical to mine under tight memory limits, and it's the same sort of objects I was seeing in the memory dumps, but without any real proof of leaks, as manually running GC would clear them. I was hoping to collect these kinds of metrics this week (time permitting), comparing to .NET 6, but I suspect you are correct in your point #4 about k8s being faster to OOM kill than the finalizer is to run. My initial suspicion was that GC with DATAS enabled was viewing k8s probes as traffic and so was always prioritizing throughput and delaying GC more and more.

nhart12 avatar Jan 08 '24 14:01 nhart12

@rzikm Thank you for the solution. That looks like the root cause for one of our services. I took a dump of "Service 2", which has now leaked about 1GB, and found 50000+ copies of some certificate data in the dump, as well as numerous binary data strings like http://ocsp.rootca1.amazontrust.com0, http://ocsp.r2m03.amazontrust.com, http://ocsp.rootg2.amazontrust.com etc.

I will try to apply the fix and reproduce the issue locally, but the complication is that it reproduces under load in the prod environment, and we don't yet have a good setup to reproduce that in the staging environment or locally.

dlxeon avatar Jan 08 '24 15:01 dlxeon

http://ocsp.rootca1.amazontrust.com0, http://ocsp.r2m03.amazontrust.com, http://ocsp.rootg2.amazontrust.com etc

The OCSP server you are hitting probably returns responses which include a certificate; these URLs look like the URLs from which the OCSP staple for that cert can be retrieved.

@dlxeon If it helps, the specific trigger for the leak is outbound requests via HttpClient to a server which serves OCSP staples (and authenticates clients via a client certificate). You could try dumping the memory of your test/staging instances and looking for similar strings in there. It is possible there will be some live instances, but their number should not grow.

rzikm avatar Jan 08 '24 16:01 rzikm

@rzikm I think the root cause of our issue is not the OCSP bug. I used your fixed version and got the same results. Then, after a careful code review, I found an unexpected problem on our side - there were scenarios where an HttpClient was created with HttpClientFactory and not disposed (facepalm). After we fixed that issue, local tests showed a reduction in memory used by the service process and fewer instances of binary strings containing http://ocsp... in memory dumps taken after the same test load. Late next week we will release that code to production and then we can definitely tell if it was the root cause for "Service 2".

dlxeon avatar Jan 10 '24 12:01 dlxeon