Health Check Causes Memory Leak in .NET 8
Is there an existing issue for this?
- [X] I have searched the existing issues
Describe the bug
In a .NET 8 application, health checks executed at a regular interval eventually leak all available memory, causing the container to crash.
Expected Behavior
Health check should not leak memory. Memory usage should remain relatively consistent or at least be reclaimed by garbage collection.
Steps To Reproduce
Reproduced with the following system configuration:
- Windows 11 build 22631.3155
- SDK: 8.0.102
- Runtime: 8.0.2
- Set up a new .NET Web API application using .NET 8 as the target framework
- Modify the auto-generated `Program.cs` to include the following block of code:
```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHealthChecks();
var app = builder.Build();

// Map the health check endpoint and disable response caching.
app.MapHealthChecks("/hc", new HealthCheckOptions
{
    AllowCachingResponses = false,
});
app.Run();
```
From there, run the program with a memory profiler (I used dotMemory 2023.3.3). Notice that each time `/hc` is called, the health check leaks ~100KB of RAM.
Now, modify the `*.csproj` file to change the target framework to .NET 6:

<TargetFramework>net6.0</TargetFramework>
Note:
- You will need to remove the OpenAPI reference in the `*.csproj`
- You will also need to remove the `.WithOpenApi();` extension in `Program.cs`
Now repeat the profiling exercise. Notice that, although the memory is not 100% constant, it doesn't really exhibit "leaking" behavior.
Please advise as to how to proceed. If you set up a Docker container to poll the service at ~10 second intervals, you'll end up with hundreds of MBs of RAM leaked over a few-hour timeframe.
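(If you don't want to stand up a container, a trivial poller like the hypothetical sketch below can drive the endpoint at the same cadence; the URL, port, and interval here are illustrative only.)

```csharp
// Hypothetical poller used only to reproduce the growth; assumes a .NET 8 console
// app with implicit usings. Adjust the URL/port to wherever the sample API listens.
using var client = new HttpClient();
using var timer = new PeriodicTimer(TimeSpan.FromSeconds(10));

while (await timer.WaitForNextTickAsync())
{
    var response = await client.GetAsync("http://localhost:5000/hc");
    Console.WriteLine($"{DateTime.Now:T} /hc -> {response.StatusCode}");
}
```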
Severity: SEVERE; risk of production crash. Framework is effectively not usable.
Exceptions (if any)
Memory should not leak during a basic health check.
.NET Version
8.0.102
Anything else?
Similar to an issue reported in the runtime itself: https://github.com/dotnet/runtime/issues/99304

Additional notes:
Did some more tests here, again using the sample API with a GET endpoint of `/weatherforecast`:
- If I poll `/weatherforecast` at ~10 second intervals:
  - No memory leak is observed
  - Memory usage is more or less consistent, regardless of whether `builder.Services.AddHealthChecks();` is called
- If I poll the `/hc` endpoint (see sample code above):
  - Memory leak is observed
  - The average leak per call to `/hc` is ~100KB (some health checks leak more than this, some less, some none at all)
The problem here is not the ~100KB that is leaked; the problem is the compounding nature of the requests. If the health check is polled by container infrastructure where the container orchestration specifies a max memory allocation, we'll eventually run out of RAM and crash.
I'm trying to reproduce this, as we are also experiencing memory problems with .NET 8, most of which seemed to disappear with 8.0.2. In `Startup.cs`, we do:
var healthChecksBuilder = services.AddHealthChecks();
healthChecksBuilder.AddCheck(...)
...
app.UseHealthChecks("/health", new HealthCheckOptions() {
AllowCachingResponses = false,
});
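(The elided `AddCheck(...)` calls above are ordinary registrations; a hypothetical example of the shape is below. Our real checks are more involved.)

```csharp
// Hypothetical registration shown only to illustrate the shape of the call;
// requires: using Microsoft.Extensions.Diagnostics.HealthChecks;
healthChecksBuilder.AddCheck("self", () => HealthCheckResult.Healthy("OK"));
```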
I then run this locally in an `alpine:8.0` container and call the health endpoint in a loop. When I do `docker stats`, I don't see the memory rising. Is there anything I am missing?
@Kiechlus I realized after I posted this that the SDK we're using is actually 8.0.1 - I'm not sure why that would make a difference, but I will try to upgrade the SDK to 8.0.2 and see if that fixes the issue.
Also, my container base image isn't alpine, it's photon. Again, not sure if that makes a difference.
I am out of the office this week but should be able to test this out next week
@debracey If you could keep us posted after you are back on whether upgrading to 8.0.2 fixed the issue, that would be appreciated. We are seeing unexplained memory 'leaks' after upgrading all our (Kubernetes) services to .NET 8. Any data point will be helpful for us.
@GimmeDaKitty, @Kiechlus I went back and looked at my notes on this now that I am back in the office. In the test described above I wasn't using a container at all; I was just running the service directly on Windows out of my IDE (Rider 2023.3.3).
I upgraded my SDK to 8.0.203 and the issue seems to be resolved. I no longer see memory increasing (in the unmanaged memory section) with each call to HC.
I'll reach out to my infrastructure team to see what SDK our build pipelines use. I'll report back again once I can retest with an actual productionalized service.
@debracey is this still resolved for you?
@GimmeDaKitty @Kiechlus are you still seeing this on the latest version (8.0.4)?
Wondering whether this can be closed. Part of the reason I'm here is that several issues seem to have contributed to https://github.com/dotnet/runtime/issues/95922, and to make progress on the whole we need to look at them all.
Adding @andreiZi, who might still have a repro for this.
Linking that comment: https://github.com/dotnet/runtime/issues/95922#issuecomment-2105189870
From what I observed, my assumption is that the microservice receives health check requests every 15 seconds, and it seems that something is not being cleaned up properly with each request.
@andreiZi is there a way for you to repeat the experiment but with health checks disabled?
Hello @danmoseley,
I've set up two scenarios as follows:
Scenario 1: Build Services with ASP.NET 8.0.4 and SDK 8.0.203 (Healthchecks enabled)
Here is the Dockerfile used:
FROM mcr.microsoft.com/dotnet/aspnet:8.0.4 AS base
WORKDIR /app
EXPOSE 80
EXPOSE 443
ENV ASPNETCORE_URLS=http://*:80
FROM --platform=$BUILDPLATFORM mcr.microsoft.com/dotnet/sdk:8.0.203 AS build
WORKDIR /src
ARG TARGETARCH
COPY ["BuildingBlocks/BuildingBlocks.Common/BuildingBlocks.Common.csproj", "BuildingBlocks/BuildingBlocks.Common/"]
COPY ["BuildingBlocks/BuildingBlocks.DataAccess/BuildingBlocks.DataAccess.csproj", "BuildingBlocks/BuildingBlocks.DataAccess/"]
COPY ["Services/User/User.Api/User.Api.csproj", "Services/User/User.Api/"]
COPY ["Services/User/User.Application/User.Application.csproj", "Services/User/User.Application/"]
COPY ["Services/User/User.Core/User.Core.csproj", "Services/User/User.Core/"]
COPY ["Services/User/User.DataAccess/User.DataAccess.csproj", "Services/User/User.DataAccess/"]
RUN dotnet restore -a ${TARGETARCH} "Services/User/User.Api/User.Api.csproj"
COPY . ./
WORKDIR "Services/User/User.Api/"
RUN dotnet build "User.Api.csproj" -c Release -o /app/build
FROM build AS publish
RUN dotnet publish "User.Api.csproj" -c Release -o /app/publish
FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "User.Api.dll"]
Scenario 2: Deploy Services Completely Without Health Checks
My implementation of the health checks is as follows:
services.AddHealthChecks();
app.UseEndpoints(endpoints =>
{
endpoints.MapControllers().AddAsyncValidationFilter();
endpoints.MapHealthChecks("/health");
});
I will now observe and monitor the behavior. Probably in a few hours, I will report my observations.
Unfortunately, this approach did not solve the problem.
This morning, both services reached nearly 500-700 MiB of RAM consumption. I managed to extract dumps from both services. Here is the SharePoint link where you can access them: SharePoint Link.
Below are the outputs from `dumpheap -stat` showing the largest memory consumers for each system:
System with Health Checks Completely Disabled (Running in Kubernetes with ASP.NET 8.0.4 and SDK 8.0.203):
7fc5f52a00c8 21,366 683,712 Serilog.Events.StructureValue
7fc5f311bf28 21,369 854,760 Serilog.Sinks.Grafana.Loki.Models.LokiLogEvent
7fc5f3ae8948 21,573 864,880 Serilog.Events.LogEventProperty[]
7fc5f5299c70 22,369 1,789,520 Serilog.Events.LogEvent
7fc5f52ab110 22,369 1,789,520 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>
7fc5f5a2db20 5,051 2,207,624 System.Collections.Generic.KeyValuePair<System.String, Serilog.Events.LogEventPropertyValue>[]
7fc5f07b89e0 27,007 2,640,180 System.Int32[]
7fc5f4e2e680 181,134 4,347,216 Serilog.Events.ScalarValue
7fc5f2e6b438 137,286 4,393,152 Serilog.Events.LogEventProperty
7fc5f08863b8 6,140 8,878,194 System.Char[]
7fc5f52abf90 26,471 11,840,640 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[]
7fc5f07bd7c8 184,071 43,804,328 System.String
7fc5f1349800 4,028 50,552,143 System.Byte[]
55ffb238c7a0 35,601 180,890,880 Free
Total 800,018 objects, 320,912,524 bytes
System with Health Checks Enabled (Running in Kubernetes with ASP.NET 8.0.4 and SDK 8.0.203):
Example 1
7f4d61560980 54,053 1,297,272 System.Int32
7f4d66854398 5,001 2,104,024 System.Collections.Generic.KeyValuePair<System.String, Serilog.Events.LogEventPropertyValue>[]
7f4d63f3ce10 18 2,408,488 Serilog.Sinks.Grafana.Loki.Models.LokiLogEvent[]
7f4d660c4840 137,406 4,396,992 Serilog.Events.StructureValue
7f4d64909150 145,247 5,665,856 Serilog.Events.LogEventProperty[]
7f4d63c8b438 263,604 8,435,328 Serilog.Events.LogEventProperty
7f4d63f3bf28 251,316 10,052,640 Serilog.Sinks.Grafana.Loki.Models.LokiLogEvent
7f4d616a63b8 15,677 11,471,428 System.Char[]
7f4d615d89e0 263,406 16,636,700 System.Int32[]
7f4d660bd788 252,316 20,185,280 Serilog.Events.LogEvent
7f4d660cf888 252,316 20,185,280 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>
7f4d62169800 4,579 30,148,160 System.Byte[]
7f4d65c625b8 1,616,214 38,789,136 Serilog.Events.ScalarValue
7f4d660f0810 261,573 67,245,936 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[]
7f4d615dd7c8 311,686 74,194,268 System.String
55716dec4620 280,214 116,343,616 Free
Total 4,349,002 objects, 441,525,143 bytes
Example 2:
Microsoft.AspNetCore.Server.Kestrel.Core.Internal.ConnectionLogScope+<GetEnumerator>d__7
7f49cb441260 14,776 354,624 Microsoft.Extensions.Diagnostics.HealthChecks.HealthStatus
7f49cb730638 8,460 473,760 System.IO.StringWriter
7f49c7e63c38 13,205 633,840 System.Text.StringBuilder
7f49c6c140f0 37,077 889,848 System.Double
7f49c6ba0980 96,629 2,319,096 System.Int32
7f49c957ce10 7 4,751,336 Serilog.Sinks.Grafana.Loki.Models.LokiLogEvent[]
7f49cb703148 247,891 7,932,512 Serilog.Events.StructureValue
7f49c92cb438 282,483 9,039,456 Serilog.Events.LogEventProperty
7f49c9f49150 253,965 9,734,272 Serilog.Events.LogEventProperty[]
7f49c6ce63b8 14,298 12,677,154 System.Char[]
7f49c957bf28 494,089 19,763,560 Serilog.Sinks.Grafana.Loki.Models.LokiLogEvent
7f49c6c189e0 499,985 29,983,368 System.Int32[]
7f49c77a9800 4,419 34,278,207 System.Byte[]
7f49cb6fcc78 494,139 39,531,120 Serilog.Events.LogEvent
7f49cb70e190 494,139 39,531,120 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>
7f49cb2a25b8 3,117,157 74,811,768 Serilog.Events.ScalarValue
7f49cb70f010 498,426 118,985,088 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[]
7f49c6c1d7c8 492,305 126,014,400 System.String
I've continued to observe the systems and have noticed a significant difference in memory consumption.
On the system where health checks are enabled, there is definitely higher memory usage. Conversely, the development system shows a much more stable memory consumption curve. When compared to an environment running .NET 6, this represents a slight increase—about 50-100MB more.
Here, you can see the API gateway's memory usage over the last 9 hours, which stabilized around 8 PM:
In contrast, this is the system with health checks enabled, which saw memory usage ramp up to 762MB over 4 hours:
I suspect that .NET 8 generally consumes more memory than .NET 6. I don't mind the increase, but can you confirm whether this is expected?
I will continue to monitor the situation and hope that the workaround on the development system proves effective. Although health checks are crucial for us, resolving the memory issue would be highly beneficial.
I can confirm what @andreiZi posted above. The leak appears relatively slow at first; eventually GC kicks in (albeit much later than on .NET 6) and stabilizes things for a while, before memory creeps up slowly again, repeating the cycle. You can forcibly trigger the leak by hammering the health check at < 10 second intervals. The leak is made worse if your `HealthCheckResult` returns a large amount of string data in the dictionary (e.g., the name + version of every DLL your app uses).
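As an illustration of that last point, here is a minimal sketch of the kind of string-heavy check I mean (the class name, check name, and registration are hypothetical, not our actual check):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical check that stuffs every loaded assembly's name and version into
// the result's data dictionary, producing a lot of string data on every poll.
public sealed class AssemblyInfoHealthCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var data = new Dictionary<string, object>();
        foreach (var assembly in AppDomain.CurrentDomain.GetAssemblies())
        {
            var name = assembly.GetName();
            data[name.Name ?? "unknown"] = name.Version?.ToString() ?? "unknown";
        }

        return Task.FromResult(HealthCheckResult.Healthy("OK", data));
    }
}

// Hypothetical registration in Program.cs:
// builder.Services.AddHealthChecks().AddCheck<AssemblyInfoHealthCheck>("assembly-info");
```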
My earlier statement about this problem being solved with 8.0.203 is only partially correct. Some of our microservices are still leaking even on 8.0.203, but it's difficult for me to track down because they may have other memory issues that we're only now finding as a byproduct of this investigation. That said, I still stand by my prior statement that 8.0.203 improves the situation significantly.
However, in speaking with my colleagues who handle overall environment scaling, we believe that this issue still needs to be investigated, because even a slow leak will mess up the pod-scaling algorithms we have in place, and we'll eventually be paying for a bunch of executing pods unnecessarily.
I need to correct my previous assumption that disabling health checks resolved our issue; the services still ran out of RAM. As @debracey mentioned, there could be multiple reasons causing the leak. In my dump, System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[] is one of the largest consumers of memory.
@debracey, we are facing the same issue. Unfortunately, I overlooked the memory leak problem when upgrading the services to .NET 8 in production. Now we find ourselves in a situation where our HPAs are scaling the pods excessively, like wildfire. Manual intervention is required because we do not wish to change our deployment configuration.
Here you can see the service I posted about 1 hour ago. Just as @debracey described, it stabilizes the memory consumption for a while, but now we can clearly see that it starts to increase again.
Here are the last relevant symbols from the output of `dumpheap -s`:
7fca8bca63b8 312 455,082 System.Char[]
7fca906bdb58 22,620 723,840 Serilog.Events.StructureValue
7fca8e28b438 23,462 750,784 Serilog.Events.LogEventProperty
7fca8e53bf28 22,623 904,920 Serilog.Sinks.Grafana.Loki.Models.LokiLogEvent
7fca8ef08948 23,187 932,320 Serilog.Events.LogEventProperty[]
7fca906b7718 22,623 1,809,840 Serilog.Events.LogEvent
7fca906d8bb0 22,623 1,809,840 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>
7fca8bbd89e0 23,440 1,967,548 System.Int32[]
7fca9024e6b0 191,221 4,589,304 Serilog.Events.ScalarValue
7fca906d9a30 22,905 8,306,688 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[]
7fca8bbdd7c8 134,245 26,716,442 System.String
7fca8c769800 4,299 50,249,261 System.Byte[]
560b529c1790 13,560 75,716,040 Free
Yeah. Same issue. My graphs look basically the same. (I'm not able to post dumps/graphs of my services due to company policy).
Thank you both. Did you try looking at what is rooting some of those dictionaries? `!gcroot` on some randomly selected ones?
Hey @danmoseley
Yes, I consistently receive the following output:
Found 0 unique roots.
Yes. We also see the same. I thought the output was "No roots found" -- but I'm not at my work computer so presumably @andreiZi has the correct output.
@danmoseley
I've obtained a dump from one of my previous .NET 6 applications for reference. These applications are essentially the same as the .NET 8 versions, with only upgraded packages and a few syntax adjustments.
Here is the output (note: the service has been running for over 156 days):
7fc38b6d5270 41,696 1,000,704 Serilog.Events.ScalarValue
7fc38862e830 22,357 1,073,136 System.Text.StringBuilder
7fc3898ee320 38,939 1,246,048 Serilog.Events.LogEventProperty
7fc38beb6460 7,854 1,958,448 System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[]
7fc387cc9190 10,529 3,262,729 System.Byte[]
7fc3872f67d8 22,508 11,251,310 System.Char[]
7fc3872a1038 40,739 19,927,984 System.String
Total 461,985 objects, 57,327,493 bytes
I can observe the same symbols as in .NET 8, but with significantly less memory consumption. As I mentioned, the service has been running for an extended period of time.
Here is the graph showing the memory consumption over the last 24 hours. The service experienced moderate usage, but not much load.
(Don't mind the small spike at the end; it occurred when I created a dump.)
I have the opportunity to share as much information as necessary since the development system of the application is very young and doesn't contain any sensitive information. If you need more details, please let me know.
> Found 0 unique roots.
@mangod9 can you advise?
Removing Serilog somehow resolved the issue. I'll now try re-enabling the health checks to see if that was the cause.
@debracey are you also using Serilog at your company?
Ok, interesting. Did you have to upgrade Serilog when moving to 8, or is it the same version across 6 and 8? We will also look at the dump you have shared on the other issue to figure out if we can find the rooting for those Serilog dictionaries.
We don’t use Serilog. We use NLog, I’m not sure of the version or the upgrade history on it. I’ll check it on Monday and see when we upgraded it and if a new version is available.
I suspect that the `Serilog.Sinks.Grafana.Loki` package might be causing the memory leak. After removing it, the memory usage trends look much better. Before jumping to any conclusions, I'll continue to monitor the situation. I had considered removing this package anyway, but wanted to share this information here for anyone experiencing similar memory issues and possibly using this package.
@mangod9, does .NET's garbage collector clean up memory that might be improperly handled by a package?
Memory Usage Trends (Health checks & Serilog now enabled)
I also updated Serilog when transitioning to .NET 8. The `Serilog.Sinks.Grafana.Loki` package is on the latest version.
My assumption is that this package caches logs when Loki is unreachable (we haven't set up Loki in the system yet), but this is just a hypothesis.
In the previous dumps, we also observed the symbol:
System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[]
This symbol was not present in .NET 6 and appears to be growing. Previously, we only saw:
System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>
After removing the Serilog.Sinks.Grafana.Loki package, this symbol also vanished.
My particular issue was resolved using the approach I described above. The package had an infinite queue limit when Loki was unreachable, which filled up the memory until an `OutOfMemoryException` was thrown.
Initially, it was a combination of both health checks and the leaking package. However, with ASP.NET 8.0.4 and SDK 8.0.203, along with the proper adjustments to the logger configuration, all my issues were resolved.
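For anyone hitting the same thing, the adjustment boils down to bounding the sink's pending-event queue. The sketch below is illustrative only; the `queueLimit` parameter name, the value, and the endpoint are assumptions to verify against the installed version of `Serilog.Sinks.Grafana.Loki`:

```csharp
using Serilog;

// Illustrative sketch, not our exact configuration: cap the in-memory queue so an
// unreachable Loki endpoint cannot buffer events without bound. The queueLimit
// parameter name and value are assumptions; check the sink version you use.
Log.Logger = new LoggerConfiguration()
    .WriteTo.GrafanaLoki(
        "http://loki:3100",    // placeholder endpoint
        queueLimit: 10000)     // bounded queue instead of the unbounded default
    .CreateLogger();
```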
@debracey, if your error still persists, I recommend getting a dump of the process and inspecting the highest consumers. In my case, it was `System.Collections.Generic.Dictionary<System.String, Serilog.Events.LogEventPropertyValue>+Entry[]` that pinpointed the issue for me.
After weeks of debugging and head-cracking, I can finally utilize those C#12 features 🚀. Thank you for the help, everyone; you do a magnificent job!
I went back and tested this again. I've now updated my local SDK to .NET 8.0.300, and the test environment now reports the information below. My services don't use Serilog, so there was nothing to update there. We use NLog, which is/was at 5.2.4 (latest = 5.3.2). I tested at 5.2.4 to keep things consistent, and also at 5.3.2 for comparison.
The situation is much improved, but we're still seeing leaks; it's similar to this other post about .NET 8 leaks.
Unfortunately, I cannot post the full debug screenshot due to company policy. However, here's an overview of what it looks like just for `String`:
| Generation | Number of Objects |
|---|---|
| Gen 0 | 69,991 |
| Gen 1 | 96 |
| Gen 2 | 4,345 |
| Total | 74,432 |
It is possible that there's something in our own codebase that is exacerbating the issue. But it seems really strange that we're retaining 70K strings in gen 0. All we're doing is sitting around health checking.
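(As an aside, a minimal way to watch for this kind of creep without keeping a profiler attached is a small GC-stats endpoint; the `/gcinfo` route below is a hypothetical sketch, not something in our service.)

```csharp
// Hypothetical /gcinfo endpoint: reports GC heap figures so growth can be
// graphed alongside calls to /hc. Uses only standard System.GC APIs.
app.MapGet("/gcinfo", () =>
{
    var info = GC.GetGCMemoryInfo();
    return Results.Ok(new
    {
        heapSizeBytes = info.HeapSizeBytes,
        fragmentedBytes = info.FragmentedBytes,
        totalAllocatedBytes = GC.GetTotalAllocatedBytes(),
        gen0Collections = GC.CollectionCount(0),
        gen1Collections = GC.CollectionCount(1),
        gen2Collections = GC.CollectionCount(2)
    });
});
```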
Test Environment:
Host:
Version: 8.0.4
Architecture: x64
Commit: 2d7eea2529
RID: linux-x64
.NET SDKs installed:
No SDKs were found.
.NET runtimes installed:
Microsoft.AspNetCore.App 8.0.4 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 8.0.4 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
cc/ @andreiZi , @danmoseley
Did further research on this, specifically looking at the issues linked here:

1. [Guard against -1 returned from sysconf](https://github.com/dotnet/runtime/pull/100502#issue-2219207285)
   - Output of `getconf -a | grep LEVEL4_CACHE` on my photon base image returns 0, so it would seem that I am not hitting that specific issue.
2. [Upgrading to .NET 8 causes a sudden increase in LOH size](https://github.com/dotnet/runtime/issues/101906#issuecomment-2096685440)
   - Adding an `ENV DOTNET_GCgen0Size=0x1E00000` command immediately before starting my application caused the results outlined below.

At first, adding `DOTNET_GCgen0Size=0x1E00000` looked promising. But after a while (30 minutes or so) the same problem occurred: memory started creeping up, and even after an apparent GC event where the memory dropped back down, it never dropped back to the baseline and would creep back up. The number of objects stuck in the various generations is higher than in my previous post. Again looking only at strings, Gen 0 had 175K unreachable strings; the other generations were similar.

So, still basically stuck. Something seems to be causing a huge number of objects -- mostly strings -- to remain in Gen 0.
@debracey shouldn't `DOTNET_GCgen0Size` be `DOTNET_GCgen0size`? (lowercase s)
Reference: https://github.com/search?q=repo%3Adotnet%2Fruntime%20GCgen0size&type=code
@jbroumels The other post mentioned a capital S. I can try again with a lowercase s --- will report back.