aspnetcore
aspnetcore copied to clipboard
Remove some `GCHandle`s from `SendHeaders()`
Description
This introduces a new IBufferWriter<T> implementation that manages unmanaged block allocations. The
intent here was to remove the allocation of handles and temporary managed arrays allocations and
and replace them with native memory.
Instead of using an unmanaged memory allocator like NativeMemory, the GC.AllocateUninitializedArray() could be used to create pinned memory. I've rejected this because it puts the onus on the GC to collect the memory whereas with the unmanaged memory it can be freed precisely when it is no longer needed. Since the pinned heap is presently considered only on a gen2 collection it leaves this unused memory around far longer than needed and creates unnecessary pressure.
/benchmark list
Crank Pull Request Bot
/benchmark <benchmarks[,...]> <profiles[,...]> <components,[...]>
Benchmarks:
plaintext: TechEmpower Plaintext Scenario - ASP.NET Platform implementationjson: TechEmpower JSON Scenario - ASP.NET Platform implementationfortunes: TechEmpower Fortunes Scenario - ASP.NET Platform implementationyarp: YARP - http-http with 10 bytesmvcjsoninput2k: Sends 2Kb Json Body to an MVC controller
Profiles:
aspnet-perf-lin: INTEL/Linux 12 Coresaspnet-perf-win: INTEL/Windows 12 Coresaspnet-citrine-lin: INTEL/Linux 28 Coresaspnet-citrine-win: INTEL/Windows 28 Coresaspnet-citrine-arm: ARM/Linux 32 Coresaspnet-citrine-amd: AMD/Linux 48 Cores
Components:
kestrelmvc
Http.sys isn't in this list. I recently did a perf change to http.sys and had to build locally and run a benchmark. We should add the JSON one.
cc @sebastienros @adityamandaleeka @Tratcher
BTW is there an easy way to track alloc/free using counters? Would it even be useful? Or would you switch to native profilers at that point?
You would switch to native profilers to profile unmanaged memory allocators.
BTW is there an easy way to track alloc/free using counters? Would it even be useful? Or would you switch to native profilers at that point?
The VS profiler has an unmanaged memory allocation tracker. /cc @karpinsn
Fixes/helps https://github.com/dotnet/aspnetcore/issues/40215?
Fixes/helps #40215?
Oh wow. Yes! So glad there is an issue filed for this. I'll mark this PR as fixing that.
I think this is a good first pass to measure performance. Some potential changes once we get through this:
- Managed pointers
- Stackalloc for known headers when under a certain size
- Stackalloc for data chunks when under a certain size
Benchmark for current changes.
| application | baseline | Current | |
| --------------------------------------- | ------------------------ | ------------------------ | ------- |
| CPU Usage (%) | 64 | 67 | +4.69% |
| Cores usage (%) | 1,788 | 1,880 | +5.15% |
| Working Set (MB) | 484 | 486 | +0.41% |
| Private Memory (MB) | 514 | 530 | +3.11% |
| Build Time (ms) | 6,666 | 3,886 | -41.70% |
| Start Time (ms) | 328 | 344 | +4.88% |
| Published Size (KB) | 118,659 | 118,659 | 0.00% |
| .NET Core SDK Version | 8.0.100-alpha.1.22514.14 | 8.0.100-alpha.1.22514.14 | |
| Max CPU Usage (%) | 67 | 65 | -2.51% |
| Max Working Set (MB) | 501 | 508 | +1.40% |
| Max GC Heap Size (MB) | 255 | 257 | +0.73% |
| Size of committed memory by the GC (MB) | 452 | 456 | +0.77% |
| Max Number of Gen 0 GCs / sec | 8.00 | 7.00 | -12.50% |
| Max Number of Gen 1 GCs / sec | 2.00 | 2.00 | 0.00% |
| Max Number of Gen 2 GCs / sec | 1.00 | 1.00 | 0.00% |
| Max Time in GC (%) | 1.00 | 1.00 | 0.00% |
| Max Gen 0 Size (B) | 36,651,360 | 41,094,992 | +12.12% |
| Max Gen 1 Size (B) | 6,313,496 | 8,741,000 | +38.45% |
| Max Gen 2 Size (B) | 1,511,712 | 1,463,744 | -3.17% |
| Max LOH Size (B) | 113,184 | 113,184 | 0.00% |
| Max POH Size (B) | 1,773,536 | 825,936 | -53.43% |
| Max Allocation Rate (B/sec) | 1,687,780,712 | 1,760,190,512 | +4.29% |
| Max GC Heap Fragmentation | 90 | 91 | +1.19% |
| # of Assemblies Loaded | 88 | 88 | 0.00% |
| Max Exceptions (#/s) | 0 | 0 | |
| Max Lock Contention (#/s) | 1,078 | 1,400 | +29.87% |
| Max ThreadPool Threads Count | 48 | 48 | 0.00% |
| Max ThreadPool Queue Length | 178 | 134 | -24.72% |
| Max ThreadPool Items (#/s) | 1,319,545 | 1,130,717 | -14.31% |
| Max Active Timers | 0 | 0 | |
| IL Jitted (B) | 117,381 | 156,942 | +33.70% |
| Methods Jitted | 1,315 | 2,083 | +58.40% |
| load | baseline | Current | |
| ---------------------- | --------- | --------- | ------- |
| CPU Usage (%) | 40 | 41 | +2.50% |
| Cores usage (%) | 1,108 | 1,149 | +3.70% |
| Working Set (MB) | 49 | 48 | -2.04% |
| Private Memory (MB) | 376 | 376 | 0.00% |
| Start Time (ms) | 0 | 0 | |
| First Request (ms) | 215 | 243 | +13.02% |
| Requests/sec | 401,248 | 423,907 | +5.65% |
| Requests | 6,058,808 | 6,400,799 | +5.64% |
| Mean latency (ms) | 0.69 | 0.62 | -10.17% |
| Max latency (ms) | 24.05 | 22.32 | -7.19% |
| Bad responses | 0 | 0 | |
| Socket errors | 0 | 0 | |
| Read throughput (MB/s) | 63.52 | 67.11 | +5.65% |
| Latency 50th (ms) | 0.58 | 0.56 | -3.76% |
| Latency 75th (ms) | 0.78 | 0.74 | -4.77% |
| Latency 90th (ms) | 0.99 | 0.96 | -3.03% |
| Latency 99th (ms) | 2.20 | 1.72 | -21.82% |
/cc @davidfowl
Benchmarks against .NET 8
| application | baseline | Current | |
| --------------------------------------- | ------------------------ | ------------------------ | ------- |
| CPU Usage (%) | 66 | 67 | +1.52% |
| Cores usage (%) | 1,852 | 1,883 | +1.67% |
| Working Set (MB) | 479 | 473 | -1.25% |
| Private Memory (MB) | 505 | 499 | -1.19% |
| Build Time (ms) | 6,445 | 4,194 | -34.93% |
| Start Time (ms) | 334 | 355 | +6.29% |
| Published Size (KB) | 115,430 | 115,430 | 0.00% |
| .NET Core SDK Version | 8.0.100-alpha.1.22514.14 | 8.0.100-alpha.1.22514.14 | |
| Max CPU Usage (%) | 68 | 68 | +0.68% |
| Max Working Set (MB) | 502 | 494 | -1.57% |
| Max GC Heap Size (MB) | 312 | 332 | +6.19% |
| Size of committed memory by the GC (MB) | 451 | 450 | -0.25% |
| Max Number of Gen 0 GCs / sec | 6.00 | 6.00 | 0.00% |
| Max Number of Gen 1 GCs / sec | 1.00 | 1.00 | 0.00% |
| Max Number of Gen 2 GCs / sec | 0.00 | 0.00 | |
| Max Time in GC (%) | 1.00 | 1.00 | 0.00% |
| Max Gen 0 Size (B) | 46,443,032 | 25,880,352 | -44.28% |
| Max Gen 1 Size (B) | 6,664,304 | 6,506,072 | -2.37% |
| Max Gen 2 Size (B) | 1,357,992 | 1,348,240 | -0.72% |
| Max LOH Size (B) | 113,184 | 113,184 | 0.00% |
| Max POH Size (B) | 839,320 | 822,840 | -1.96% |
| Max Allocation Rate (B/sec) | 1,858,940,728 | 1,778,665,328 | -4.32% |
| Max GC Heap Fragmentation | 92 | 91 | -1.51% |
| # of Assemblies Loaded | 88 | 88 | 0.00% |
| Max Exceptions (#/s) | 0 | 0 | |
| Max Lock Contention (#/s) | 1,390 | 1,306 | -6.04% |
| Max ThreadPool Threads Count | 48 | 48 | 0.00% |
| Max ThreadPool Queue Length | 138 | 133 | -3.62% |
| Max ThreadPool Items (#/s) | 1,190,411 | 1,173,617 | -1.41% |
| Max Active Timers | 0 | 0 | |
| IL Jitted (B) | 116,702 | 154,697 | +32.56% |
| Methods Jitted | 1,316 | 2,064 | +56.84% |
| load | baseline | Current | |
| ---------------------- | --------- | --------- | ------- |
| CPU Usage (%) | 42 | 42 | 0.00% |
| Cores usage (%) | 1,186 | 1,180 | -0.51% |
| Working Set (MB) | 49 | 49 | 0.00% |
| Private Memory (MB) | 376 | 376 | 0.00% |
| Start Time (ms) | 0 | 0 | |
| First Request (ms) | 240 | 239 | -0.42% |
| Requests/sec | 431,565 | 433,502 | +0.45% |
| Requests | 6,516,506 | 6,545,631 | +0.45% |
| Mean latency (ms) | 0.75 | 0.66 | -11.59% |
| Max latency (ms) | 23.46 | 32.78 | +39.73% |
| Bad responses | 0 | 0 | |
| Socket errors | 0 | 0 | |
| Read throughput (MB/s) | 68.32 | 68.63 | +0.45% |
| Latency 50th (ms) | 0.55 | 0.55 | +0.37% |
| Latency 75th (ms) | 0.73 | 0.73 | -0.82% |
| Latency 90th (ms) | 0.97 | 0.95 | -2.06% |
| Latency 99th (ms) | 8.44 | 2.81 | -66.71% |
/cc @davidfowl
Benchmarks against .NET 8
| application | baseline | Current | |
| --------------------------------------- | ------------------------ | ------------------------ | ------- |
| CPU Usage (%) | 65 | 72 | +10.77% |
| Cores usage (%) | 1,816 | 2,014 | +10.90% |
| Working Set (MB) | 469 | 472 | +0.64% |
| Private Memory (MB) | 495 | 498 | +0.61% |
| Build Time (ms) | 3,746 | 3,667 | -2.11% |
| Start Time (ms) | 382 | 360 | -5.76% |
| Published Size (KB) | 115,434 | 115,434 | 0.00% |
| .NET Core SDK Version | 8.0.100-alpha.1.22514.14 | 8.0.100-alpha.1.22514.14 | |
| Max CPU Usage (%) | 64 | 75 | +16.34% |
| Max Working Set (MB) | 489 | 494 | +1.01% |
| Max GC Heap Size (MB) | 333 | 334 | +0.29% |
| Size of committed memory by the GC (MB) | 441 | 448 | +1.60% |
| Max Number of Gen 0 GCs / sec | 6.00 | 6.00 | 0.00% |
| Max Number of Gen 1 GCs / sec | 1.00 | 1.00 | 0.00% |
| Max Number of Gen 2 GCs / sec | 0.00 | 0.00 | |
| Max Time in GC (%) | 1.00 | 0.00 | |
| Max Gen 0 Size (B) | 46,476,784 | 33,895,144 | -27.07% |
| Max Gen 1 Size (B) | 6,111,888 | 6,507,688 | +6.48% |
| Max Gen 2 Size (B) | 1,354,112 | 1,341,880 | -0.90% |
| Max LOH Size (B) | 113,184 | 113,184 | 0.00% |
| Max POH Size (B) | 835,200 | 826,960 | -0.99% |
| Max Allocation Rate (B/sec) | 1,837,387,536 | 1,743,068,656 | -5.13% |
| Max GC Heap Fragmentation | 94 | 91 | -3.98% |
| # of Assemblies Loaded | 88 | 88 | 0.00% |
| Max Exceptions (#/s) | 0 | 0 | |
| Max Lock Contention (#/s) | 1,357 | 1,285 | -5.31% |
| Max ThreadPool Threads Count | 48 | 48 | 0.00% |
| Max ThreadPool Queue Length | 137 | 134 | -2.19% |
| Max ThreadPool Items (#/s) | 1,154,303 | 1,122,875 | -2.72% |
| Max Active Timers | 0 | 0 | |
| IL Jitted (B) | 157,134 | 154,667 | -1.57% |
| Methods Jitted | 2,082 | 2,064 | -0.86% |
| load | baseline | Current | |
| ---------------------- | --------- | --------- | ------- |
| CPU Usage (%) | 42 | 42 | 0.00% |
| Cores usage (%) | 1,185 | 1,165 | -1.69% |
| Working Set (MB) | 49 | 49 | 0.00% |
| Private Memory (MB) | 376 | 376 | 0.00% |
| Start Time (ms) | 0 | 0 | |
| First Request (ms) | 235 | 237 | +0.85% |
| Requests/sec | 429,490 | 423,980 | -1.28% |
| Requests | 6,484,821 | 6,401,831 | -1.28% |
| Mean latency (ms) | 0.70 | 0.67 | -4.55% |
| Max latency (ms) | 22.15 | 37.66 | +70.02% |
| Bad responses | 0 | 0 | |
| Socket errors | 0 | 0 | |
| Read throughput (MB/s) | 67.99 | 67.12 | -1.28% |
| Latency 50th (ms) | 0.55 | 0.56 | +1.64% |
| Latency 75th (ms) | 0.73 | 0.75 | +1.63% |
| Latency 90th (ms) | 0.97 | 0.99 | +2.06% |
| Latency 99th (ms) | 5.50 | 2.83 | -48.55% |
LGTM!
@davidfowl @Tratcher @JamesNK Please take another look. The CI is all green.
So, it's a bit slower or a wash and the GC numbers look about the same. 🤔
Sharing offline comments I made based on collection of a nettrace for GCHandle usage.
I'm convinced on this change. The logging itself is a win. The number of GCHandle events is staggering. I did a quick experiment locally. Instead of using NativeMemory.Alloc and NativeMemory.Free, I replaced them with ArrayPool<byte>.Shared.Rent and a pinned GCHandle. The trace resulted in precisely the same experience before my larger changes. This is a win and the performance characteristics that we've improved aren't reflected in the crank base statistics. They are however reflected in nettrace.
The nettrace impact is prior to this change the nettrace was 1GB, but after this change it is approximately 40MB. The win here is reduction of the impact of GCHandle events.
@sebastienros I want to assign this to you. Once we have benchmarks running on .NET 8 we should merge this.
@davidfowl we are tracking net8.0 now (check the charts) so I assume that's what you wanted before we merge the PR.
Yep
The nettrace impact is prior to this change the nettrace was 1GB, but after this change it is approximately 40MB. The win here is reduction of the impact of
GCHandleevents.
Could you share what nettrace args are needed to show this? I'm thinking ahead to next year when we write a ASP.NET Core 8 perf improvements blog post and it's useful to have info on the PR so we don't need to figure out how to repro things a year later.
@BrennanConroy This is easy to see by adding the following arguments to the crank command: --application.dotnetTrace true --application.dotnetTraceProviders gc-verbose. Running scenarios where a lot of GCHandles instance are created/destroyed will saturate the pipeline and result in dropped events or at best, very large nettrace files.
Hi @AaronRobinsonMSFT. It looks like you just commented on a closed PR. The team will most probably miss it. If you'd like to bring something important up to their attention, consider filing a new issue and add enough details to build context.